1 Introduction

Deep neural networks (DNNs) have become a significant paradigm of artificial intelligence in many fields such as industrial vision [1, 2], natural language processing [3, 4], medical assistance [5], autonomous driving [6], etc. In this paradigm, training deep models is an arduous and important task because neural network models are extremely complex. To address this problem, deep model training is generally transformed into an optimization problem that minimizes the value of a loss function. For this reason, optimization algorithms are necessary when training deep models. Moreover, because they are easy to implement, many algorithms based on the stochastic gradient descent method have received wide attention and application [7, 8].

The stochastic gradient descent (SGD) algorithm is a classical and simple optimization algorithm for deep neural network training. SGD is easily applied to various deep training tasks because of its simple algorithmic logic. Importantly, SGD is superior to other optimization algorithms in terms of generalization ability in many applications. However, the convergence rate of SGD is unsatisfactory, so a large number of labeled samples are required. To speed up the convergence of SGD, many fast optimization algorithms with adaptive stepsizes have been proposed, for instance, Adam [9], AMSGrad [10], AdamW [11], etc. These adaptive algorithms not only design an adaptive stepsize, but also fully exploit historical gradient information to improve the convergence speed. Therefore, adaptive algorithms have received extensive attention in recent years.

Although the above algorithms have been applied in some scenarios, they are all restricted to convex conditions. However, in many practical scenarios the convex condition is not easy to satisfy, which greatly degrades the performance of the above convex optimization algorithms or even prevents them from converging. Consequently, research on non-convex optimization algorithms is crucial to the wide application of DNNs. To tackle the non-convex problem, many non-convex optimization algorithms have been proposed in the last few decades, such as [12,13,14]. However, these methods all use a fixed stepsize, converge slowly, and exhibit poor performance on non-convex problems. For this reason, many researchers have further studied non-convex issues for Adam with adaptive stepsizes, for example, non-convex optimization for Adam [15], non-convex stochastic optimization for Adam-type algorithms [16], and Yogi [17]. However, whether the AdamW optimization algorithm converges under non-convex conditions is still an open problem.

AdamW improves on Adam's generalization ability by decoupling the weight decay from the adaptive stepsize. However, AdamW is a typical convex optimization algorithm and may fail to converge in the non-convex case. Therefore, it is urgent to propose an optimization method that ensures AdamW-style updates still converge in the non-convex case. In this paper, we propose a non-convex optimization algorithm, named DAda-NC, which uses a decoupled method to recover the weight decay regularization of AdamW. Importantly, DAda-NC also combines the \(sign(\cdot )\) function with its stepsize to ensure that it converges even under non-convex conditions. Our main contributions are summarized below:

  • We propose a decoupled adaptive training algorithm for deep learning under non-convex conditions, named DAda-NC.

  • We solve the non-convergence problem of adaptive algorithms in the non-convex case by exploiting the \(sign(\cdot )\) function.

  • We further improve the generalization ability of adaptive algorithms by exploiting a decoupled weight decay factor.

  • We present extensive experiments on both image classification and language processing tasks on different public datasets.

The rest of this article is structured as follows: Sect. 2 presents the notation and preliminaries of this work. The algorithm design of DAda-NC is provided in Sect. 3, and the experimental results are shown in Sect. 4. We conclude this work in Sect. 5 and discuss future work in Sect. 6. Finally, we state our acknowledgements in Sect. 7.

2 Notation and Preliminaries

2.1 Notation

In this paper, we use bold letters to denote vectors, such as \(\mathbf {x}\) and \(\mathbf {y}\). For all \(\mathbf {x}, \mathbf {y} \in \mathbb {R}^d\), \(\mathbf {x}^2\) denotes the element-wise square, \(\sqrt{\mathbf {x}}\) represents the element-wise square root, and \(\frac{\mathbf {x}}{\mathbf {y}}\) denotes the element-wise division. Moreover, \(\mathbf {x}_t\) represents the value of \(\mathbf {x}\) at the t-th time. In addition, \(x_{t,i}\) denotes the i-th coordinate of the vector \(\mathbf {x}_t\).

2.2 Online Non-convex Optimization

In this paper, we consider an online learning problem. In this problem, the decision vector \(\mathbf {x}_t\) and the loss function \(f_t(\cdot )\) change from time t to time \(t+1\). Specifically, the optimizer generates a decision \(\mathbf {x}_t\) at the t-th time, and the deep model outputs a result following this decision. Then, the adversary (i.e., the loss function) returns a loss to the optimizer based on the output result. Finally, the optimizer uses its strategy to update the decision to \(\mathbf {x}_{t+1}\) for the next time. Consequently, the optimization goal of online learning can be formulated as follows:

$$\begin{aligned} \min _{\mathbf {x}_t\in \mathbb {R}^d} R(T) = \frac{1}{T}\sum _{t=1}^T\left( f_t(\mathbf {x}_t)- f_t(\mathbf {x}^*) \right) , \end{aligned}$$
(1)

where T is the time horizon, \(f_t(\cdot )\) is the loss function at time t, R(T) denotes the average regret over the T rounds, \(\mathbf {x}_t\) is the decision vector at time t, and \(\mathbf {x}^*\) represents the global optimal solution.
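To make this protocol concrete, the following is a minimal sketch of the online learning loop behind Eq. (1), assuming hypothetical `model`, `loss_fn`, `optimizer`, and `data_stream` objects; the per-round optimal losses \(f_t(\mathbf {x}^*)\) are not available in practice, so they are passed in here only to illustrate how the regret is defined.

```python
import torch

def online_learning(model, loss_fn, optimizer, data_stream, opt_losses, T):
    """Run T online rounds and return the average regret of Eq. (1).

    opt_losses[t] stands for f_t(x*), which is unknown in practice and is
    used here only to illustrate the definition of the regret.
    """
    regret = 0.0
    for t in range(T):
        sample, target = data_stream[t]        # adversary reveals the sample s_t
        optimizer.zero_grad()
        loss = loss_fn(model(sample), target)  # f_t(x_t)
        loss.backward()
        optimizer.step()                       # optimizer produces x_{t+1}
        regret += loss.item() - opt_losses[t]
    return regret / T
```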

In our work, we focus on an online learning problem under non-convex conditions. For this reason, we next introduce the non-convex optimization problem. Specifically, we give the following non-convex form to describe the problem intuitively:

$$\begin{aligned} f_t(\mathbf {x}_t) := \mathbb {E}_{\mathbf {s}_t\sim \mathbb {P}}\left[ \ell _t(\mathbf {x}_t,\mathbf {s}_t) \right] , \end{aligned}$$
(2)

where the time \(t\in \{1,\ldots ,T\}\), \(\mathbf {x}_t\in \mathbb {R}^d\), \(\ell \) is a smooth loss function which is possibly non-convex, the vector \(\mathbf {x}_t\) collects the model parameters of the DNN at time t, \(\mathbf {s}_t\) is the labeled sample used for training at time t, and \(\mathbb {P}\) is an unknown data distribution.

2.3 Adam

In this section, we review the design of Adam, which is shown in Algorithm 1. According to its original design, Adam is intended to solve optimization problems under convex conditions. Therefore, the loss functions \(f_t\) and the feasible region for Adam are both assumed to be convex.

Algorithm 1. Adam.

As shown in Algorithm 1, the first-order moment of Adam is designed as an exponential moving average of the gradient, which is formed as follows:

$$\begin{aligned} \mathbf {m}_t = \beta _{1}\mathbf {m}_{t-1} + (1-\beta _{1})\mathbf {g}_t. \end{aligned}$$
(3)

The first-order momentum in adaptive optimization algorithms accounts for the influence of historical gradient information on the current gradient, analogous to the effect of inertia on object motion in physics. An important purpose of the first-order momentum is to accelerate the iterations of the optimization algorithm, and the second-order momentum plays a similar role. In particular, the second-order momentum is an exponential moving average of the square of the gradient, and its form is shown below:

$$\begin{aligned} \mathbf {v}_t = \mathbf {v}_{t-1} - (1-\beta _{2})(\mathbf {v}_{t-1} - \mathbf {g}_t^2). \end{aligned}$$
(4)

Importantly, the second-order momentum makes the step size of the optimization algorithm adaptive, thereby avoiding unnecessary oscillation of the algorithm near the optimal point. Finally, Adam updates the decision vector based on the two momentums and the learning rate \(\alpha _t\), as shown below:

$$\begin{aligned} \mathbf {x}_{t+1} = \mathbf {x}_{t} - \alpha _t\mathbf {m}_t / (\sqrt{\mathbf {v}_t}+\epsilon ). \end{aligned}$$
(5)
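Putting Eqs. (3)-(5) together, a minimal NumPy sketch of one Adam iteration is given below; the bias-correction steps of the original Adam paper [9] are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def adam_step(x, m, v, grad, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of the decision vector x (bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * grad       # Eq. (3): first-order momentum
    v = v - (1 - beta2) * (v - grad ** 2)    # Eq. (4): second-order momentum
    x = x - alpha * m / (np.sqrt(v) + eps)   # Eq. (5): adaptive update
    return x, m, v
```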

Although Adam has received a lot of attention and many applications, its applicable condition is that the loss function is convex. Therefore, in the non-convex case, the performance of Adam is unsatisfactory. For this reason, we propose a novel adaptive momentum optimization algorithm that is specifically designed for non-convex situations.

3 Algorithm Design of DAda-NC

In this section, we first present some definitions and assumptions that guarantee the usability of the proposed algorithm. We then introduce the design details of our proposed algorithm.

Definition 1

A function \(\ell \) is L-smooth if, for all \(\mathbf {x}, \mathbf {y}\in \mathbb {R}^d\), it satisfies the following condition:

$$\begin{aligned} \Vert \nabla \ell (\mathbf {x}) - \nabla \ell (\mathbf {y})\Vert \le L\Vert \mathbf {x} - \mathbf {y}\Vert . \end{aligned}$$
(6)

Definition 2

For a function \(\ell \), the stochastic first-order (SFO) complexity of an algorithm is the number of gradient evaluations of \(\ell \) with respect to its first argument performed by the algorithm.

Definition 3

The SFO complexity of SGD to obtain a \(\delta \)-accurate solution is \(O(1/\delta ^2)\).

The definitions in this paper are common in many previous similar works, such as [17, 18].

Assumption 1

This work assumes that the gradient of the function \(\ell \) is bounded, i.e., for \(\forall \mathbf {x}\in \mathbb {R}^d\), we have \(\Vert \nabla \ell (\mathbf {x})\Vert \le G\).

Assumption 2

This work assumes that the variance of the stochastic gradients is bounded, i.e., for \(\forall \mathbf {x}\in \mathbb {R}^d\), we have \(\mathbb {E}\Vert \nabla \ell (\mathbf {x}) - \nabla f(\mathbf {x})\Vert ^2 \le \sigma ^2\).

The assumptions in our work are typical of many similar works on first-order-momentum-based optimization algorithms, such as [19, 20].

Algorithm 2. DAda-NC.

Next, we introduce our proposed algorithm, DAda-NC, in detail. In the first step of DAda-NC, a batch of samples is stochastically chosen from the training dataset and the corresponding stochastic gradient is returned. In order to eliminate the excessive influence of the second-order weight term, we design a decoupled weight decay strategy. Therefore, a weight decay term is first added to the gradient as follows:

$$\begin{aligned} \mathbf {g}_t = \nabla f_t(\mathbf {x}_{t-1}) + \lambda \mathbf {x}_{t-1}. \end{aligned}$$
(7)

Moreover, DAda-NC uses the same update strategy as Adam for the first-order momentum, which is shown below:

$$\begin{aligned} \mathbf {m}_t = \beta _{1}\mathbf {m}_{t-1} + (1-\beta _{1})\mathbf {g}_t. \end{aligned}$$
(8)

Importantly, DAda-NC applies the \(sign(\cdot )\) function to the second-order momentum, which allows it to handle the non-convex optimization problem. The form of the second-order momentum in DAda-NC is shown as follows:

$$\begin{aligned} \mathbf {v}_t = \mathbf {v}_{t-1} - (1-\beta _{2})sign(\mathbf {v}_{t-1}-\mathbf {g}_t^2)\mathbf {g}_t^2. \end{aligned}$$
(9)

Finally, the update rule of the decision vector with decoupled weight decay is shown below:

$$\begin{aligned} \mathbf {x}_{t+1} = \mathbf {x}_{t} - \alpha _t\Big (\mathbf {m}_t / (\sqrt{\mathbf {v}_t}+\epsilon ) + \lambda \mathbf {x}_{t-1}\Big ). \end{aligned}$$
(10)
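Putting Eqs. (7)-(10) together, one DAda-NC iteration can be sketched as follows; this is a minimal NumPy illustration that follows the update rules exactly as written above, with all variable names being illustrative.

```python
import numpy as np

def dada_nc_step(x, x_prev, m, v, grad_f, alpha, lam=0.125e-3,
                 beta1=0.9, beta2=0.999, eps=1e-3):
    """One DAda-NC update; grad_f is the stochastic gradient of f_t at x_prev."""
    g = grad_f + lam * x_prev                                    # Eq. (7): weight decay added to the gradient
    m = beta1 * m + (1 - beta1) * g                              # Eq. (8): first-order momentum
    v = v - (1 - beta2) * np.sign(v - g ** 2) * g ** 2           # Eq. (9): sign-based second-order momentum
    x_new = x - alpha * (m / (np.sqrt(v) + eps) + lam * x_prev)  # Eq. (10): decoupled weight decay update
    return x_new, m, v
```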

This completes the description of the proposed algorithm. In order to demonstrate the performance of our proposed algorithm in practice, we next apply it to various deep learning tasks.

4 Experiments

In this section, we evaluate the performance of DAda-NC on different datasets and learning rate settings under non-convex conditions. In our experiments, we conduct two groups of deep learning tasks, i.e., image classification and language processing. Moreover, we use a single 1080Ti GPU and PyTorch with Python 3.6 for all model training tasks.

4.1 Datasets and Parameter Settings

In our experiments, we use different public datasets for the image classification task and the language processing task, respectively. A summary of the datasets used in our experiments is shown in Table 1. In the image classification tasks, we use the well-known public dataset CIFAR-10, an image dataset of 60,000 \(32\times 32\) images in 10 classes. Furthermore, we employ three classical architectures, i.e., VGG-11 [21], ResNet-34 [22] and DenseNet-121 [23], for this dataset. For the language processing tasks, we use a famous language dataset, Penn Treebank, with 1-, 2- and 3-layer LSTM architectures, respectively. The Penn Treebank is a dataset obtained by annotating a corpus, and its annotations include part-of-speech tags and syntactic parses. The corpus of the Penn Treebank dataset comes from 1989 Wall Street Journal articles, and it contains about 1M words extracted from 2,499 articles.

Table 1. Datasets and architectures used in our experiments.
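As an illustration of this setup, the sketch below shows how the CIFAR-10 data and one of the architectures could be prepared with torchvision; the batch size and normalization constants are illustrative assumptions rather than the exact settings of our training script.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative CIFAR-10 pipeline; the normalization constants are commonly
# used per-channel statistics, not necessarily those used in this work.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# ResNet-34 with a 10-way output layer for the CIFAR-10 classes.
model = torchvision.models.resnet34(num_classes=10)
```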

We compare our proposed algorithm with the classical adaptive optimization algorithm Adam and the recent adaptive non-convex optimization algorithm Yogi. We set up all the algorithms executed in our experiments as follows (a code sketch of these configurations is given after the list):

  • Adam [9] is run with stepsize \(\alpha = 0.001\), first-order momentum coefficient \(\beta _1 = 0.9\), second-order momentum coefficient \(\beta _2 = 0.999\), and regularization term \(\epsilon = 10^{-8}\). Moreover, the momentums are initialized as \(\mathbf {m}_0 = 0\) and \(\mathbf {v}_0 = 0\).

  • Yogi [17] is run with stepsize \(\alpha = 0.01\), first-order momentum coefficient \(\beta _1 = 0.9\), second-order momentum coefficient \(\beta _2 = 0.999\), and regularization term \(\epsilon = 10^{-3}\). Moreover, the momentums are initialized as \(\mathbf {m}_0 = 0\) and \(\mathbf {v}_0 = 0\).

  • DAda-NC (Ours) is run with stepsize \(\alpha = 0.5\), decoupled coefficient \(\lambda = 0.125\times 10^{-3}\), first-order momentum coefficient \(\beta _1 = 0.9\), second-order momentum coefficient \(\beta _2 = 0.999\), and regularization term \(\epsilon = 10^{-3}\). Moreover, the momentums are initialized as \(\mathbf {m}_0 = 0\) and \(\mathbf {v}_0 = 0\).
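The settings above could be instantiated as follows. Note that only Adam ships with PyTorch; `Yogi` and `DAdaNC` are hypothetical custom optimizer classes assumed to implement [17] and Algorithm 2, respectively.

```python
import torch
import torchvision

model = torchvision.models.resnet34(num_classes=10)  # any of the evaluated models

# Adam is provided by PyTorch; Yogi and DAdaNC are hypothetical
# torch.optim.Optimizer subclasses implementing [17] and Algorithm 2.
adam = torch.optim.Adam(model.parameters(), lr=1e-3,
                        betas=(0.9, 0.999), eps=1e-8)
# yogi = Yogi(model.parameters(), lr=1e-2, betas=(0.9, 0.999), eps=1e-3)
# dada_nc = DAdaNC(model.parameters(), lr=0.5, weight_decay=0.125e-3,
#                  betas=(0.9, 0.999), eps=1e-3)
```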

Fig. 1. Comparison of each executed algorithm with respect to loss vs. epochs on CIFAR-10.

4.2 Image Classification

In this group of experiments, we conduct the image classification task, a standard and well-studied problem in machine learning, on the public dataset CIFAR-10. Furthermore, three different neural network architectures, VGG-11 (with 11 layers), ResNet-34 (with 34 layers) and DenseNet-121 (with 121 layers), are used in the experiments on CIFAR-10. Next, we present the results and analysis of our experiments.

Fig. 2. Comparison of each executed algorithm with respect to training accuracy vs. epochs on CIFAR-10.

Fig. 3. Comparison of each executed algorithm with respect to test accuracy vs. epochs on CIFAR-10.

The experimental results on the loss with respect to the epochs are shown in Fig. 1. From this figure, we can see that the loss of our proposed algorithm reaches the lowest value among all executed algorithms by the end of the experiment. In fact, the purpose of model training is to minimize the value of the loss, so the final loss value can reflect the quality of the model training results. In other words, the stable loss value reached by the optimization algorithm during training potentially reflects the generalization ability of the algorithm. Therefore, Fig. 1 shows that our proposed algorithm outperforms Adam and Yogi under non-convex conditions.

Table 2. The top-1 training accuracy of the compared algorithms on the three models.

Moreover, the experimental results on the training accuracy are shown in Fig. 2. From this figure, we observe that our proposed algorithm attains a higher training accuracy than Adam and Yogi. The training accuracy directly reflects the optimization ability of the algorithms used in the training process. Therefore, Fig. 2 demonstrates that our proposed algorithm performs better than Adam and Yogi in terms of training accuracy for image classification tasks. Furthermore, Table 2 summarizes the top-1 training accuracy of each compared algorithm under non-convex conditions. The results in this table further demonstrate that our proposed algorithm performs the best in terms of training accuracy.

Table 3. The top-1 test accuracy of the compared algorithms on the three models.

In addition, another important indicator of the performance of an optimization algorithm is the test accuracy after model training. For this reason, we also conduct a set of experiments on the test accuracy for the image classification tasks; the experimental results are shown in Fig. 3. Note that the training samples are used repeatedly, while the test samples are held out, so the training accuracy is usually slightly higher than the test accuracy. Nonetheless, Fig. 3 reveals that our proposed algorithm obtains a higher test accuracy than Adam and Yogi after model training. Moreover, we compare the top-1 test accuracy of all the executed algorithms in this group of experiments in Table 3. The results also verify that our proposed algorithm outperforms Adam and Yogi in terms of test accuracy.

4.3 Language Processing

Fig. 4. Comparison of each executed algorithm with respect to perplexity vs. epochs on Penn Treebank. The lower the perplexity, the better.

Language processing is one of the important applications of artificial intelligence. Therefore, we also conduct another group of experiments on language processing. In this group of experiments, we use the classic 1-, 2- and 3-layer LSTM models on the Penn Treebank dataset, respectively. The total numbers of model parameters of the 1-, 2- and 3-layer LSTMs are 5,293,200, 13,632,400 and 24,221,600, respectively. It is well known that perplexity is an important evaluation indicator for language models; therefore, we also use perplexity to evaluate the performance of the executed optimization algorithms. Specifically, if it is known that the sentence \((w_1, w_2, \ldots , w_m)\) appears in the corpus, then the higher the probability the language model assigns to the sentence, the better the language model predicts the corpus. For this reason, perplexity is commonly used to characterize the predictive ability of language models. The perplexity P of a sentence S under a language model can be formulated as follows:

$$\begin{aligned} P(S)&= p(w_1, w_2, \ldots , w_m)^{-\frac{1}{m}} \nonumber \\&= \left[ \prod _{i=1}^m \frac{1}{p(w_i | w_1, w_2, \ldots , w_{i-1})}\right] ^{\frac{1}{m}}, \end{aligned}$$
(11)

where \(p(w_i\,|\,w_1, \ldots , w_{i-1})\) denotes the conditional probability assigned by the language model.
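As a small illustration of Eq. (11), the snippet below computes the perplexity of a sentence from its per-token conditional probabilities; the probabilities are assumed to come from an already trained language model, and the example values are purely illustrative.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sentence given p(w_i | w_1, ..., w_{i-1}) for each
    token, following Eq. (11)."""
    m = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / m)

# Example with illustrative probabilities for a 4-token sentence.
print(perplexity([0.2, 0.1, 0.5, 0.25]))  # ~4.47
```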

Table 4. The results of the test process on the LSTM models.

The results of this group of experiments are shown in Fig. 4. This figure directly shows that our proposed algorithm attains the lowest perplexity on all of the 1-, 2- and 3-layer LSTM models. Note that the lower the perplexity value, the better the model performance. Both Adam and Yogi perform worse than our proposed algorithm. Therefore, our proposed algorithm also performs well in the training of language processing models under non-convex conditions. In addition, we further show the results of the test process in Table 4. We see that our proposed algorithm attains the lowest loss, perplexity and bits-per-character among all the compared algorithms in the test process.

5 Conclusion

In this paper, we focus on the convergence and generalization ability of adaptive optimization algorithms under non-convex conditions. In our proposed algorithm, we first use a decoupled weight decay method to further improve the generalization ability. We then redesign the second-order momentum of adaptive algorithms with the \(sign(\cdot )\) function, which ensures that our proposed algorithm converges under non-convex conditions. Finally, we conduct extensive experiments on both image classification and language processing tasks. All the experimental results demonstrate that our proposed algorithm outperforms Adam and Yogi on standard public datasets in non-convex cases.

6 Future Work

Nonetheless, it is not known whether our proposed algorithm retains such good performance beyond the non-convex case, for example under convex or strongly convex conditions. Therefore, we will explore these issues in our future work. Furthermore, the theoretical proof of convergence of our proposed algorithm is also left for future work.