1 Introduction

The learning efficiency of deep networks [1, 2, 3] has allowed them to develop rapidly, and they have been used successfully in various applications, such as machine vision [4], machine translation [5], sound recognition [6], charge prediction in lithium-ion batteries [7, 8] and many more. The deep network-based approach (FFLSTM) in [8] establishes the groundwork for long-term state prediction of lithium-ion batteries, improving energy management and safety. The extension of deep networks to such diverse applications demonstrates their robustness. In addition to this practical success, theoretical findings support their general competence [9, 10]. However, training deep learning models for nonconvex problems is strenuous because finding a global minimum is intractable. In the early 1990s, the authors of [11] showed that training a multilayer perceptron (MLP) is NP-hard, a result that significantly contributed to the shrinkage of the field. While recent practical successes have revived the area, we still do not know the best minima that can be reached for deep models when trained via different optimizers.

Despite being proposed in the last century, the stochastic gradient descent (SGD) algorithm remains one of the most effective algorithms for training deep neural networks. Its simplicity and efficiency make it suitable for almost all applications. However, SGD has the disadvantage of scaling the gradient uniformly in all directions, which can result in poor performance and slow training when the data are sparse. Several variants of SGD have been proposed [12, 13, 14] to speed up training and improve performance. These variants perform well when the momentum parameter is large. However, a large momentum induces staleness [15], implying a preference for past gradients over the current gradient, and it therefore comes at the cost of reduced generalization capacity. The authors of [16] showed the instability of Nesterov's accelerated method with respect to initialization and an exponentially fast deterioration in performance with the number of gradient steps. SGD's dependence on the learning rate and the need for proper initialization in ill-conditioned problems necessitate a paradigm shift towards adaptive algorithms with provable guarantees.

A number of adaptive gradient algorithms have been proposed to address these problems. These algorithms typically alter the learning rate for each gradient coordinate based on the current geometry (curvature) of the objective function. For example, Adagrad [17] computes the adaptive learning rate by dividing the gradient by a denominator equal to the root mean square of the previous gradients. Sparse gradients were reported to lead to quicker convergence of Adagrad compared with vanilla SGD [17]. However, Adagrad has a limited ability to generalize to unseen data [18]. In the presence of sizeable dense gradients, the accumulated mean in the denominator makes the update small, leading to the failure of Adagrad.

Some adaptive algorithms have been proposed to address this issue by replacing the denominator with the square root of the exponential moving average of past squared gradients, e.g., RMSProp [19], AdaDelta [20], and ADAM [21]. ADAM has become the most popular among these variants due to its faster convergence and computational efficiency. Regardless of its faster convergence, recent theoretical studies show its instability and weaker convergence guarantees [18, 22]. Experiments conducted in [23] show that when the decay rate of the second-moment accumulator is too slow, larger-than-desired updates might be produced. The generation of extreme learning rates during training is reported in [24]. The parameter 𝜖 is introduced in the denominator of the update rule of ADAM [21] to prevent the denominator from becoming zero; its influence on the performance of ADAM was later examined in [25].

To address the challenges of ADAM and SGD, we introduce an Adaptive Gradient method with a Weighted Sine-based step size (WSAGrad). Sine- and cosine-based updates are generally used in derivative-free algorithms [26, 27, 28]; to the best of our knowledge, this is the first work to use a sine function to compute the step size of a gradient-based method. The method captures the geometric properties of the data through the weight parameters. The intuition for using the weight vector is linked to the idea that a deep network initialized with random Gaussian weights preserves the metric structure of the data, as shown in [29]. These structures propagate along the layers, enabling robust recovery of the original data from the features computed by the network. In other words, a trained deep network preserves the angles between intraclass and interclass points, which implies that every layer preserves the important information about the data [29]. Hence, using the weight vector to compute the step size of WSAGrad incorporates this information, which is further tuned with each iteration. WSAGrad shows a significant improvement in performance over other state-of-the-art techniques on different deep learning tasks. A theoretical guarantee is provided under some mild assumptions. The main highlights of this paper are as follows:

  • The paper proposes a novel adaptive step-size-based gradient descent algorithm that inherits important capabilities of both approaches, namely the adaptability of ADAM and the generalization ability of SGD.

  • The step size is constructed using a trigonometric function over the exponential moving average of the norm of the weight parameters. It is designed to mitigate the vanishing gradient problem: instead of stopping the learning task, it minutely increases the objective value so that subsequent iterations can reach a better minimum. Additionally, two controlling parameters are used to nullify the scope of divergence and ensure stability inside a certain region. To the best of our knowledge, this is the first work to use a network's weight parameters in the step size.

  • The step size is computationally efficient due to the small memory requirement per iteration with only one extra set of auxiliary variables. Furthermore, it reduces human intervention and hyperparameter tuning, making it more robust to different optimization problems.

  • Under minimal assumptions of smoothness and a bounded level set, we prove the convergence of our algorithm.

  • We validate the performance of our algorithm through extensive experiments. We compare our algorithm against the baselines on the task of image classification on the real-world datasets FMNIST, CIFAR-10 and CIFAR-100. The experiments span basic to advanced architectures with different batch sizes and various activation types.

It is worth noting that our algorithm's performance continues to improve significantly with the number of iterations where most existing methods saturate. We also show the stability of our algorithm in all sets of experiments. For each set of experiments, we report the best results, the mean and the standard deviation of our algorithm's performance. The remainder of this paper is organized as follows: we discuss the literature in Section 2. Section 3 outlines the problem definition in detail, followed by the proposed methodology and a theoretical guarantee under mild assumptions. Next, we describe the datasets, experimental setup, and baselines in Section 4, followed by the results and analysis under the same heading. Lastly, Section 5 concludes the paper with some directions for future work.

2 Background and notation

2.1 Background

Training a deep neural network (DNN) for a large-scale machine learning problem is a complex task. Recently, adaptive gradient methods, a class of optimization tools, have drawn the community's attention due to their faster convergence on such tasks. The first popular algorithm in this line of research is Adagrad [17, 30], which achieves significantly better performance than vanilla SGD when gradients are sparse or generally small. The update equation of Adagrad is as follows:

$$ x_{t+1}=x_{t}-\eta_{t}\frac{g_{t}}{\sqrt{v_{t}}}, $$
(1)

where gt = ∇f(xt;ξt), \(v_{t}= \frac{1}{t}\sum_{j=1}^{t} g_{j}^{2}\), and \(\eta_{t} = \frac{\eta}{\sqrt{t}}\), where η > 0 is the step size. The above update adapts the learning rate for each dimension. Although Adagrad works well in sparse settings, its performance has been observed to degrade when the loss function is nonconvex and gradients are dense. Because the update uses all previous gradients, the learning rate declines quickly in these settings, which makes the method unattractive. This issue is compounded even more in high-dimensional deep learning problems.
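For concreteness, a minimal NumPy sketch of the Adagrad update (1) is shown below; note that with \(v_{t}\) defined as the mean of squared gradients and \(\eta_{t}=\eta/\sqrt{t}\), the update is equivalent to dividing by the root of the running sum of squared gradients with a constant η, which is what the code does (the small eps added for numerical stability is our own choice).

```python
import numpy as np

def adagrad_step(x, grad, accum, eta=0.01, eps=1e-8):
    """One Adagrad update: each coordinate is divided by the root of the
    accumulated squared gradients, equivalent to (1)."""
    accum = accum + grad ** 2                         # running sum of squared gradients
    x_new = x - eta * grad / (np.sqrt(accum) + eps)
    return x_new, accum

# usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x
x = np.array([1.0, -2.0])
accum = np.zeros_like(x)
for _ in range(100):
    x, accum = adagrad_step(x, grad=x, accum=accum)
```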

To mitigate this problem, several variants of Adagrad, such as RMSProp [19], ADAM [21], AdaDelta [20], and NADAM [31], have been proposed to handle the rapid decay of the learning rate by using exponential moving averages of past squared gradients. Essentially, these strategies limit the update to only the past few gradients. Of these variants, ADAM is the default method of choice for training DNNs. Its update equations adopt the following form:

$$ \begin{array}{@{}rcl@{}} m_{t}&=& \alpha_{1}m_{t-1}+(1-\alpha_{1})\nabla f(x_{t};\xi_{t}), \quad \hat{m}_{t}= \frac{m_{t}}{1-\alpha_{1}^{t}},\\ v_{t}&=& \alpha_{2}v_{t-1}+(1- \alpha_{2})(\nabla f(x_{t};\xi_{t}))^{2}, \quad \hat{v}_{t}= \frac{v_{t}}{1-\alpha_{2}^{t}}, \end{array} $$
(2)
$$ x_{t+1}= x_{t}-\frac{\eta_{t}\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon}, \quad \forall t \geq 1, $$
(3)

where α1, α2 ∈ (0,1), 𝜖 > 0 and \(\eta_{t}= \frac{\eta}{\sqrt{t}}\) with η > 0. However, to our knowledge, ADAM still has no complete convergence proof: the proof in the original paper [21] was shown to be flawed in [32, 33]. A counterexample was introduced in [18], together with an improved form of (3) named AMSGrad. It follows:

$$ \hat{v}_{t}= \max(\hat{v}_{t-1},\hat{v}_{t}); \quad x_{t+1}= x_{t}-\frac{\eta_{t}\hat{m}_{t}}{\sqrt{\hat{v}_{t}}}. $$
(4)
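Similarly, a minimal NumPy sketch of the ADAM update (2)-(3), with the AMSGrad correction (4) as an option, is given below; the function name and default constants follow common practice and are not taken verbatim from the cited papers.

```python
import numpy as np

def adam_step(x, grad, m, v, v_max, t, eta=0.001,
              alpha1=0.9, alpha2=0.999, eps=1e-8, amsgrad=False):
    """One ADAM step as in (2)-(3); t starts at 1. With amsgrad=True the
    running maximum of the second moment is used, as in (4)."""
    m = alpha1 * m + (1 - alpha1) * grad              # first moment
    v = alpha2 * v + (1 - alpha2) * grad ** 2         # second moment
    m_hat = m / (1 - alpha1 ** t)                     # bias correction
    v_hat = v / (1 - alpha2 ** t)
    if amsgrad:
        v_max = np.maximum(v_max, v_hat)              # keeps the effective step size non-increasing
        denom = np.sqrt(v_max) + eps
    else:
        denom = np.sqrt(v_hat) + eps
    x_new = x - eta * m_hat / denom
    return x_new, m, v, v_max
```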

However, in experiments, AMSGrad does not show improvements [34, 35]; on the contrary, it sometimes results in worse accuracy than ADAM. For ADAM-type optimizers, convergence is studied in [35] for nonconvex problems, and the authors reported that a minimum could be achieved with no guarantee of staying there. Another work [36] used a dynamical-system viewpoint to interpret the ADAM optimizer, via a non-autonomous ordinary differential equation without the Lipschitz condition. Subsequently, to improve the generalization performance of ADAM, another variant with a convergence guarantee, namely AdamW, was introduced [37]. It is defined as follows:

$$ \begin{array}{@{}rcl@{}} g_{t}&=& \nabla f(x_{t};\xi_{t})+\lambda x_{t}, \quad m_{t}= \alpha_{1}m_{t-1}+(1-\alpha_{1})g_{t},\\ \hat{m}_{t}&=& \frac{m_{t}}{1-\alpha_{1}^{t}}, \quad v_{t}= \alpha_{2}v_{t-1}+(1- \alpha_{2})g_{t}^{2}, \quad \hat{v}_{t}= \frac{v_{t}}{1-\alpha_{2}^{t}}, \end{array} $$
(5)
$$ x_{t+1}=x_{t}-\eta_{t}\left(\alpha\hat{m}_{t}/\left(\sqrt{\hat{v}_{t}}+\epsilon\right)+\lambda x_{t}\right), $$
(6)

where α1, α2 ∈ (0,1), α > 0, λ > 0 and 𝜖 > 0. In [38], under a convex setting, the authors proposed stabilizing the coordinatewise weighting factors to ensure convergence. PADAM [39] was introduced with a partial adaptive parameter that changes the coordinatewise weighting factor for better generalization. The convergence rates of the original ADAM and RMSProp under the full-batch (deterministic) setting are provided in [40, 41]. AdaBelief [42] has been proposed to obtain good generalization by adapting the step size according to the 'belief' in the current gradient direction. All of its equations are identical to those of ADAM except for vt, which is defined as follows:

$$ v_{t}=\alpha_{2}v_{t-1}+(1-\alpha_{2})(\nabla f(x_{t};\xi_{t})-m_{t})^{2}+\epsilon. $$
(7)

More recently, some accelerated adaptive gradient methods [43, 44] have been proposed based on variance-reduction techniques. In particular, the authors of [44] proposed SUPERADAM, a faster and universal adaptive gradient framework based on a universal adaptive matrix. Recently, an adaptive variant of Nesterov accelerated gradient descent was introduced in [45], which replaces the constant momentum with an adaptive momentum. To ensure stability, the momentum is reset to zero according to a scheduler. In Table 1, we list some of the most recent algorithms and their convergence rates.

Table 1 Convergence rates of different algorithms: here, T denotes the number of iterations, and b denotes the mini-batch size

This paper attempts to design an algorithm that works well with a minimal set of resources and models. Rather than focusing on the moving average of the gradient, as most of the previous algorithms do, our work relies on the weight parameters. The idea is that with each iteration, we obtain weights that are better tuned than the previous weights. An adjusted weight with a step size as defined in [46] works towards decreasing the objective function with better smoothness. Next, we present the assumptions used for the theoretical analysis; in the following section, we define the problem and introduce the proposed algorithm and its convergence proof.

2.2 Assumptions and notations

In general, we consider a finite sum problem of the following form:

$$ {\min}_{x \in \mathbb{R}^{d}} f(x)= \frac{1}{n}\sum\limits_{i=1}^{n} m_{i}(x), $$
(8)

where each \(m_{i}: \mathbb {R}^{d} \rightarrow \mathbb {R}\) is a smooth and nonconvex function. Prior to the analysis, we state the mild assumptions and notation required for the proof. We work with the Euclidean spaces \(\mathbb {R}^{n}\) and \(\mathbb {R}^{d}\), which are equipped with the standard inner product 〈⋅,⋅〉 and the Euclidean norm ∥⋅∥. For any function f, \(dom(f):= \{ x \in \mathbb {R}^{d} | f(x) < +\infty \}\) denotes the effective domain of f. If f is continuously differentiable, then ∇f denotes its gradient; if f is twice continuously differentiable, then ∇2f denotes its Hessian. Bold upper-case letters (A, B) represent matrices, normal upper-case letters (X, Y) represent scalars, and lower-case letters (x, y) denote vectors. xt represents the value of x at time step t.

Assumption 1

A function f(x) is at least once differentiable and is L-smooth, i.e., for any \( x, y \in \mathbb {R}^{d}\) we have the following:

$$ f(x) \leq f(y)+ \langle \nabla f(y), x-y \rangle+\frac{L}{2}\parallel x-y \parallel^{2}, \forall x,y \in \mathbb{R}^{d}. $$
(9)

Assumption 2

The level set S0, defined as

$$ S_{0}= \{ x \in \mathbb{R}^{d} \lvert f(x) \leq f(x_{0}) \}, $$
(10)

is bounded; i.e., there exists a constant R > 0 such that ∥x∥ ≤ R for all x ∈ S0, where R ≪ d.

Assumption 3

As a consequence of the above two assumptions, the gradient of each component function and the stochastic gradient are bounded, i.e.,

$$ \parallel \nabla m_{i}(x) \parallel \leq C, \qquad \parallel \nabla m_{i}(x; \xi) \parallel \leq C, $$
(11)

where ξ is a random variable, which implies the following:

$$ \parallel \nabla f(x) \parallel \leq \frac{1}{n}\sum\limits^{n}_{i=1}\parallel \nabla m_{i}(x)\parallel \leq C. $$
(12)

Definition 1

A point x is said to be an 𝜖-first-order stationary point (FSP) of a function \(f: \mathbb {R}^{d} \rightarrow \mathbb {R}\) if

$$ \parallel \nabla f(x)\parallel \leq \epsilon. $$
(13)

3 Methodology

Let f(x) be a noisy objective function, i.e., a stochastic scalar function that is differentiable with respect to the parameters x. We are interested in minimizing the expected value of this function, E[f(x)], w.r.t. its parameters x. Let f1(x),…,fT(x) denote the realizations of the stochastic function at subsequent time steps 1,…,T. The stochasticity may result from the evaluation at random subsamples (minibatches) of data points or from intrinsic function noise. Let ∇x ft(x) denote the gradient, i.e., the vector of partial derivatives of ft w.r.t. x, evaluated at time step t.

3.1 Proposed method

Algorithm 1 WSAGrad(x0, β, γ1, γ2)
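A minimal Python sketch of Algorithm 1, as we read it from update (14) and the recurrence of Lemma 6, is given below; the initialization g0 = 0 and the use of the full parameter vector's norm are our assumptions, and this is not the authors' reference implementation.

```python
import numpy as np

def wsagrad(grad_fn, x0, beta=0.5, gamma1=0.1, gamma2=0.01, T=1000):
    """Sketch of WSAGrad: the step size is gamma1*sin(pi*g/(2*g+1)) + gamma2,
    where g is an exponential moving average of the weight-vector norm ||x_t||."""
    x = np.asarray(x0, dtype=float)
    g = 0.0                                               # assumed initialization of the moving average
    for t in range(T):
        g = beta * g + (1.0 - beta) * np.linalg.norm(x)   # recurrence of Lemma 6
        step = gamma1 * np.sin(np.pi * g / (2.0 * g + 1.0)) + gamma2
        x = x - step * grad_fn(x, t)                      # update (14)
    return x

# usage on a toy quadratic f(x) = 0.5 * ||x||^2 with gradient x
x_star = wsagrad(lambda x, t: x, x0=[1.0, -2.0])
```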

Our proposed Algorithm 1 inherits the qualities of both paradigms of first-order methods. The update equation

$$ x_{t+1} \leftarrow x_{t}- \left(\gamma_{1}\sin\left(\frac{\pi g_{t}}{2 g_{t}+1}\right)+\gamma_{2}\right) \nabla f_{t}(x_{t}), $$
(14)

relies on the following four important parameters: xt (the weight parameters), gt (the exponential moving average of ∥xt∥), β ∈ (0,1), which determines the rate of the moving average, and γ1, which weights the computed step size; γ2 is an additive lower limit that prevents the step size from becoming zero. To understand its connection with existing algorithms and the overall improvement of the proposed algorithm, we simplify (14) by assigning different values to the parameters. First, let the value of γ1 equal zero; then (14) can be written as follows:

$$ x_{t+1} \leftarrow x_{t}- \gamma_{2} \nabla f_{t}(x_{t}). $$
(15)

The above equation is equivalent to the equation of SGD, which signifies how the inherent quality from SGD affects the generalization capability of our algorithm. To check its connection with the adaptive method, we consider γ2 equal to zero. Thus, (14) can be written as follows:

$$ x_{t+1} \leftarrow x_{t}- \left( \gamma_{1}\sin\left( \frac{\pi g_{t}}{2 g_{t}+1}\right)\right) \nabla f_{t}(x_{t}). $$
(16)

Here the value of the sine function depends on gt, and since gt is tuned at every iteration, the value of the sine function changes for each iteration. This adaptability helps our algorithm converge more quickly to better minima. Using the weight vector in the calculation of the step size gives significant advantages over gradient information, as it carries more stable information about the geometry of the data. Random Gaussian weights in a deep neural network preserve local structures in the manifold, as proven in [29]. Weight parameters learn decision boundaries for the given data through extensive training, and the best decision boundaries keep an optimal distance from the points that need to be classified or predicted. In addition, the angles between intraclass and interclass points are preserved while training the network. Thus, it was observed in [29] that every layer keeps the important information about the data, a characteristic that is very desirable for the classification task. Hence, the weight parameters represent the geometric information of the data across the loss landscape. The gradient alone is insufficient to carry such information because it depends on the loss surface, which makes it fluctuate heavily. Hence, we consider a weighted average of the weight parameters' magnitude to compute our step size. Through experiments, we find that it outperforms the baselines. Moreover, we observed that the values of γ1 and γ2 depend greatly on the architecture used for the task.

3.2 Step size analysis

The proposed step size consists of the following three important components: the values of γ1 and γ2, and the sine function applied to the first moment of the weight parameters. The values of γ1 and γ2 determine the range of fluctuation of the step size. These parameters depend on the model architecture: an architecture that smooths out a complex loss landscape requires smaller values, while rougher landscapes call for larger values.

We know that − 1 ≤ sin(x) ≤ 1. For the proposed step size, the argument of the sine satisfies \( 0 \leq \frac{ g_{t} }{ 2 g_{t}+1} \leq 1\). Therefore, \( 0 \leq \frac{ \pi g_{t} }{ 2 g_{t} +1} \leq \pi \). Thus,

$$ 0 \leq \sin\left( \frac{ g_{t} \pi }{ 2 g_{t} +1} \right) \leq 1, $$
(17)
$$ 0 \leq \sum\limits_{k=0}^{T} \sin\left( \frac{ g_{k} \pi }{ 2 g_{k} +1} \right) \leq T+1. $$
(18)

Hence, after T iterations, the sum of the sine terms is at most T + 1 and at least zero. Moreover, the sine function is non-monotone. To understand the behavior of the proposed step size, we consider two ranges of ∥xt∥: ∥xT∥ ≤ 1 and ∥xT∥ ≥ 1. Let us first assume that for T consecutive iterations 0 ≤ ∥xT∥ ≤ 1; then \(\frac{g_{T}}{2g_{T}+1}\) lies near zero. This implies

$$ \sin(\parallel x_{T} \parallel)< \sin\left( \frac{g_{T} \pi}{2g_{T}+1}\right). $$
(19)

This means that if, for a nonconvex problem, the minima lie very near zero and the weights are initialized within the unit ball, then the gradient of the loss with respect to the weights quickly vanishes as the network grows large, since the initial layers of the network suffer from the vanishing gradient problem. The stated problem is directly addressed by our proposed step size: the above inequality assures that the sine over the averaged quantity is greater than or equal to the sine over the original ∥xT∥. To control the stability of the sine term, we weight it through γ1, and when the sine term becomes zero, the variable γ2 assures a decrease in the objective function. Now, when ∥xt∥ ≥ 1, \(\frac{g_{T}}{2g_{T}+1}\) approaches its upper limit, which implies

$$ \sin(\parallel x_{T} \parallel) \geq \sin\left( \frac{g_{T} \pi}{2g_{T}+1}\right). $$
(20)

From existing works, we can reasonably assume that ∇f(xt) is large when x is far from the minima of the loss function, so the magnitude of the resultant vector is also large. Therefore, the sine term is close to 1 and the step size approaches its maximum value γ1 + γ2, which provides an opportunity to reduce the objective value more significantly in the initial iterations. In the initial iterations, the step size increases while the value of g is not very large, and the objective value is reduced significantly. However, when x is close to the minima of the loss function, ∇f(x) becomes small; consequently, the value of g is close to the moving average of all the resultant vectors. The parameters β and (1 − β) weight the past and present magnitudes of the resultant vector, respectively. Therefore, the step length will not be minimal but will decrease, which can help the optimizer surpass suboptimal points.
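To make this behavior concrete, the short sketch below evaluates the step size γ1 sin(πg/(2g+1)) + γ2 for a range of values of g (the γ values are arbitrary illustrative choices); it shows that the step size stays within [γ2, γ1 + γ2] and grows towards its maximum as g grows.

```python
import numpy as np

def step_size(g, gamma1=0.5, gamma2=0.01):
    """WSAGrad step size for a given moving average g of the weight norm."""
    return gamma1 * np.sin(np.pi * g / (2.0 * g + 1.0)) + gamma2

for g in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"g = {g:7.2f}  ->  step size = {step_size(g):.4f}")
# the step size increases from roughly gamma2 towards gamma1 + gamma2 as g grows,
# since pi*g/(2g+1) approaches pi/2 and the sine approaches 1
```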

3.3 Convergence analysis in a deterministic setting

Lemma 4

(Necessary condition for convergence) For a positive series \({\sum}_{n=1}^{\infty}a_{n}\), if for every natural number k the partial sum \(S_{k} = {\sum}_{n=1}^{k}a_{n}\) has a constant upper bound C that is independent of k, then we have

$$ \lim_{n \rightarrow \infty}a_{n}= 0. $$
(21)

Theorem 5

[50] Let \(\{x_{n}\}_{n=1}^{\infty}\) and \(\{y_{n}\}_{n=1}^{\infty}\) be two convergent sequences of real numbers; then,

$$ \text{if} \quad \lim_{n\rightarrow \infty} x_{n} = x \quad \text{and} \quad \lim_{n\rightarrow \infty}y_{n} = y, \qquad \text{then} \quad \lim_{n\rightarrow \infty} x_{n} y_{n} = xy. $$
(22)

Lemma 6

For the time sequence t = 0,…,T − 1, the recurrence is bounded as follows:

$$ g_{t}= \beta g_{t-1}+(1- \beta)\parallel x_{t}\parallel \leq T. $$
(23)

Proof

Iterating and expanding the above recurrence, simplifying, and using Assumption 2, we have

$$ g_{T}= (1-\beta)\sum\limits^{T-1}_{i=0} \beta^{T-1-i}\parallel x_{i}\parallel \leq R(1-\beta^{T}). $$
(24)
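For completeness, the bound follows from Assumption 2 (∥xi∥ ≤ R) and a geometric series; assuming the moving average is initialized at zero, the expansion reads

$$ \begin{array}{@{}rcl@{}} g_{T} &=& (1-\beta)\sum\limits^{T-1}_{i=0} \beta^{T-1-i}\parallel x_{i}\parallel \;\leq\; (1-\beta)R\sum\limits^{T-1}_{i=0} \beta^{T-1-i}\\ &=& (1-\beta)R\,\frac{1-\beta^{T}}{1-\beta} \;=\; R(1-\beta^{T}). \end{array} $$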

As the value of β < 1, for a large T,

$$ R(1-\beta^{T}) \approx R \leq T. $$
(25)

Theorem 7

Suppose f(x) satisfies Assumptions 1, 2 and 3, WSAGrad is run in the batch setting with a positive step size sequence, and \(f^{*} = \inf_{x}f(x) > -\infty\); then

$$ \lim_{T\rightarrow \infty} {\min}_{t=0:T}\parallel \nabla f(x_{t}) \parallel= 0. $$
(26)

Proof

From (9), we have the following:

$$ f(x_{t+1}) \leq f(x_{t})+ \langle \nabla f(x_{t}), x_{t+1}-x_{t} \rangle +\frac{L}{2}\parallel x_{t+1}-x_{t}\parallel^{2}. $$
(27)

Using the update equation given in Algorithm 1, we have the following:

$$ f(x_{t+1}) \leq f(x_{t})+ \left( \frac{L}{2}\eta_{t}-1\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2}. $$
(28)

Rearranging the above equation, we obtain the following:

$$ \left( \frac{L}{2}\eta_{t}-1\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2} \leq f(x_{t+1})- f(x_{t}). $$
(29)

Summing from t = 0 to T − 1, telescoping, and using f(xT) ≥ f(x∗), we obtain

$$ \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2} \leq f(x_{0})- f(x^{*}). $$
(30)

Thus,

$$ \begin{array}{@{}rcl@{}} {\min}_{t=0:T} \parallel \nabla f(x_{t})\parallel^{2} \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t} &\leq& \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2}\\ &\leq& f(x_{0})- f(x^{*}), \end{array} $$
(31)

which is equivalent to the following:

$$ {\min}_{t=0:T} \parallel \nabla f(x_{t}) \parallel^{2} \leq \frac{f(x_{0})- f(x^{*})}{{\sum}^{T-1}_{t=0}(1-\frac{L}{2}\eta_{t})\eta_{t} }. $$
(32)

We can bound the denominator of (32) as follows:

$$ \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t} \leq T\left( 1-\frac{L}{2} \eta_{z}\right)\eta_{z}, $$
(33)

where \(\eta_{z}= \max_{t}(\eta_{t})\) and \(\max_{t}(\eta_{t})=\gamma_{1}+\gamma_{2}\). Rewriting the above inequality provides a bound on the minimum squared gradient norm over T steps:

$$ {\min}_{t=0:T} \parallel \nabla f(x_{t}) \parallel^{2} \leq \frac{f(x_{0})- f(x^{*})}{T(1-\frac{L}{2} \eta_{z})\eta_{z} }. $$
(34)

Letting T → ∞ on both sides, the right-hand side tends to zero whenever the constant coefficient \((1-\frac{L}{2}\eta_{z})\eta_{z}\) is positive, which establishes (26).

4 Empirical study

We study the performance of our algorithm on multiclass classification problems. The experiments span different architectures whose complexity varies from basic to advanced. We use SGD, SGDNesterov and ADAM as baselines on the FMNIST, CIFAR-10 and CIFAR-100 datasets. Each method's performance is evaluated with the best setting of manually tuned hyperparameters and reported over multiple restarts.

4.1 Synthetic experiment

To compare the performance of WSAGrad with that of ADAM, we consider a simple convex function from [18].

$$ f_{t}(x) = \begin{cases} 1010x & \text{with probability } 0.01 \\ -10x & \text{otherwise,} \end{cases} $$
(35)

with the constraint set F = [−1, 1]. As stated in [18], the optimal solution is x = −1; thus, for convergence, we expect the algorithms to converge to x = −1. For this sequence of functions, we investigate the average regret, calculated as in (36), and the value of the iterate xt for ADAM and WSAGrad.

$$ R(T)= \frac{1}{T} \sum\limits_{t=1}^{T}[f_{t}(x_{t})- f_{t}(x^{*})] $$
(36)

To enable a fair comparison, we set β1 = 0.9 and β2 = 0.99, which are typical settings for ADAM. For WSAGrad, we set β = 0.99. Figure 1 shows the average regret and the value of xt at iteration t. From the two plots in Fig. 1, we can conclude that WSAGrad converges to xt = −1, which is the optimal value of the parameter, whereas ADAM converges near zero. Therefore, the average regret of WSAGrad is lower than that of ADAM.
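A minimal sketch of how this synthetic experiment could be reproduced for WSAGrad is given below; the projection onto F = [−1, 1], the iteration count and the γ values are our illustrative choices rather than the paper's exact setup, and ADAM can be run analogously with the update sketched in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_ft(x):
    """Stochastic gradient of f_t in (35): 1010 w.p. 0.01, -10 otherwise."""
    return 1010.0 if rng.random() < 0.01 else -10.0

def project(x):
    """Projection onto the constraint set F = [-1, 1]."""
    return float(np.clip(x, -1.0, 1.0))

# WSAGrad on the synthetic problem; beta = 0.99 as in the text, gammas illustrative
x, g = 0.0, 0.0
beta, gamma1, gamma2 = 0.99, 0.001, 0.0001
for t in range(1, 50001):
    g = beta * g + (1 - beta) * abs(x)                    # moving average of |x_t|
    step = gamma1 * np.sin(np.pi * g / (2 * g + 1)) + gamma2
    x = project(x - step * grad_ft(x))
print("final iterate:", x)   # should drift towards the optimum x = -1
```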

Fig. 1
figure 1

(a) Average regret w.r.t. iterations of ADAM and WSAGrad. (b) Value of xt w.r.t. iterations of ADAM and WSAGrad

4.2 Neural network

With nonconvex objective functions, multilayer neural networks, CNNs, VGG, and ResNet are some of the most powerful deep learning models. Although our convergence analysis may not fully apply to nonconvex problems, we empirically observe that WSAGrad outperforms other methods on the classification task. We use MLP, CNN, VGG, and ResNet architectures for the classification task on three different datasets. An overview of our experiments is presented in Table 2.

Table 2 Overview of experiments: our experiments span classic datasets along with traditional and modern architectures

4.2.1 Multilayer perceptron

We use two configurations for our experiments: one with sigmoid in all layers and another with SoftMax in the last layer and sigmoid in the rest. To measure the performance of our algorithm, we prefer sigmoid over other activation types: using sigmoid in all layers creates vanishing and exploding gradient problems in the initial and final layers, respectively, which can stall the learning task for all algorithms.

FMNIST

Fashion-MNIST (FMNIST) [51] is a dataset consisting of a training set of 60000 example images and a test set of 10000 examples. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels, and each pixel value is an integer between 0 and 255. For our experiments, we normalize the values. Our neural network contains two hidden layers of sizes 256 and 128 and an output layer of size 10. We use the cross-entropy function as the loss function with l2 regularization and the accuracy metric to measure the performance of the different algorithms.

We repeat the experiment multiple times with different seeds, using a batch size of 256 for the SoftMax configuration and 64 for the all-sigmoid configuration. For all experiments, we decay the learning rate after every 1000th iteration with a decay factor of 0.7. We then plot the best results for all algorithms under a consistent configuration. For SGD and its variant, we consider 0.1 a suitable learning rate; for ADAM and its variant, the suitable learning rate is 0.001, with β1 = 0.9 and β2 chosen from {0.99, 0.999}. We manually tune the parameter β of our algorithm to obtain the best set of results for the defined models. WSAGrad provides the best result for β = 0.5.
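For illustration, a rough PyTorch sketch of this FMNIST MLP setup is shown below; the SGD baseline is used as the optimizer (WSAGrad would replace it with the update sketched in Section 3.1), and the weight-decay value, like the rest of the data pipeline, is our assumption rather than the authors' configuration.

```python
import torch
import torch.nn as nn

# two hidden layers (256, 128) with sigmoid activations and a 10-way output
# (SoftMax configuration: CrossEntropyLoss applies log-softmax internally)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.Sigmoid(),
    nn.Linear(256, 128), nn.Sigmoid(),
    nn.Linear(128, 10),
)

# cross-entropy loss; l2 regularization via weight_decay (value is illustrative)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# decay the learning rate by a factor of 0.7 every 1000 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.7)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()          # scheduler stepped per iteration, as described above
    return loss.item()
```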

CIFAR-10

The CIFAR-10 [52, 53] dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order; some training batches may contain more images from one class than another, but together they contain exactly 5000 images from each class. The classes are completely mutually exclusive. For the experiments, normalization was performed. We use a neural network with hidden layers of sizes 655 and 256 and an output layer of size 10. We use the cross-entropy function as the loss function with l2 regularization and the accuracy metric to measure the performance of the different algorithms. We repeat the experiments multiple times with a batch size of 128 for both configurations with different seeds. The rest of the configuration is the same as for the FMNIST dataset, and the best results are plotted. WSAGrad gives the best result for β = 0.6.

CIFAR-100

This dataset [52] is similar to CIFAR-10, except that it has 100 classes containing 600 images each; there are 500 training images and 100 test images per class. The reason for using CIFAR-100 is its increased complexity, since limited data are available per class for training. To test the authenticity of our algorithm, we use both the sigmoid and SoftMax configurations. The dataset is difficult, and using sigmoid makes learning harder. For this specific dataset, we use a dropout of 0.2 just after the first layer and 0.5 after the second layer. The rest of the configuration and preprocessing are similar to those of CIFAR-10. WSAGrad gives the best result for β = 0.6.
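A minimal sketch of this CIFAR-100 MLP variant with the stated dropout rates is given below; the hidden-layer sizes are carried over from the CIFAR-10 configuration, which is our assumption.

```python
import torch.nn as nn

# CIFAR-100 MLP variant: dropout 0.2 after the first hidden layer and 0.5 after
# the second; hidden sizes follow the CIFAR-10 configuration (our assumption)
mlp_cifar100 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 655), nn.Sigmoid(), nn.Dropout(p=0.2),
    nn.Linear(655, 256), nn.Sigmoid(), nn.Dropout(p=0.5),
    nn.Linear(256, 100),
)
```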

The average accuracy of the MLP for different optimizers on the above datasets is shown in Table 3.

Table 3 Mean accuracy with standard deviation of the different optimizers for MLP on different datasets

4.2.2 Deep networks

CNN Our CNN architecture has two alternating stages of 5 × 5 convolutional filters and 3 × 3 max pooling with stride two, followed by a fully connected layer of 128 sigmoid units. The channel sizes are 32 and 16, respectively. The batch size is 128. Cross-entropy with l2 regularization is used as the loss. For CIFAR-10, we use the same configuration with a few changes: in one set of experiments, we replace the sigmoid units of the last layer with SoftMax; in another set, we keep sigmoid units in the last layer; and in a third set, we use ReLU in the fully connected layers. For CIFAR-100, we use the architecture with all sigmoids.
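A rough PyTorch sketch of this CNN is given below, assuming 32x32 RGB inputs (e.g., CIFAR-10), valid (no-padding) convolutions and sigmoid activations after each convolutional stage; none of these details is fully specified above, so they are our assumptions.

```python
import torch.nn as nn

# two conv + pool stages (channels 32 then 16), then an FC layer of 128 sigmoid units;
# the flattened size 16*4*4 follows from 32x32 inputs with no padding (our assumption)
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(32, 16, kernel_size=5), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 128), nn.Sigmoid(),
    nn.Linear(128, 10),          # swap the final activation for the SoftMax/sigmoid/ReLU variants
)
```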

VGG on CIFAR-10

VGGNet [54] is another advanced computer-vision architecture that followed the CNN; the structure increases the depth of CNNs. We use the VGG-11 architecture for our experiment, as shown in Fig. 6. It consists of 11 weighted layers. The weights represent the strength of the connections between units in adjacent network layers; weights close to zero mean that changing the input will not change the output, or that the change will be negligible. The primary reason for using VGG-11 in our experiments is the vanishing gradient problem, which occurs when the weights shrink substantially towards zero, significantly impacting the learning task. The normal distribution is used to initialize the weights. An overview of the architecture is given below.

We conduct our experiments on the above architecture with different variations, tested on different batch sizes with fine-tuned learning rates. The variations of the architecture are created by replacing the activation function of the fully connected last layers. The reason for testing on different batch sizes is that a large batch induces smoothness, while a smaller batch does not; therefore, training on small batches is more difficult than on larger batches.

ResNet18 on CIFAR-10

ResNet [55] is a more advanced architecture than VGG. It allows residual connections between layers, which helps it overcome the vanishing gradient problem. We modify the architecture by replacing the activation type of the last layer. The network in Table 8 is tested on CIFAR-10 with different batch sizes and finely tuned hyperparameters. Cross-entropy with l2 regularization is used as the loss. The hyperparameters for the baselines and the proposed method are shown in Table 2. The value of γ1 is selected based on stability and a low error rate; it is determined through multiple sets of experiments on different values of γ1, as shown in Fig. 8. The most suitable choices of γ1 for this architecture were {0.3, 0.5}, as can be observed from Fig. 8. We select 0.5 over 0.3 due to its better stability and accuracy.

4.3 Results and discussion

For each set of experiments, we plot the best results and report the mean and the standard deviation of the algorithms' performance in Tables 3, 4, 5, 6, 7, 8 and 9. For MLP, the performance of WSAGrad is equivalent to or greater than that of the existing state-of-the-art techniques.

Table 4 Mean accuracy with standard deviation of the different optimizers on a convolutional net with sigmoid and SoftMax activation functions on different datasets
Table 5 Mean accuracy with standard deviation of the different optimizers on a convolutional net with ReLU and SoftMax activation functions on different datasets
Table 6 Mean accuracy with standard deviation and execution time on VGG-11 for CIFAR-10: the accuracy is averaged over multiple restarts with linear activation in the last layer and a batch size of 128
Table 7 Mean accuracy with standard deviation on VGG-11 for CIFAR-10: The accuracy is averaged over multiple restarts with different batch sizes and activation types; for SoftMax with batch size 32, we consider the value of {γ1,γ2}= {0.05,0.01}; for batch size 128, {γ1,γ2}= {0.1,0.01} and β = 0.1
Table 8 Architectures of ResNet-18 on CIFAR-10
Table 9 Mean accuracy with standard deviation of ResNet-18 on CIFAR-10: the accuracy is averaged over multiple restarts with different batch sizes and activation types

From Figs. 2 and 3, we find that the performance of WSAGrad is better than that of ADAM and AdamW. Additionally, from Fig. 2, we observe that ADAM, AdamW, and SGDNesterov converge to a nearby neighbourhood; nevertheless, the generalization capability of SGDNesterov suffers. This is due to the continuous addition of bias in the update rule when the momentum parameter is large: a large momentum parameter prefers past updates to the current one. In theory, the performance of SGDNesterov is independent of initialization, but in practice, we observe that the performance of SGD and SGDNesterov depends heavily on the initialization and the complexity of the model. Furthermore, we observe that the performance of WSAGrad increases smoothly with successive iterations, as reflected in Figs. 2 and 3, without affecting the model's generalization capability. We observe similar behavior of our algorithm for all the models and datasets used for empirical validation.

Fig. 2
figure 2

Results for the FMNIST MLP sigmoid configuration: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Fig. 3
figure 3

Results for CIFAR-100 with the MLP sigmoid configuration: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

It is worth noting from Table 3 that the mean accuracy of our algorithm is higher than that of the top baselines, except for MLP with SoftMax, where ADAM is equivalent to or greater than WSAGrad. A possible reason is that MLP with SoftMax forces a one-winner structure, even when it has access to a limited set of features. For example, training a neural network creates sets of groups across the different layers and within the layers, as proved in [29]. Each group carries information about a set of features; using SoftMax in the last layer induces a one-winner relationship between groups. This relationship improves the generalization capability when the dataset has either a higher number of variables or a lower number of instances. However, when we run WSAGrad on MLP with SoftMax, the deep relation between the groups and the geometric information of the data carried by the neural network is lacking. This is not the case when MLP is used with sigmoid, as it induces a multiple-winner structure over the neural network that carries enough geometric information for WSAGrad. For the deep networks, we perform multiple sets of experiments with different architectures; the baselines and the proposed method are tested with these architectures, and the results are reported here. For the CNN, the results are stated in Tables 4 and 5. From the results, we conclude that WSAGrad performs better for all activation types. The best performance of our proposed method is obtained under the ReLU variation; under this variation, for CIFAR-10 and CIFAR-100, the performance of SGDNesterov is equivalent to ours, followed by SGD and ADAM. Beyond raw performance, we observe that WSAGrad is more robust to a model change than the baselines. This property makes it more reliable, as we can achieve significantly better accuracy with a basic set of models and activations; in other words, WSAGrad can be trusted in the challenging task of designing a better model, loss function or optimizer. Figures 4 and 5 validate its ability to outperform other state-of-the-art methods. For advanced architectures such as VGG-11 (Fig. 6) and ResNet-18, we create different variations of the architectures with different activations and test them on different batch sizes. We use the VGG-11 architecture with SoftMax or a linear function as the activation in the last layer. We can observe from Table 7 that WSAGrad performs best under different variations and batch sizes. Moreover, we observe that VGG-11 with a small batch size is difficult to train, as it quickly leads to overfitting. For larger batch sizes, the performance depends on the activation type and hyperparameters. WSAGrad is worth using for training, as its performance does not fluctuate much under different settings. The results and execution times are reported in Table 6. Figure 7 depicts our findings: the behavior of WSAGrad is nonmonotonic, as it oscillates within a certain range. This type of oscillation is required to find a better minimum. WSAGrad forces the value of the objective function to increase so that it can reach better minima in subsequent iterations. The minima of a loss landscape lie just below suboptimal points [56], and most of the minima are on the same level. The absence of a gradient in such a region stagnates the performance of the optimizers and leads them to a minimum near the initialization point [57].
With WSAGrad, the exponential moving average of the weight parameters does not decrease quickly, even under small updates. Therefore, the value of the sine function does not become small, and a moderate step size is maintained. A moderate step size increases the objective value to a certain extent so that it can be decreased as needed in subsequent iterations. In addition, our step size depends on the weight parameters, which are tuned at every iteration; thus, with each iteration, the chance of reaching a better minimum increases. ResNet is designed to handle the problems of VGG-11. The sensitivity of the hyperparameters for ResNet-18 is tested and shown in Fig. 8, and the results for different variations and batch sizes are reported in Table 9. The best results are shown in Fig. 9. We observe that WSAGrad performs well when SoftMax is used in the final layer; when we switch to ReLU, ADAM dominates WSAGrad. From Fig. 9, we also note its nonmonotone behavior.

Fig. 4
figure 4

Results for CIFAR-100 with the CNN sigmoid configuration: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Fig. 5
figure 5

Results for CIFAR-10 with CNN: the top figures show the training loss, testing loss and testing accuracy with SoftMax in the last layer for WSAGrad and the baselines. The bottom figures show the training loss, testing loss and testing accuracy with the sigmoid configuration for WSAGrad and the baselines

Fig. 6
figure 6

Architecture of VGG-11 on CIFAR-10

Fig. 7
figure 7

Results for CIFAR-10 with VGG11: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Fig. 8
figure 8

Sensitivity of the hyperparameter for ResNet18: (a) Testing loss w.r.t. epoch and (b) testing accuracy w.r.t. epoch for different values of the hyperparameter

Fig. 9
figure 9

Results for CIFAR-10 with ResNet18: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Overall, we observe that WSAGrad outperforms the other algorithms on the classification task across different architectures, batch sizes and activation types. SGDNesterov performs better than or on par with ADAM for FMNIST and CIFAR-10. Additionally, we emphasize that the test accuracy across the models and datasets for all the algorithms is relatively low compared with the accuracy achievable with different architectures: we refrain from using any other methods that can smooth out the loss landscape and thereby mask the algorithms' actual performance. Nevertheless, we assert that WSAGrad outperforms the baselines under proper initialization of the parameters. On CIFAR-10, the accuracy achieved by WSAGrad is better than that of all the baselines under the different activation-type settings.

5 Conclusion and future work

The present work introduces a novel step size to mitigate the vanishing gradient problem under extreme nonconvexity when training deep neural networks. The experimental findings under various conditions show that the proposed method is considerably better than the existing baselines for the classification task; even for a simple convex problem, we have shown its preeminence. The strength of our algorithm lies in its relationship with the geometry of the data and the loss function: the proposed step size carries the geometric information and maps it as required, nullifying the scope of divergence. The effectiveness of the proposed method can be observed in an overall average performance gain of 3-4% on the standard metric (accuracy) for the classification task. With sigmoid activation, the average performance gain of WSAGrad for the basic architectures, i.e., MLP and CNN, is nearly 5-6%; for advanced architectures such as VGG and ResNet, we obtained an average performance gain of 1-2% with SoftMax activation. We expect the proposed method to perform well on similar objective tasks such as language modeling, neural translation, image reconstruction and many other applications in various domains, and we plan to conduct experiments on diversified datasets in those domains as future work. Additionally, it is essential to rigorously understand the behavior of such methods and to be aware of potential pitfalls when using them in practice. We believe this paper is a first step in this direction and suggests a good design for faster and better optimization.