1 Introduction

The learning efficiency of deep networks [1, 2, 3] has allowed them to develop rapidly, and they have been used successfully in various applications, such as machine vision [4], machine translation [5], sound recognition [6], charge prediction in lithium-ion batteries [7, 8] and many more. The deep network-based approach (FFLSTM) in [8] establishes the groundwork for long-term state prediction of lithium-ion batteries, improving energy management and safety. The extension of deep networks to such diverse applications demonstrates their robustness. In addition to this practical success, theoretical findings support their general competence [9, 10]. However, training deep learning models for nonconvex problems is strenuous because finding a global minimum is intractable. In the early 1990s, the authors of [11] showed that training a multilayer perceptron (MLP) is NP-hard, a result that significantly contributed to the shrinkage of the field. While recent practical successes have revived the area, we still do not know the best minima that can be reached for deep models when trained via different optimizers.

Despite being proposed in the last century, the stochastic gradient descent (SGD) algorithm remains one of the most effective algorithms for training deep neural networks. Its simplicity and efficiency make it suitable for almost all applications. However, SGD has the disadvantage of scaling the gradient uniformly in all directions, which can result in poor performance and slow training when the data are sparse. Several variants of SGD have been proposed [12, 13, 14] to speed up training and improve performance. These variants perform well when the momentum parameter is large. However, a large momentum induces staleness [15], implying a preference for past gradients over the current gradient, and it therefore comes at the cost of reduced generalization capacity. The authors of [16] showed the instability of Nesterov's accelerated method with respect to initialization and an exponentially fast deterioration in performance with the number of gradient steps. SGD's dependence on the learning rate and the need for proper initialization in ill-conditioned problems necessitate a paradigm shift towards adaptive algorithms with provable guarantees.

A number of adaptive gradient algorithms have been proposed to address these problems. These algorithms typically alter the learning rate for each gradient coordinate based on the current geometry (curvature) of the objective function. For example, Adagrad [17] computes the adaptive learning rate by dividing the gradient by a denominator equal to the root mean square of the previous gradients. Sparse gradients were reported to lead to quicker convergence of Adagrad compared with vanilla SGD [17]. However, Adagrad has a limited ability to generalize to unseen data [18]. In the presence of sizeable dense gradients, the accumulated mean in the denominator makes the update small, leading to the failure of Adagrad.

Some adaptive algorithms have been proposed to address this issue by replacing the denominator with the square root of the exponential moving average of past squared gradients, e.g., RMSProp [19], AdaDelta [20], and ADAM [21]. ADAM has become the most popular among these variants due to its faster convergence and computational efficiency. Regardless of its faster convergence, recent theoretical studies show its instability and weaker convergence guarantees [18, 22]. Experiments conducted in [23] show that when the decay rate of the second-moment accumulator is too slow, larger-than-desired updates might be produced. The generation of extreme learning rates during training is reported in [24]. The parameter 𝜖 is introduced in the denominator of the update rule of ADAM [21] to prevent the denominator from becoming zero; its influence on the performance of ADAM was later examined in [25].

To address the challenges of ADAM and SGD, we introduce an Adaptive Gradient method with a Weighted Sine-based step size (WSAGrad). Sine- and cosine-based updates are generally used in derivative-free algorithms [26, 27, 28]; to the best of our knowledge, this is the first work to use a sine function to compute the step size of a gradient-based method. The method captures the geometric properties of the data through the weight parameters. The intuition for using the weight vector is linked to the idea that a deep network initialized with random Gaussian weights preserves the metric structure of the data, as shown in [29]. These structures propagate along the layers, enabling robust recovery of the original data from the features computed by the network. In other words, a trained deep network preserves the angles between intraclass and interclass points, which implies that every layer preserves the important information about the data [29]. Hence, using the weight vector to compute the step size of WSAGrad incorporates this information, which is further tuned with each iteration. WSAGrad shows a significant improvement in performance over other state-of-the-art techniques on different deep learning tasks. A theoretical guarantee is provided under some mild assumptions. The main highlights of this paper are as follows:

  • The paper proposes a novel adaptive step-size-based gradient descent algorithm that inherits important capabilities of both approaches, namely the adaptability of ADAM and the generalization ability of SGD.

  • The step size is constructed using a trigonometric function over the exponential moving average of the norm of the weight parameters. It is designed to mitigate the vanishing gradient problem: instead of stopping the learning task, it minutely increases the objective value so that subsequent iterations can reach a better minimum. Additionally, two controlling parameters are used to nullify the scope of divergence and ensure stability inside a certain region. To the best of our knowledge, this is the first work to use a network's weight parameters in the step size.

  • The step size is computationally efficient due to the small memory requirement per iteration with only one extra set of auxiliary variables. Furthermore, it reduces human intervention and hyperparameter tuning, making it more robust to different optimization problems.

  • Under minimal assumptions of smoothness and a bounded level set, we prove the convergence of our algorithm.

  • We validate the performance of our algorithm through extensive experiments. We compare our algorithm against the baselines on the task of image classification on the real-world datasets FMNIST, CIFAR-10 and CIFAR-100. The experiments span basic to advanced architectures with different batch sizes and various activation types.

It is worth noting that our algorithm's performance continues to improve significantly with the number of iterations where most existing methods saturate. We also show the stability of our algorithm in all sets of experiments. For each set of experiments, we report the best results, the mean and the standard deviation of our algorithm's performance. The remainder of this paper is organized as follows: we discuss the literature in Section 2. Section 3 outlines the problem definition in detail, followed by the proposed methodology and a theoretical guarantee under mild assumptions. Next, we describe the datasets, experimental setup, and baselines in Section 4, followed by the results and analysis under the same heading. Lastly, Section 5 concludes the paper with some directions for future work.

2 Background and notation

2.1 Background

Training a deep neural network (DNN) for a large-scale machine learning problem is a complex task. Recently, adaptive gradient methods, a class of optimization tools, have drawn the community's attention due to their faster convergence on such tasks. The first popular algorithm in this line of research is Adagrad [17, 30], which achieves significantly better performance than vanilla SGD when gradients are sparse or generally small. The update equation of Adagrad is as follows:

$$ x_{t+1}=x_{t}-\eta_{t}\frac{g_{t}}{\sqrt{v_{t}}}, $$
(1)

where gt = ∇f(xt;ξt), \(v_{t}= \frac{1}{t}\sum_{j=1}^{t} g_{j}^{2}\), and \(\eta_{t} = \frac{\eta}{\sqrt{t}}\), where η > 0 is the step size. The above update adapts the learning rate for each dimension. Although Adagrad works well in sparse settings, its performance has been observed to degrade when the loss function is nonconvex and gradients are dense. Because the update uses all previous gradients, the learning rate declines quickly in these settings, which makes the method unattractive. This issue is compounded even more in high-dimensional deep learning problems.
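For concreteness, a minimal NumPy sketch of the Adagrad update (1) is shown below; note that with \(v_{t}\) defined as the mean of squared gradients and \(\eta_{t}=\eta/\sqrt{t}\), the update is equivalent to dividing by the root of the running sum of squared gradients with a constant η, which is what the code does (the small eps added for numerical stability is our own choice).

```python
import numpy as np

def adagrad_step(x, grad, accum, eta=0.01, eps=1e-8):
    """One Adagrad update: each coordinate is divided by the root of the
    accumulated squared gradients, equivalent to (1)."""
    accum = accum + grad ** 2                         # running sum of squared gradients
    x_new = x - eta * grad / (np.sqrt(accum) + eps)
    return x_new, accum

# usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x
x = np.array([1.0, -2.0])
accum = np.zeros_like(x)
for _ in range(100):
    x, accum = adagrad_step(x, grad=x, accum=accum)
```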

To mitigate this problem, several variants of Adagrad, such as RMSProp [19], ADAM [21], AdaDelta [20], and NADAM [31], have been proposed to handle the rapid decay of the learning rate by using exponential moving averages of past squared gradients. Essentially, these strategies limit the update to only the past few gradients. Of these variants, ADAM is the default method of choice for training DNNs. Its update equations adopt the following form:

$$ \begin{array}{@{}rcl@{}} m_{t}&=& \alpha_{1}m_{t-1}+(1-\alpha_{1})\nabla f(x_{t};\xi_{t}), \quad \hat{m}_{t}= \frac{m_{t}}{1-\alpha_{1}^{t}},\\ v_{t}&=& \alpha_{2}v_{t-1}+(1- \alpha_{2})(\nabla f(x_{t};\xi_{t}))^{2}, \quad \hat{v}_{t}= \frac{v_{t}}{1-\alpha_{2}^{t}}, \end{array} $$
(2)
$$ x_{t+1}= x_{t}-\frac{\eta_{t}\hat{m}_{t}}{\sqrt{\hat{v}_{t}}+\epsilon}, \quad \forall t \geq 1, $$
(3)

where α1, α2 ∈ (0,1), 𝜖 > 0 and \(\eta_{t}= \frac{\eta}{\sqrt{t}}\) with η > 0. However, to our knowledge, ADAM still has no complete convergence proof: the proof in the original paper [21] was shown to be flawed in [32, 33]. A counterexample was introduced in [18], together with an improved form of (3) named AMSGrad. It follows:

$$ \hat{v}_{t}= \max(\hat{v}_{t-1},\hat{v}_{t}); \quad x_{t+1}= x_{t}-\frac{\eta_{t}\hat{m}_{t}}{\sqrt{\hat{v}_{t}}}. $$
(4)
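Similarly, a minimal NumPy sketch of the ADAM update (2)-(3), with the AMSGrad correction (4) as an option, is given below; the function name and default constants follow common practice and are not taken verbatim from the cited papers.

```python
import numpy as np

def adam_step(x, grad, m, v, v_max, t, eta=0.001,
              alpha1=0.9, alpha2=0.999, eps=1e-8, amsgrad=False):
    """One ADAM step as in (2)-(3); t starts at 1. With amsgrad=True the
    running maximum of the second moment is used, as in (4)."""
    m = alpha1 * m + (1 - alpha1) * grad              # first moment
    v = alpha2 * v + (1 - alpha2) * grad ** 2         # second moment
    m_hat = m / (1 - alpha1 ** t)                     # bias correction
    v_hat = v / (1 - alpha2 ** t)
    if amsgrad:
        v_max = np.maximum(v_max, v_hat)              # keeps the effective step size non-increasing
        denom = np.sqrt(v_max) + eps
    else:
        denom = np.sqrt(v_hat) + eps
    x_new = x - eta * m_hat / denom
    return x_new, m, v, v_max
```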

However, in experiments, AMSGrad does not show improvements [34, 35]; on the contrary, it sometimes results in worse accuracy than ADAM. For ADAM-type optimizers, convergence is studied in [35] for nonconvex problems, and the authors reported that a minimum could be achieved with no guarantee of staying there. Another work [36] used a dynamical-system viewpoint to interpret the ADAM optimizer, via a non-autonomous ordinary differential equation without the Lipschitz condition. Subsequently, to improve the generalization performance of ADAM, another variant with a convergence guarantee, namely AdamW, was introduced [37]. It is defined as follows:

$$ \begin{array}{@{}rcl@{}} g_{t}&=& \nabla f(x_{t};\xi_{t})+\lambda x_{t}, \quad m_{t}= \alpha_{1}m_{t-1}+(1-\alpha_{1})g_{t},\\ \hat{m}_{t}&=& \frac{m_{t}}{1-\alpha_{1}^{t}}, \quad v_{t}= \alpha_{2}v_{t-1}+(1- \alpha_{2})g_{t}^{2}, \quad \hat{v}_{t}= \frac{v_{t}}{1-\alpha_{2}^{t}}, \end{array} $$
(5)
$$ x_{t+1}=x_{t}-\eta_{t}\left(\alpha\hat{m}_{t}/\left(\sqrt{\hat{v}_{t}}+\epsilon\right)+\lambda x_{t}\right), $$
(6)

where α1, α2 ∈ (0,1), α > 0, λ > 0 and 𝜖 > 0. In [38], under a convex setting, the authors proposed stabilizing the coordinatewise weighting factors to ensure convergence. PADAM [39] was introduced with a partial adaptive parameter that changes the coordinatewise weighting factor for better generalization. The convergence rates of the original ADAM and RMSProp under the full-batch (deterministic) setting are provided in [40, 41]. AdaBelief [42] has been proposed to obtain good generalization by adapting the step size according to the 'belief' in the current gradient direction. All of its equations are identical to those of ADAM except for vt, which is defined as follows:

$$ v_{t}=\alpha_{2}v_{t-1}+(1-\alpha_{2})(\nabla f(x_{t};\xi_{t})-m_{t})^{2}+\epsilon. $$
(7)

More recently, some accelerated adaptive gradient methods [43, 44] have been proposed based on variance-reduction techniques. In particular, the authors of [44] proposed SUPERADAM, a faster and universal adaptive gradient framework based on a universal adaptive matrix. Recently, an adaptive variant of Nesterov accelerated gradient descent was introduced in [45], which replaces the constant momentum with an adaptive momentum. To ensure stability, the momentum is reset to zero according to a scheduler. In Table 1, we list some of the most recent algorithms and their convergence rates.

Table 1 Convergence rates of different algorithms: here, T denotes the number of iterations, and b denotes the mini-batch size

This paper attempts to design an algorithm that works well with a minimal set of resources and models. Rather than focusing on the moving average of the gradient, as most of the previous algorithms do, our work relies on the weight parameters. The idea is that with each iteration, we obtain weights that are better tuned than the previous weights. An adjusted weight with a step size as defined in [46] works towards decreasing the objective function with better smoothness. Next, we present the assumptions used for the theoretical analysis; in the following section, we define the problem and introduce the proposed algorithm and its convergence proof.

2.2 Assumptions and notations

In general, we consider a finite sum problem of the following form:

$$ {\min}_{x \in \mathbb{R}^{d}} f(x)= \frac{1}{n}\sum\limits_{i=1}^{n} m_{i}(x), $$
(8)

where each \(m_{i}: \mathbb {R}^{d} \rightarrow \mathbb {R}\) is a smooth and nonconvex function. Prior to the analysis, we state the mild assumptions and notation required for the proof. We work with the Euclidean spaces \(\mathbb {R}^{n}\) and \(\mathbb {R}^{d}\), which are equipped with the standard inner product 〈⋅,⋅〉 and the Euclidean norm ∥⋅∥. For any function f, \(dom(f):= \{ x \in \mathbb {R}^{d} | f(x) < +\infty \}\) denotes the effective domain of f. If f is continuously differentiable, then ∇f denotes its gradient; if f is twice continuously differentiable, then ∇2f denotes its Hessian. Bold upper-case letters (A, B) represent matrices, normal upper-case letters (X, Y) represent scalars, and lower-case letters (x, y) denote vectors. xt represents the value of x at time step t.

Assumption 1

A function f(x) is at least once differentiable and is L-smooth, i.e., for any \( x, y \in \mathbb {R}^{d}\) we have the following:

$$ f(x) \leq f(y)+ \langle \nabla f(y), x-y \rangle+\frac{L}{2}\parallel x-y \parallel^{2}, \forall x,y \in \mathbb{R}^{d}. $$
(9)

Assumption 2

The level set S0, defined as

$$ S_{0}= \{ x \in \mathbb{R}^{d} \lvert f(x) \leq f(x_{0}) \}, $$
(10)

is bounded; i.e., there exists a constant R > 0 such that ∥x∥ ≤ R for all x ∈ S0, where R ≪ d.

Assumption 3

As a consequence of the above two assumptions, the gradient of each component function and the stochastic gradient are bounded, i.e.,

$$ \parallel \nabla m_{i}(x) \parallel \leq C, \qquad \parallel \nabla m_{i}(x; \xi) \parallel \leq C, $$
(11)

where ξ is a random variable, which implies the following:

$$ \parallel \nabla f(x) \parallel \leq \frac{1}{n}\sum\limits^{n}_{i=1}\parallel \nabla m_{i}(x)\parallel \leq C. $$
(12)

Definition 1

A point x is said to be an 𝜖-first-order stationary point (FSP) of a function \(f: \mathbb {R}^{d} \rightarrow \mathbb {R}\) if

$$ \parallel \nabla f(x)\parallel \leq \epsilon. $$
(13)

3 Methodology

Let f(x) be a noisy objective function, i.e., a stochastic scalar function that is differentiable with respect to the parameters x. We are interested in minimizing the expected value of this function, E[f(x)], w.r.t. its parameters x. Let f1(x),…,fT(x) denote the realizations of the stochastic function at subsequent time steps 1,…,T. The stochasticity may result from the evaluation at random subsamples (minibatches) of data points or from intrinsic function noise. Let ∇x ft(x) denote the gradient, i.e., the vector of partial derivatives of ft w.r.t. x, evaluated at time step t.

3.1 Proposed method

Algorithm 1 WSAGrad(x0, β, γ1, γ2)
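A minimal Python sketch of Algorithm 1, as we read it from update (14) and the recurrence of Lemma 6, is given below; the initialization g0 = 0 and the use of the full parameter vector's norm are our assumptions, and this is not the authors' reference implementation.

```python
import numpy as np

def wsagrad(grad_fn, x0, beta=0.5, gamma1=0.1, gamma2=0.01, T=1000):
    """Sketch of WSAGrad: the step size is gamma1*sin(pi*g/(2*g+1)) + gamma2,
    where g is an exponential moving average of the weight-vector norm ||x_t||."""
    x = np.asarray(x0, dtype=float)
    g = 0.0                                               # assumed initialization of the moving average
    for t in range(T):
        g = beta * g + (1.0 - beta) * np.linalg.norm(x)   # recurrence of Lemma 6
        step = gamma1 * np.sin(np.pi * g / (2.0 * g + 1.0)) + gamma2
        x = x - step * grad_fn(x, t)                      # update (14)
    return x

# usage on a toy quadratic f(x) = 0.5 * ||x||^2 with gradient x
x_star = wsagrad(lambda x, t: x, x0=[1.0, -2.0])
```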

Our proposed Algorithm 1 inherits the qualities of both paradigms of first-order methods. The update equation

$$ x_{t+1} \leftarrow x_{t}- \left(\gamma_{1}\sin\left(\frac{\pi g_{t}}{2 g_{t}+1}\right)+\gamma_{2}\right) \nabla f_{t}(x_{t}), $$
(14)

relies on the following four important parameters: xt (the weight parameters), gt (the exponential moving average of ∥xt∥), β ∈ (0,1), which determines the rate of the moving average, and γ1, which weights the computed step size; γ2 is an additive lower limit that prevents the step size from becoming zero. To understand its connection with existing algorithms and the overall improvement of the proposed algorithm, we simplify (14) by assigning different values to the parameters. First, let the value of γ1 equal zero; then (14) can be written as follows:

$$ x_{t+1} \leftarrow x_{t}- \gamma_{2} \nabla f_{t}(x_{t}). $$
(15)

The above equation is equivalent to the equation of SGD, which signifies how the inherent quality from SGD affects the generalization capability of our algorithm. To check its connection with the adaptive method, we consider γ2 equal to zero. Thus, (14) can be written as follows:

$$ x_{t+1} \leftarrow x_{t}- \left( \gamma_{1}\sin\left( \frac{\pi g_{t}}{2 g_{t}+1}\right)\right) \nabla f_{t}(x_{t}). $$
(16)

Here the value of the sine function depends on gt, and since gt is tuned at every iteration, the value of the sine function changes for each iteration. This adaptability helps our algorithm converge more quickly to better minima. Using the weight vector in the calculation of the step size gives significant advantages over gradient information, as it carries more stable information about the geometry of the data. Random Gaussian weights in a deep neural network preserve local structures in the manifold, as proven in [29]. Weight parameters learn decision boundaries for the given data through extensive training, and the best decision boundaries keep an optimal distance from the points that need to be classified or predicted. In addition, the angles between intraclass and interclass points are preserved while training the network. Thus, it was observed in [29] that every layer keeps the important information about the data, a characteristic that is very desirable for the classification task. Hence, the weight parameters represent the geometric information of the data across the loss landscape. The gradient alone is insufficient to carry such information because it depends on the loss surface, which makes it fluctuate heavily. Hence, we consider a weighted average of the weight parameters' magnitude to compute our step size. Through experiments, we find that it outperforms the baselines. Moreover, we observed that the values of γ1 and γ2 depend greatly on the architecture used for the task.

3.2 Step size analysis

The proposed step size consists of the following three important components: the values of γ1 and γ2, and the sine function applied to the first moment of the weight parameters. The values of γ1 and γ2 determine the range of fluctuation of the step size. These parameters depend on the model architecture: an architecture that smooths out a complex loss landscape requires smaller values, while rougher landscapes call for larger values.

We know that − 1 ≤ sin(x) ≤ 1. For the proposed step size, the argument of the sine satisfies \( 0 \leq \frac{ g_{t} }{ 2 g_{t}+1} \leq 1\). Therefore, \( 0 \leq \frac{ \pi g_{t} }{ 2 g_{t} +1} \leq \pi \). Thus,

$$ 0 \leq \sin\left( \frac{ g_{t} \pi }{ 2 g_{t} +1} \right) \leq 1, $$
(17)
$$ 0 \leq \sum\limits_{k=0}^{T} \sin\left( \frac{ g_{k} \pi }{ 2 g_{k} +1} \right) \leq T+1. $$
(18)

Hence, after T iterations, the sum of the sine terms is at most T + 1 and at least zero. Moreover, the sine function is non-monotone. To understand the behavior of the proposed step size, we consider two ranges of ∥xt∥: ∥xT∥ ≤ 1 and ∥xT∥ ≥ 1. Let us first assume that for T consecutive iterations 0 ≤ ∥xT∥ ≤ 1; then \(\frac{g_{T}}{2g_{T}+1}\) lies near zero. This implies

$$ \sin(\parallel x_{T} \parallel)< \sin\left( \frac{g_{T} \pi}{2g_{T}+1}\right). $$
(19)

This means that if, for a nonconvex problem, the minima lie very near zero and the weights are initialized within the unit ball, then the gradient of the loss with respect to the weights quickly vanishes as the network grows large, since the initial layers of the network suffer from the vanishing gradient problem. The stated problem is directly addressed by our proposed step size: the above inequality assures that the sine over the averaged quantity is greater than or equal to the sine over the original ∥xT∥. To control the stability of the sine term, we weight it through γ1, and when the sine term becomes zero, the variable γ2 assures a decrease in the objective function. Now, when ∥xt∥ ≥ 1, \(\frac{g_{T}}{2g_{T}+1}\) approaches its upper limit, which implies

$$ \sin(\parallel x_{T} \parallel) \geq \sin\left( \frac{g_{T} \pi}{2g_{T}+1}\right). $$
(20)

From existing works, we can reasonably assume that ∇f(xt) is large when x is far from the minima of the loss function, so the magnitude of the resultant vector is also large. Therefore, the sine term is close to 1 and the step size approaches its maximum value γ1 + γ2, which provides an opportunity to reduce the objective value more significantly in the initial iterations. In the initial iterations, the step size increases while the value of g is not very large, and the objective value is reduced significantly. However, when x is close to the minima of the loss function, ∇f(x) becomes small; consequently, the value of g is close to the moving average of all the resultant vectors. The parameters β and (1 − β) weight the past and present magnitudes of the resultant vector, respectively. Therefore, the step length will not be minimal but will decrease, which can help the optimizer surpass suboptimal points.
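To make this behavior concrete, the short sketch below evaluates the step size γ1 sin(πg/(2g+1)) + γ2 for a range of values of g (the γ values are arbitrary illustrative choices); it shows that the step size stays within [γ2, γ1 + γ2] and grows towards its maximum as g grows.

```python
import numpy as np

def step_size(g, gamma1=0.5, gamma2=0.01):
    """WSAGrad step size for a given moving average g of the weight norm."""
    return gamma1 * np.sin(np.pi * g / (2.0 * g + 1.0)) + gamma2

for g in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"g = {g:7.2f}  ->  step size = {step_size(g):.4f}")
# the step size increases from roughly gamma2 towards gamma1 + gamma2 as g grows,
# since pi*g/(2g+1) approaches pi/2 and the sine approaches 1
```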

3.3 Convergence analysis in a deterministic setting

Lemma 4

(Necessary condition for convergence) For a positive series \({\sum}_{n=1}^{\infty}a_{n}\), if for every natural number k the partial sum \(S_{k} = {\sum}_{n=1}^{k}a_{n}\) has a constant upper bound C that is independent of k, then we have

$$ \lim_{n \rightarrow \infty}a_{n}= 0. $$
(21)

Theorem 5

[50] Let \(\{x_{n}\}_{n=1}^{\infty}\) and \(\{y_{n}\}_{n=1}^{\infty}\) be two convergent sequences of real numbers; then,

$$ \text{if} \quad \lim_{n\rightarrow \infty} x_{n} = x \quad \text{and} \quad \lim_{n\rightarrow \infty}y_{n} = y, \qquad \text{then} \quad \lim_{n\rightarrow \infty} x_{n} y_{n} = xy. $$
(22)

Lemma 6

For the time sequence t = 0,…,T − 1, the recurrence is bounded as follows:

$$ g_{t}= \beta g_{t-1}+(1- \beta)\parallel x_{t}\parallel \leq T. $$
(23)

Proof

Iterating and expanding the above recurrence, simplifying, and using Assumption 2, we have

$$ g_{T}= (1-\beta)\sum\limits^{T-1}_{i=0} \beta^{T-1-i}\parallel x_{i}\parallel \leq R(1-\beta^{T}). $$
(24)
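For completeness, the bound follows from Assumption 2 (∥xi∥ ≤ R) and a geometric series; assuming the moving average is initialized at zero, the expansion reads

$$ \begin{array}{@{}rcl@{}} g_{T} &=& (1-\beta)\sum\limits^{T-1}_{i=0} \beta^{T-1-i}\parallel x_{i}\parallel \;\leq\; (1-\beta)R\sum\limits^{T-1}_{i=0} \beta^{T-1-i}\\ &=& (1-\beta)R\,\frac{1-\beta^{T}}{1-\beta} \;=\; R(1-\beta^{T}). \end{array} $$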

As the value of β < 1, for a large T,

$$ R(1-\beta^{T}) \approx R \leq T. $$
(25)

Theorem 7

Suppose f(x) satisfies Assumptions 1, 2 and 3, WSAGrad is run in the batch setting with a positive step size sequence, and \(f^{*} = \inf_{x}f(x) > -\infty\); then

$$ \lim_{T\rightarrow \infty} {\min}_{t=0:T}\parallel \nabla f(x_{t}) \parallel= 0. $$
(26)

Proof

From (9), we have the following:

$$ f(x_{t+1}) \leq f(x_{t})+ \langle \nabla f(x_{t}), x_{t+1}-x_{t} \rangle +\frac{L}{2}\parallel x_{t+1}-x_{t}\parallel^{2}. $$
(27)

Using the update equation given in Algorithm 1, we have the following:

$$ f(x_{t+1}) \leq f(x_{t})+ \left( \frac{L}{2}\eta_{t}-1\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2}. $$
(28)

Rearranging the above equation, we obtain the following:

$$ \left( \frac{L}{2}\eta_{t}-1\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2} \leq f(x_{t+1})- f(x_{t}). $$
(29)

Summing from t = 0 to T − 1, telescoping, and using f(xT) ≥ f(x∗), we obtain

$$ \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2} \leq f(x_{0})- f(x^{*}). $$
(30)

Thus,

$$ \begin{array}{@{}rcl@{}} {\min}_{t=0:T} \parallel \nabla f(x_{t})\parallel^{2} \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t} &\leq& \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t}\parallel \nabla f(x_{t})\parallel^{2}\\ &\leq& f(x_{0})- f(x^{*}), \end{array} $$
(31)

which is equivalent to the following:

$$ {\min}_{t=0:T} \parallel \nabla f(x_{t}) \parallel^{2} \leq \frac{f(x_{0})- f(x^{*})}{{\sum}^{T-1}_{t=0}(1-\frac{L}{2}\eta_{t})\eta_{t} }. $$
(32)

We can bound the denominator of (32) as follows:

$$ \sum\limits^{T-1}_{t=0}\left( 1-\frac{L}{2}\eta_{t}\right)\eta_{t} \leq T\left( 1-\frac{L}{2} \eta_{z}\right)\eta_{z}, $$
(33)

where \(\eta_{z}= \max_{t}(\eta_{t})\) and \(\max_{t}(\eta_{t})=\gamma_{1}+\gamma_{2}\). Rewriting the above inequality provides a bound on the minimum squared gradient norm over T steps:

$$ {\min}_{t=0:T} \parallel \nabla f(x_{t}) \parallel^{2} \leq \frac{f(x_{0})- f(x^{*})}{T(1-\frac{L}{2} \eta_{z})\eta_{z} }. $$
(34)

Letting T → ∞ on both sides, the right-hand side tends to zero whenever the constant coefficient \((1-\frac{L}{2}\eta_{z})\eta_{z}\) is positive, which establishes (26).

4 Empirical study

We study the performance of our algorithm on multiclass classification problems. The experiments span different architectures whose complexity varies from basic to advanced. We use SGD, SGDNesterov and ADAM as baselines on the FMNIST, CIFAR-10 and CIFAR-100 datasets. Each method's performance is evaluated with the best setting of manually tuned hyperparameters and reported over multiple restarts.

4.1 Synthetic experiment

To compare the performance of WSAGrad with that of ADAM, we consider a simple convex function from [18].

$$ f_{t}(x) = \begin{cases} 1010x & \text{with probability } 0.01 \\ -10x & \text{otherwise,} \end{cases} $$
(35)

with the constraint set F = [−1, 1]. As stated in [18], the optimal solution is x = −1; thus, for convergence, we expect the algorithms to converge to x = −1. For this sequence of functions, we investigate the average regret, calculated as in (36), and the value of the iterate xt for ADAM and WSAGrad.

$$ R(T)= \frac{1}{T} \sum\limits_{t=1}^{T}[f_{t}(x_{t})- f_{t}(x^{*})] $$
(36)

To enable a fair comparison, we set β1 = 0.9 and β2 = 0.99, which are typical settings for ADAM. For WSAGrad, we set β = 0.99. Figure 1 shows the average regret and the value of xt at iteration t. From the two plots in Fig. 1, we can conclude that WSAGrad converges to xt = −1, which is the optimal value of the parameter, whereas ADAM converges near zero. Therefore, the average regret of WSAGrad is lower than that of ADAM.
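A minimal sketch of how this synthetic experiment could be reproduced for WSAGrad is given below; the projection onto F = [−1, 1], the iteration count and the γ values are our illustrative choices rather than the paper's exact setup, and ADAM can be run analogously with the update sketched in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_ft(x):
    """Stochastic gradient of f_t in (35): 1010 w.p. 0.01, -10 otherwise."""
    return 1010.0 if rng.random() < 0.01 else -10.0

def project(x):
    """Projection onto the constraint set F = [-1, 1]."""
    return float(np.clip(x, -1.0, 1.0))

# WSAGrad on the synthetic problem; beta = 0.99 as in the text, gammas illustrative
x, g = 0.0, 0.0
beta, gamma1, gamma2 = 0.99, 0.001, 0.0001
for t in range(1, 50001):
    g = beta * g + (1 - beta) * abs(x)                    # moving average of |x_t|
    step = gamma1 * np.sin(np.pi * g / (2 * g + 1)) + gamma2
    x = project(x - step * grad_ft(x))
print("final iterate:", x)   # should drift towards the optimum x = -1
```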

Fig. 1
figure 1

(a) Average regret w.r.t. iterations of ADAM and WSAGrad. (b) Value of xt w.r.t. iterations of ADAM and WSAGrad

4.2 Neural network

With nonconvex objective functions, multilayer neural networks, CNNs, VGG, and ResNet are some of the most powerful deep learning models. Although our convergence analysis may not fully apply to nonconvex problems, we empirically observe that WSAGrad outperforms other methods on the classification task. We use MLP, CNN, VGG, and ResNet architectures for the classification task on three different datasets. An overview of our experiments is presented in Table 2.

Table 2 Overview of experiments: our experiments span classic datasets along with traditional and modern architectures

4.2.1 Multilayer perceptron

We use two configurations for our experiments: one with sigmoid in all layers and another with SoftMax in the last layer and sigmoid in the rest. To measure the performance of our algorithm, we prefer sigmoid over other activation types: using sigmoid in all layers creates vanishing and exploding gradient problems in the initial and final layers, respectively, which can stall the learning task for all algorithms.

FMNIST

Fashion-MNIST (FMNIST) [51] is a dataset consisting of a training set of 60000 example images and a test set of 10000 examples. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels, and each pixel value is an integer between 0 and 255. For our experiments, we normalize the values. Our neural network contains two hidden layers of sizes 256 and 128 and an output layer of size 10. We use the cross-entropy function as the loss function with l2 regularization and the accuracy metric to measure the performance of the different algorithms.

We repeat the experiment multiple times with different seeds, using a batch size of 256 for the SoftMax configuration and 64 for the all-sigmoid configuration. For all experiments, we decay the learning rate after every 1000th iteration with a decay factor of 0.7. We then plot the best results for all algorithms under a consistent configuration. For SGD and its variant, we consider 0.1 a suitable learning rate; for ADAM and its variant, the suitable learning rate is 0.001, with β1 = 0.9 and β2 chosen from {0.99, 0.999}. We manually tune the parameter β of our algorithm to obtain the best set of results for the defined models. WSAGrad provides the best result for β = 0.5.
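For illustration, a rough PyTorch sketch of this FMNIST MLP setup is shown below; the SGD baseline is used as the optimizer (WSAGrad would replace it with the update sketched in Section 3.1), and the weight-decay value, like the rest of the data pipeline, is our assumption rather than the authors' configuration.

```python
import torch
import torch.nn as nn

# two hidden layers (256, 128) with sigmoid activations and a 10-way output
# (SoftMax configuration: CrossEntropyLoss applies log-softmax internally)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.Sigmoid(),
    nn.Linear(256, 128), nn.Sigmoid(),
    nn.Linear(128, 10),
)

# cross-entropy loss; l2 regularization via weight_decay (value is illustrative)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# decay the learning rate by a factor of 0.7 every 1000 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.7)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()          # scheduler stepped per iteration, as described above
    return loss.item()
```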

CIFAR-10

The CIFAR-10 [52, 53] dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order; some training batches may contain more images from one class than another, but together they contain exactly 5000 images from each class. The classes are completely mutually exclusive. For the experiments, normalization was performed. We use a neural network with hidden layers of sizes 655 and 256 and an output layer of size 10. We use the cross-entropy function as the loss function with l2 regularization and the accuracy metric to measure the performance of the different algorithms. We repeat the experiments multiple times with a batch size of 128 for both configurations with different seeds. The rest of the configuration is the same as for the FMNIST dataset, and the best results are plotted. WSAGrad gives the best result for β = 0.6.

CIFAR-100

This dataset [52] is similar to CIFAR-10, except that it has 100 classes containing 600 images each; there are 500 training images and 100 test images per class. The reason for using CIFAR-100 is its increased complexity, since limited data are available per class for training. To test the authenticity of our algorithm, we use both the sigmoid and SoftMax configurations. The dataset is difficult, and using sigmoid makes learning harder. For this specific dataset, we use a dropout of 0.2 just after the first layer and 0.5 after the second layer. The rest of the configuration and preprocessing are similar to those of CIFAR-10. WSAGrad gives the best result for β = 0.6.
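A minimal sketch of this CIFAR-100 MLP variant with the stated dropout rates is given below; the hidden-layer sizes are carried over from the CIFAR-10 configuration, which is our assumption.

```python
import torch.nn as nn

# CIFAR-100 MLP variant: dropout 0.2 after the first hidden layer and 0.5 after
# the second; hidden sizes follow the CIFAR-10 configuration (our assumption)
mlp_cifar100 = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 655), nn.Sigmoid(), nn.Dropout(p=0.2),
    nn.Linear(655, 256), nn.Sigmoid(), nn.Dropout(p=0.5),
    nn.Linear(256, 100),
)
```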

The average accuracy of the MLP for different optimizers on the above datasets is shown in Table 3.

Table 3 Mean accuracy with standard deviation of the different optimizers for MLP on different datasets

4.2.2 Deep networks

CNN Our CNN architecture has two alternating stages of 5 × 5 convolutional filters and 3 × 3 max pooling with stride two, followed by a fully connected layer of 128 sigmoid units. The channel sizes are 32 and 16, respectively. The batch size is 128. Cross-entropy with l2 regularization is used as the loss. For CIFAR-10, we use the same configuration with a few changes: in one set of experiments, we replace the sigmoid units of the last layer with SoftMax; in another set, we keep sigmoid units in the last layer; and in a third set, we use ReLU in the fully connected layers. For CIFAR-100, we use the architecture with all sigmoids.
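A rough PyTorch sketch of this CNN is given below, assuming 32x32 RGB inputs (e.g., CIFAR-10), valid (no-padding) convolutions and sigmoid activations after each convolutional stage; none of these details is fully specified above, so they are our assumptions.

```python
import torch.nn as nn

# two conv + pool stages (channels 32 then 16), then an FC layer of 128 sigmoid units;
# the flattened size 16*4*4 follows from 32x32 inputs with no padding (our assumption)
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(32, 16, kernel_size=5), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 128), nn.Sigmoid(),
    nn.Linear(128, 10),          # swap the final activation for the SoftMax/sigmoid/ReLU variants
)
```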

VGG on CIFAR-10

VGGNet [54] is another advanced computer-vision architecture that followed the CNN; the structure increases the depth of CNNs. We use the VGG-11 architecture for our experiment, as shown in Fig. 6. It consists of 11 weighted layers. The weights represent the strength of the connections between units in adjacent network layers; weights close to zero mean that changing the input will not change the output, or that the change will be negligible. The primary reason for using VGG-11 in our experiments is the vanishing gradient problem, which occurs when the weights shrink substantially towards zero, significantly impacting the learning task. The normal distribution is used to initialize the weights. An overview of the architecture is given below.

We conduct our experiments on the above architecture with different variations, tested on different batch sizes with fine-tuned learning rates. The variations of the architecture are created by replacing the activation function of the fully connected last layers. The reason for testing on different batch sizes is that a large batch induces smoothness, while a smaller batch does not; therefore, training on small batches is more difficult than on larger batches.

ResNet18 on CIFAR-10

ResNet [55] is a more advanced architecture than VGG. It allows residual connections between layers, which helps it overcome the vanishing gradient problem. We modify the architecture by replacing the activation type of the last layer. The network in Table 8 is tested on CIFAR-10 with different batch sizes and finely tuned hyperparameters. Cross-entropy with l2 regularization is used as the loss. The hyperparameters for the baselines and the proposed method are shown in Table 2. The value of γ1 is selected based on stability and a low error rate; it is determined through multiple sets of experiments on different values of γ1, as shown in Fig. 8. The most suitable choices of γ1 for this architecture were {0.3, 0.5}, as can be observed from Fig. 8. We select 0.5 over 0.3 due to its better stability and accuracy.

4.3 Results and discussion

For each set of experiments, we plot the best results and report the mean and the standard deviation of the algorithms' performance in Tables 3, 4, 5, 6, 7, 8 and 9. For MLP, the performance of WSAGrad is equivalent to or greater than that of the existing state-of-the-art techniques.

Table 4 Mean accuracy with standard deviation of the different optimizers on a convolutional net with sigmoid and SoftMax activation functions on different datasets
Table 5 Mean accuracy with standard deviation of the different optimizers on a convolutional net with ReLU and SoftMax activation functions on different datasets
Table 6 Mean accuracy with standard deviation and execution time on VGG-11 for CIFAR-10: the accuracy is averaged over multiple restarts with linear activation in the last layer and a batch size of 128
Table 7 Mean accuracy with standard deviation on VGG-11 for CIFAR-10: The accuracy is averaged over multiple restarts with different batch sizes and activation types; for SoftMax with batch size 32, we consider the value of {γ1,γ2}= {0.05,0.01}; for batch size 128, {γ1,γ2}= {0.1,0.01} and β = 0.1
Table 8 Architectures of ResNet-18 on CIFAR-10
Table 9 Mean accuracy with standard deviation of ResNet-18 on CIFAR-10: the accuracy is averaged over multiple restarts with different batch sizes and activation types

From Figs. 2 and 3, we find that the performance of WSAGrad is better than that of ADAM and AdamW. Additionally, from Fig. 2, we observe that ADAM, AdamW, and SGDNesterov converge to a nearby neighbourhood; nevertheless, the generalization capability of SGDNesterov suffers. This is due to the continuous addition of bias in the update rule when the momentum parameter is large: a large momentum parameter prefers past updates to the current one. In theory, the performance of SGDNesterov is independent of initialization, but in practice, we observe that the performance of SGD and SGDNesterov depends heavily on the initialization and the complexity of the model. Furthermore, we observe that the performance of WSAGrad increases smoothly with successive iterations, as reflected in Figs. 2 and 3, without affecting the model's generalization capability. We observe similar behavior of our algorithm for all the models and datasets used for empirical validation.

Fig. 2
figure 2

Results for the FMNIST MLP sigmoid configuration: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Fig. 3
figure 3

Results for CIFAR-100 with the MLP sigmoid configuration: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

It is worth noting from Table 3 that the mean accuracy of our algorithm is higher than that of the top baselines, except for MLP with SoftMax, where ADAM is equivalent to or greater than WSAGrad. A possible reason is that MLP with SoftMax forces a one-winner structure, even when it has access to a limited set of features. For example, training a neural network creates sets of groups across the different layers and within the layers, as proved in [29]. Each group carries information about a set of features; using SoftMax in the last layer induces a one-winner relationship between groups. This relationship improves the generalization capability when the dataset has either a higher number of variables or a lower number of instances. However, when we run WSAGrad on MLP with SoftMax, the deep relation between the groups and the geometric information of the data carried by the neural network is lacking. This is not the case when MLP is used with sigmoid, as it induces a multiple-winner structure over the neural network that carries enough geometric information for WSAGrad. For the deep networks, we perform multiple sets of experiments with different architectures; the baselines and the proposed method are tested with these architectures, and the results are reported here. For the CNN, the results are stated in Tables 4 and 5. From the results, we conclude that WSAGrad performs better for all activation types. The best performance of our proposed method is obtained under the ReLU variation; under this variation, for CIFAR-10 and CIFAR-100, the performance of SGDNesterov is equivalent to ours, followed by SGD and ADAM. Beyond raw performance, we observe that WSAGrad is more robust to a model change than the baselines. This property makes it more reliable, as we can achieve significantly better accuracy with a basic set of models and activations; in other words, WSAGrad can be trusted in the challenging task of designing a better model, loss function or optimizer. Figures 4 and 5 validate its ability to outperform other state-of-the-art methods. For advanced architectures such as VGG-11 (Fig. 6) and ResNet-18, we create different variations of the architectures with different activations and test them on different batch sizes. We use the VGG-11 architecture with SoftMax or a linear function as the activation in the last layer. We can observe from Table 7 that WSAGrad performs best under different variations and batch sizes. Moreover, we observe that VGG-11 with a small batch size is difficult to train, as it quickly leads to overfitting. For larger batch sizes, the performance depends on the activation type and hyperparameters. WSAGrad is worth using for training, as its performance does not fluctuate much under different settings. The results and execution times are reported in Table 6. Figure 7 depicts our findings: the behavior of WSAGrad is nonmonotonic, as it oscillates within a certain range. This type of oscillation is required to find a better minimum. WSAGrad forces the value of the objective function to increase so that it can reach better minima in subsequent iterations. The minima of a loss landscape lie just below suboptimal points [56], and most of the minima are on the same level. The absence of a gradient in such a region stagnates the performance of the optimizers and leads them to a minimum near the initialization point [57].
With WSAGrad, the exponential moving average of the weight parameters does not decrease quickly, even under small updates. Therefore, the value of the sine function does not become small, and a moderate step size is maintained. A moderate step size increases the objective value to a certain extent so that it can be decreased as needed in subsequent iterations. In addition, our step size depends on the weight parameters, which are tuned at every iteration; thus, with each iteration, the chance of reaching a better minimum increases. ResNet is designed to handle the problems of VGG-11. The sensitivity of the hyperparameters for ResNet-18 is tested and shown in Fig. 8, and the results for different variations and batch sizes are reported in Table 9. The best results are shown in Fig. 9. We observe that WSAGrad performs well when SoftMax is used in the final layer; when we switch to ReLU, ADAM dominates WSAGrad. From Fig. 9, we also note its nonmonotone behavior.

Fig. 4
figure 4

Results for CIFAR-100 with the CNN sigmoid configuration: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Fig. 5
figure 5

Results for CIFAR-10 with CNN: the top figures show the training loss, testing loss and testing accuracy with SoftMax in the last layer for WSAGrad and the baselines. The bottom figures show the training loss, testing loss and testing accuracy with the sigmoid configuration for WSAGrad and the baselines

Fig. 6
figure 6

Architecture of VGG-11 on CIFAR-10

Fig. 7
figure 7

Results for CIFAR-10 with VGG11: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Fig. 8
figure 8

Sensitivity of the hyperparameter for ResNet18: (a) Testing loss w.r.t. epoch and (b) testing accuracy w.r.t. epoch for different values of the hyperparameter

Fig. 9
figure 9

Results for CIFAR-10 with ResNet18: (a) Training Loss w.r.t. epoch for baselines and WSAGrad. (b) Testing Loss w.r.t. epoch for baselines and WSAGrad. (c) Testing Accuracy w.r.t. epoch for baselines and WSAGrad

Overall, we observe that WSAGrad outperforms the other algorithms on the classification task across different architectures, batch sizes and activation types. SGDNesterov performs better than or on par with ADAM for FMNIST and CIFAR-10. Additionally, we emphasize that the test accuracy across the models and datasets for all the algorithms is relatively low compared with the accuracy achievable with different architectures: we refrain from using any other methods that can smooth out the loss landscape and thereby mask the algorithms' actual performance. Nevertheless, we assert that WSAGrad outperforms the baselines under proper initialization of the parameters. On CIFAR-10, the accuracy achieved by WSAGrad is better than that of all the baselines under the different activation-type settings.

5 Conclusion and future work

The present work introduces a novel step size to mitigate the vanishing gradient problem under extreme nonconvexity when training deep neural networks. The experimental findings under various conditions show that the proposed method is considerably better than the existing baselines for the classification task; even for a simple convex problem, we have shown its preeminence. The strength of our algorithm lies in its relationship with the geometry of the data and the loss function: the proposed step size carries the geometric information and maps it as required, nullifying the scope of divergence. The effectiveness of the proposed method can be observed in an overall average performance gain of 3-4% on the standard metric (accuracy) for the classification task. With sigmoid activation, the average performance gain of WSAGrad for the basic architectures, i.e., MLP and CNN, is nearly 5-6%; for advanced architectures such as VGG and ResNet, we obtained an average performance gain of 1-2% with SoftMax activation. We expect the proposed method to perform well on similar objective tasks such as language modeling, neural translation, image reconstruction and many other applications in various domains, and we plan to conduct experiments on diversified datasets in those domains as future work. Additionally, it is essential to rigorously understand the behavior of such methods and to be aware of potential pitfalls when using them in practice. We believe this paper is a first step in this direction and suggests a good design for faster and better optimization.