1 Introduction

Stochastic gradient descent (SGD) [37] has been the workhorse for solving large-scale machine learning (ML) problems. It gives rise to a family of algorithms that enable efficient training of many ML models, including deep neural nets (DNNs). SGD utilizes training data very efficiently at the beginning of the training phase, as it converges much faster than GD and L-BFGS during this period [8, 16]. Moreover, the variance of SGD can help gradient-based optimization algorithms circumvent local minima and saddle points and reach minima that generalize well [18, 38]. However, the variance of SGD also slows down the convergence after the first few training epochs. To account for the effect of SGD's variance and to ensure convergence, a decaying step size has to be applied, which is one of the major bottlenecks for the fast convergence of SGD [7, 40, 41]. Moreover, in training many ML models, a stage-wise learning rate schedule is typically used in practice [38, 39]. In this scenario, the variance of SGD usually leads to a large optimality gap.

Motivated by numerical evidence of these bottlenecks of SGD in deep learning, we answer the following question in this work: Can we improve SGD such that the variance of the stochastic gradient is reduced on-the-fly with negligible extra computational and memory overhead, and a larger step size is allowed to train ML models?

We answer the above question affirmatively by applying the discrete one-dimensional Laplacian smoothing (LS) operator to smooth the stochastic gradient vector on-the-fly. The LS operation can be performed efficiently by using the fast Fourier transform (FFT).

1.1 Our contribution

In this paper, we propose a new modification to stochastic gradient-based algorithms, which at its core uses the LS operator to reduce the variance of the stochastic gradient vector on-the-fly. The (stochastic) gradient smoothing can be done by multiplying the gradient by the inverse of the following circulant convolution matrix

$$\begin{aligned} {\varvec{A}}_\sigma := \begin{bmatrix} 1+2\sigma & -\sigma & 0 & \dots & 0 & -\sigma \\ -\sigma & 1+2\sigma & -\sigma & \dots & 0 & 0 \\ 0 & -\sigma & 1+2\sigma & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -\sigma & 0 & 0 & \dots & -\sigma & 1+2\sigma \end{bmatrix} \end{aligned}$$
(1)

for some constant \(\sigma \ge 0\). In fact, we can write \({\varvec{A}}_\sigma = {\varvec{I}}-\sigma {\varvec{L}}\), where \({\varvec{I}}\) is the identity matrix and \({\varvec{L}}\) is the discrete one-dimensional Laplacian, which acts on the indices. If we define the (periodic) forward finite difference matrix as

$$\begin{aligned} {\varvec{D}}_+= \begin{bmatrix} -1 & 1 & 0 & \dots & 0 & 0 \\ 0 & -1 & 1 & \dots & 0 & 0 \\ 0 & 0 & -1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 0 & 0 & \dots & 0 & -1 \end{bmatrix}, \end{aligned}$$

then, we have \({\varvec{A}}_\sigma = {\varvec{I}}- \sigma {\varvec{D}}_-{\varvec{D}}_+\), where \({\varvec{D}}_- = -{\varvec{D}}_+^\top \) is the backward finite difference.
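As a quick sanity check, the identity \({\varvec{A}}_\sigma = {\varvec{I}}- \sigma {\varvec{D}}_-{\varvec{D}}_+\) can be verified numerically for a small m; the following PyTorch sketch (ours, with function names that are not part of the paper) builds both sides explicitly:

import torch

def build_A(m, sigma):
    """Build A_sigma = I - sigma * D_- * D_+ with periodic (circulant) differences."""
    I = torch.eye(m)
    D_plus = -I + torch.roll(I, shifts=1, dims=1)   # (D_+ w)_i = w_{i+1} - w_i, periodic
    D_minus = -D_plus.T                             # backward difference
    return I - sigma * (D_minus @ D_plus)

A = build_A(6, sigma=1.0)
# Expected circulant matrix from (1): 1 + 2*sigma on the diagonal, -sigma on the two off-diagonals.
expected = 3.0 * torch.eye(6) - torch.roll(torch.eye(6), 1, dims=1) - torch.roll(torch.eye(6), -1, dims=1)
print(torch.allclose(A, expected))  # True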

We summarize the benefits of this simple LS operation below:

  • It reduces the variance of the stochastic gradient on-the-fly and reduces the optimality gap when a constant step size is used.

  • It allows us to take a larger step size than the standard (S)GD.

  • It is applicable to training many ML models, including DNNs, with better generalization.

  • Numerically, it converges faster for strongly convex quadratic objective functions that have a large condition number.

  • It is more robust to the noisy gradient.

Moreover, as a straightforward extension, we generalize the LS to high-order smoothing operators, e.g., biharmonic smoothing.

1.2 Related work

There has been an extensive volume of research over the past decades on designing algorithms to speed up convergence. These include using momentum and other heavy-ball methods, reducing the variance of the stochastic gradient, and adapting the learning rate.

The first type of idea to accelerate the convergence of GD and SGD is to apply momentum. Around local optima, the loss surface can be much steeper in one dimension than in another [43], whence (S)GD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum was proposed to accelerate (S)GD in the relevant direction and dampen the oscillations [34]. The Nesterov accelerated gradient (NAG) was introduced to slow down the progress before the surface curves up, and it provably converges faster in specific scenarios [31].

Due to the bottleneck caused by the variance of the stochastic gradient, a natural idea is to reduce this variance. There are several principles for developing variance reduction algorithms [8]: dynamic sample size methods; gradient aggregation, where control variate-type techniques are widely used and representative works include SAGA [11], SCSG [24], and SVRG [19]; and iterative averaging methods.

Another category of work tries to speed up the convergence of GD and SGD by using an adaptive step size, which makes use of the historical gradients to adapt the step size. RMSProp [44] and Adagrad [13] adapt the learning rate to the parameters, performing smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features and larger updates (i.e., higher learning rates) for parameters associated with infrequent features. Both RMSProp and Adagrad make the learning rate depend on the historical gradients. Adadelta [48] extends the idea of RMSProp and Adagrad; instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size w. Adam [21] and AdaMax [21] behave like a heavy ball with friction; they compute decaying averages of past gradients and past squared gradients to adapt the learning rate. AMSGrad [36] fixes the issue that Adam may fail to converge to an optimal solution. Adam can be viewed as a combination of RMSProp and momentum: RMSProp contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients. Since NAG is superior to vanilla momentum, Dozat [12] proposed NAdam, which combines the ideas of Adam and NAG.

1.3 Notations

Throughout this paper, we use boldface upper-case letters \({\varvec{A}}\), \({\varvec{B}}\) to denote matrices and boldface lower-case letters \({\varvec{w}}\), \({\varvec{u}}\) to denote vectors. We use \(\Vert \cdot \Vert \) to denote the \(\ell _2\)-norm of a vector and the spectral norm of a matrix. We use \(\lambda _{max}({\varvec{A}})\), \(\lambda _{min}({\varvec{A}})\), and \(\lambda _i({\varvec{A}})\) to denote the largest, smallest, and i-th largest eigenvalues, respectively, of a given Hermitian matrix \({\varvec{A}}\). For a function \(f: \mathbb {R}^n\rightarrow \mathbb {R}\), we use \(\nabla f\) and \(\nabla ^2 f\) to denote its gradient and Hessian, and \(f^*\) to denote a local minimum value of f. For a positive definite matrix \({\varvec{A}}\), we define the induced vector norm as \(\Vert {\varvec{w}}\Vert _{\varvec{A}}:=\sqrt{\langle {\varvec{w}}, {\varvec{A}}{\varvec{w}}\rangle }\). The index set \(\{1, 2, \cdots , n\}\) is denoted by [n].

1.4 Organization

We organize this paper as follows: In Sect. 2, we introduce the LS-(S)GD algorithm and the FFT-based fast solver. In Sect. 3, we show that LS-(S)GD allows us to take a larger step size than (S)GD. In Sect. 4, we show that LS reduces the variance of SGD both empirically and theoretically. We show that LS-GD can avoid some local minima and speed up convergence numerically in Sect. 5. In Sect. 6, we show the benefit of LS in deep learning, including training LeNet [23], ResNet [17], Wasserstein generative adversarial nets (WGANs) [27], and a deep reinforcement learning (DRL) model. The convergence analysis for LS-(S)GD is provided in Sect. 7, and most of the technical proofs are deferred to the appendix.

2 Laplacian smoothing (stochastic) gradient descent

We present our algorithm for SGD in the finite-sum setting. The GD and other settings follow straightforwardly. Consider the following finite-sum optimization

$$\begin{aligned} \min _{{\varvec{w}}} F({\varvec{w}}) :=\frac{1}{n}\sum _{i=1}^nf_i({\varvec{w}}), \end{aligned}$$
(2)

where \(f_i({\varvec{w}}) \doteq f({\varvec{w}}, {\varvec{x}}_i, y_i)\) is the loss of a given ML model on the training data \(\{{\varvec{x}}_i, y_i\}\). This finite-sum formalism is an abstraction of training the many ML models mentioned above. To solve the optimization problem (2), starting from some initial guess \({\varvec{w}}^0\), the \((k+1)\)-th iteration of SGD reads

$$\begin{aligned} {\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta _k \nabla f_{i_k}({\varvec{w}}^k), \end{aligned}$$
(3)

where \(\eta _k\) is the step size and \(i_k\) is drawn uniformly at random, with replacement, from [n].

We propose to replace the stochastic gradient \(\nabla f_{i_k}({\varvec{w}}^k)\) by the Laplacian smoothed surrogate, and we call the resulting algorithm LS-SGD, which is written as

$$\begin{aligned} {\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta _k {\varvec{A}}_\sigma ^{-1} \nabla f_{i_k}({\varvec{w}}^k). \end{aligned}$$
(4)

Intuitively, compared to the standard (S)GD, this scheme smooths the gradient on-the-fly by an elliptic smoothing operator while preserving the mean of the entries of the gradient. We adopt the fast Fourier transform (FFT) to compute \({\varvec{A}}_\sigma ^{-1}\nabla f({\varvec{w}}^{k})\), which is available in both PyTorch [33] and TensorFlow [2]. Given a vector \({\varvec{g}}\), a smoothed vector \({\varvec{d}}\) can be obtained by computing \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\). This is equivalent to \({\varvec{g}}={\varvec{d}}-\sigma {\varvec{v}}*{\varvec{d}}\), where \({\varvec{v}}= [-2, 1, 0, \cdots , 0, 1]^\top \) and \(*\) is the convolution operator. Therefore,

$$\begin{aligned} {\varvec{d}}= \mathrm{ifft}\left( \frac{\mathrm{fft}({\varvec{g}})}{\mathbf {1} -\sigma \cdot \mathrm{fft}({\varvec{v}})}\right) , \end{aligned}$$

where the division is component-wise (here, \(\mathrm{fft}\) and \(\mathrm{ifft}\) denote the FFT and inverse FFT, respectively). Hence, the gradient smoothing can be done in quasilinear time; this additional cost is comparable to that of a one-step update on the weight vector \({\varvec{w}}\). For many machine learning models, we may need to concatenate the parameters into a vector. This reshaping might lead to some ambiguity; nevertheless, based on our tests, both row-major and column-major reshaping work for the LS-GD algorithm. Moreover, in deep learning, the weights in different layers might have different physical meanings. In these cases, we instead perform layer-wise gradient smoothing. We summarize LS-SGD for solving the finite-sum optimization (2) in Algorithm 1.

Algorithm 1 Laplacian smoothing stochastic gradient descent (LS-SGD)
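For concreteness, here is a minimal PyTorch sketch of the FFT-based smoothing step and one LS-SGD update; the function names and the layer-wise loop are ours, a sketch of Algorithm 1 rather than the authors' reference implementation:

import torch

def laplacian_smooth(g, sigma):
    """Return d = A_sigma^{-1} g for a 1D tensor g, via the FFT formula above."""
    m = g.numel()
    v = torch.zeros(m, dtype=g.dtype, device=g.device)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0                 # kernel of the periodic 1D Laplacian
    denom = 1.0 - sigma * torch.fft.fft(v)             # eigenvalues of A_sigma (all >= 1)
    return torch.fft.ifft(torch.fft.fft(g) / denom).real

def ls_sgd_step(params, lr, sigma):
    """One LS-SGD update (4), applied layer-wise to the gradients stored in p.grad."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad.reshape(-1)   # flatten; row- or column-major order both work per the text above
            d = laplacian_smooth(g, sigma) if g.numel() >= 3 else g
            p -= lr * d.reshape(p.shape)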

Remark 1

The Sobolev gradient [20], used in image processing and elsewhere, applies a multi-dimensional Laplacian operator that acts on \({\varvec{w}}\) itself; it differs from the one-dimensional discrete Laplacian employed in our LS-GD scheme, which acts on the indices of \({\varvec{w}}\).

It is worth noting that LS is complementary to momentum and to adaptive learning rate algorithms such as Adagrad. It can be combined with these acceleration techniques to speed up the convergence. We will show the performance of these algorithms in Sect. 6.

2.1 Generalized smoothing gradient descent

We can generalize \({\varvec{A}}_\sigma \) to the n-th-order discrete hyper-diffusion operator as follows

$$\begin{aligned} {\varvec{I}}+ (-1)^n \sigma {\varvec{L}}^n \doteq {\varvec{A}}_\sigma ^n. \end{aligned}$$

Each row of the discrete Laplacian operator \({\varvec{L}}\) consists of an appropriate arrangement of weights in central finite difference approximation to the second-order derivative. Similarly, each row of \({\varvec{L}}^n\) is an arrangement of the weights in the central finite difference approximation to the 2n-th-order derivative.

Remark 2

The n-th-order smoothing operator \({\varvec{I}}+ (-1)^n \sigma {\varvec{L}}^n\) can only be applied to the problem with dimension at least \(2n+1\). Otherwise, we need to add dummy variables.

Again, we apply FFT to compute the smoothed gradient vector. For a given gradient vector \({\varvec{g}}\), the smoothed surrogate, \(({\varvec{A}}_\sigma ^n)^{-1}{\varvec{g}}\doteq {\varvec{d}}\), can be obtained by solving \({\varvec{g}}={\varvec{d}}+(-1)^n\sigma {\varvec{v}}_n*{\varvec{d}}\), where \({\varvec{v}}_n = (c^n_{n+1}, c^n_{n+2}, \cdots , c^n_{2n+1}, 0, \cdots , 0, c^n_1,\) \( c^n_2, \cdots , c^n_{n-1}, c^n_n)\) is a vector of the same dimension as the gradient to be smoothed. And the coefficient vector \({\varvec{c}}^n = (c^n_1, c^n_2, \cdots , c^n_{2n+1})\) can be obtained recursively by the following formula:

$$\begin{aligned} {\varvec{c}}^1 = (1, -2, 1), \quad c^n_i = {\left\{ \begin{array}{ll} 1 & \quad i = 1, 2n+1,\\ -2c^{n-1}_1+c^{n-1}_2 & \quad i = 2, 2n,\\ c^{n-1}_{i-2}-2c^{n-1}_{i-1} + c^{n-1}_{i} & \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
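A short Python sketch (ours) that builds \({\varvec{c}}^n\) by repeatedly convolving with (1, -2, 1), which reproduces the recursion above, and places it into the circular kernel \({\varvec{v}}_n\):

def stencil(n):
    """c^n: central finite difference weights of the 2n-th derivative,
    obtained by convolving (1, -2, 1) with itself n times."""
    c = [1.0, -2.0, 1.0]
    for _ in range(n - 1):
        out = [0.0] * (len(c) + 2)
        for i, ci in enumerate(c):          # full convolution with (1, -2, 1)
            out[i] += ci
            out[i + 1] += -2.0 * ci
            out[i + 2] += ci
        c = out
    return c                                # e.g. stencil(2) == [1, -4, 6, -4, 1]

def kernel(n, m):
    """v_n of length m: c^n placed circularly with its central weight at index 0."""
    c, v = stencil(n), [0.0] * m
    for k, ck in enumerate(c):
        v[(k - n) % m] += ck
    return v                                # kernel(1, m) == [-2, 1, 0, ..., 0, 1]

The smoothed gradient is then obtained, as before, by dividing in Fourier space: \({\varvec{d}}= \mathrm{ifft}\big (\mathrm{fft}({\varvec{g}})/(\mathbf {1} + (-1)^n\sigma \,\mathrm{fft}({\varvec{v}}_n))\big )\).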

Remark 3

The computational complexities for different order smoothing schemes are the same when the FFT is utilized for computing the surrogate gradient.

3 The choice of step size

In this section, we discuss the choice of step size for LS-(S)GD, with a theoretical focus on LS-GD applied to functions with L-Lipschitz gradient.

Definition 1

We say the function F has L-Lipschitz gradient, if for any \({\varvec{w}}, {\varvec{u}}\in \mathbb {R}^m\), we have \(\Vert \nabla F({\varvec{w}}) - \nabla F({\varvec{u}})\Vert \le L\Vert {\varvec{w}}- {\varvec{u}}\Vert \).

For functions with L-Lipschitz gradient, it is known that the largest suitable step size for GD is \(\eta _{max}^{GD} = 1/L\) [32]. In the following, we establish an \(\ell _2\) estimate of the square root of the LS operator applied to an arbitrary vector. Based on this estimate, we show that LS-GD can take a larger step size than GD.

To determine the largest suitable step size for LS-GD, we first perform a change of variables in the LS-GD iteration (4) by letting \({\varvec{y}}^k = {\varvec{H}}^{-1}_\sigma {\varvec{w}}^k\), where \({\varvec{H}}_\sigma ={\varvec{A}}_\sigma ^{-1/2}\); then LS-GD can be written as

$$\begin{aligned} {\varvec{y}}^{k+1} = {\varvec{y}}^k - \eta _k {\varvec{H}}_\sigma \nabla F({\varvec{H}}_\sigma {\varvec{y}}^k), \end{aligned}$$
(5)

which is actually the GD for solving the following minimization problem

$$\begin{aligned} \min _{\varvec{y}}F({\varvec{H}}_\sigma {\varvec{y}}) := \min _{{\varvec{y}}} G({\varvec{y}}). \end{aligned}$$
(6)

Therefore, determining the largest suitable step size for LS-GD is equivalent to finding the largest suitable step size of GD for \(\min _{{\varvec{w}}} G({\varvec{w}})\) (here we use \({\varvec{w}}\) instead of \({\varvec{y}}\) for ease of notation).

3.1 \(\ell _2\) estimates

First, for any \({\varvec{w}}, {\varvec{u}}\in {\mathbb {R}}^m\), suppose the function F has L-Lipschitz gradient, i.e.,

$$\begin{aligned} \Vert \nabla F({\varvec{w}}) - \nabla F({\varvec{u}})\Vert \le L\Vert {\varvec{w}}-{\varvec{u}}\Vert . \end{aligned}$$

Then

$$\begin{aligned} \Vert \nabla G({\varvec{w}}) - \nabla G({\varvec{u}})\Vert&= \Vert {\varvec{H}}_\sigma (\nabla F({\varvec{w}}) - \nabla F({\varvec{u}}))\Vert \\&\le \Vert {\varvec{H}}_\sigma \Vert \Vert \nabla F({\varvec{w}})-\nabla F({\varvec{u}})\Vert \\&\le \Vert \nabla F({\varvec{w}})-\nabla F({\varvec{u}})\Vert , \end{aligned}$$

where the last inequality holds because all eigenvalues of \({\varvec{H}}_\sigma \) are at most one. This estimate indicates that we can use a step size of at least 1/L in GD for searching the optimum of the function G, i.e., we can use a step size of at least 1/L in LS-GD for searching the optimum of the function F.

Next, we establish a high probability estimation for taking a larger step size when using LS-GD. Without any prior knowledge about \({\varvec{v}}:=\nabla F({\varvec{w}}) - \nabla F({\varvec{u}})\), let us assume it is sampled uniformly from a ball in \(\mathbb {R}^m\) centered at the origin. Without loss of generality, we assume the radius of this ball is one. Under the above ansatz, we have the following result

Theorem 1

(\(\ell _2\)-estimate) Let \(\sigma >0\), and

$$\begin{aligned} \beta = \frac{1}{m}\sum _{i=1}^m \frac{1}{|1+2\sigma -\sigma z_i - \sigma \overline{z_i}| }, \end{aligned}$$

where \(z_1\), \(\cdots \), \(z_m\) are the m roots of unity. Let \({\varvec{v}}\) be uniformly distributed in the unit ball of the m-dimensional \(\ell _2\) space. Then for any \(\alpha > \frac{\sqrt{\beta }}{1-\frac{\pi }{\sqrt{m}}}\), we have

$$\begin{aligned} \mathbb {P}\left( \Vert {\varvec{H}}_\sigma {\varvec{v}}\Vert \ge \alpha \Vert {\varvec{v}}\Vert \right) \le 2\exp {\left( -\frac{2}{\pi ^2}m\left( \frac{\alpha -\alpha \frac{\pi }{\sqrt{m}} -\sqrt{\beta } }{\alpha +1} \right) ^2 \right) }. \end{aligned}$$
(7)

The proof of this theorem is provided in the appendix. For high-dimensional ML problems, e.g., training DNNs, m can be as large as tens of millions, so that the right-hand side of (7) is essentially zero, i.e., \(\Vert {\varvec{H}}_\sigma {\varvec{v}}\Vert < \alpha \Vert {\varvec{v}}\Vert \) holds with probability almost one. The closed form of \(\beta \) is given in Lemma 1.

Lemma 1

If \(z_1,\ldots ,z_m\) denote the m roots of unity, then

$$\begin{aligned} \beta = \frac{1}{m}\sum _{j=1}^{m} \frac{1}{|1+2\sigma -\sigma z_j - \sigma \bar{z_j}| } = \frac{1+\omega ^m}{(1-\omega ^m)\sqrt{1+4\sigma }} \rightarrow _{m\rightarrow \infty } \frac{1}{\sqrt{1+4\sigma }}, \end{aligned}$$
(8)

where \(\omega = \frac{2\sigma +1 - \sqrt{1+4\sigma }}{2\sigma } < 1.\)

The proof of the above lemma needs some tools from complex and harmonic analysis, which is provided in the appendix. Table 1 lists some typical values for different \(\sigma \) and m.

Table 1 The values of \(\beta \) corresponding to some \(\sigma \) and m

Based on Theorem 1, with high probability LS-GD can take a step size as large as \(\frac{1}{\sqrt{\beta } L}\) for high-dimensional functions with L-Lipschitz gradient. We will verify this numerically later.
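The closed form in Lemma 1 (and the entries of Table 1) can be reproduced with a few lines of Python; the sketch below is ours and uses the identity \(|1+2\sigma -\sigma z_j-\sigma \bar{z}_j| = 1+4\sigma \sin ^2(\pi j/m)\) for the m-th roots of unity:

import math

def beta_direct(sigma, m):
    # beta = (1/m) * sum_j 1/|1 + 2*sigma - sigma*z_j - sigma*conj(z_j)|
    #      = (1/m) * sum_j 1/(1 + 4*sigma*sin^2(pi*j/m))
    return sum(1.0 / (1.0 + 4.0 * sigma * math.sin(math.pi * j / m) ** 2)
               for j in range(m)) / m

def beta_closed_form(sigma, m):
    # Lemma 1
    w = (2.0 * sigma + 1.0 - math.sqrt(1.0 + 4.0 * sigma)) / (2.0 * sigma)
    return (1.0 + w ** m) / ((1.0 - w ** m) * math.sqrt(1.0 + 4.0 * sigma))

print(beta_direct(1.0, 1000), beta_closed_form(1.0, 1000))   # both approach 1/sqrt(5) ~ 0.447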

4 Variance reduction

The variance of SGD is one of the major bottlenecks that slow down the theoretically guaranteed convergence rate in training ML models. Most of the existing variance reduction algorithms require either the full batch gradient or storage of the stochastic gradient for each data point, which makes them difficult to use for training high-capacity DNNs. LS is an alternative approach to reduce the variance of the stochastic gradient with negligible extra computational time and memory cost. In this section, we rigorously show that LS reduces the variance of the stochastic gradient and reduces the optimality gap under a Gaussian noise assumption. Moreover, we numerically verify our theoretical results on both a quadratic function and a simple finite-sum optimization problem.

4.1 Gaussian noise scenario

The stochastic gradient \(\nabla f_{i_k}\), for any \(i_k \in [n]\), is an unbiased estimate of \(\nabla F\); many existing works model the deviation of the stochastic gradient from the full batch gradient \(\nabla F\) as Gaussian noise \(\mathcal {N}(\mathbf {0}, \Sigma )\), where \(\Sigma \) is the covariance matrix [28]. Therefore, suppressing the variable \({\varvec{w}}\) for simplicity of notation, we can relate the gradient and the stochastic gradient as

$$\begin{aligned} \nabla f_{i_k} = \nabla F + {\varvec{n}}, \end{aligned}$$
(9)

where \({\varvec{n}}\sim \mathcal {N}(\mathbf {0}, \Sigma )\). Thus, for LS stochastic gradient, we have

$$\begin{aligned} {\varvec{A}}_\sigma ^{-1}\nabla f_{i_k} = {\varvec{A}}_\sigma ^{-1}\left( \nabla F + {\varvec{n}}\right) . \end{aligned}$$
(10)

The variances of the stochastic gradient and the LS stochastic gradient are essentially the variances of \({\varvec{n}}\) and \({\varvec{A}}_\sigma ^{-1}{\varvec{n}}\), respectively. The following theorem quantifies the ratio between the variances of \({\varvec{A}}_\sigma ^{-1}{\varvec{n}}\) and \({\varvec{n}}\).

Theorem 2

Let \(\kappa \) denote the condition number of \(\Sigma \). Then, for m-dimensional Gaussian random vector \({\varvec{n}}\sim \mathcal {N}(\mathbf {0}, \Sigma )\), we have

$$\begin{aligned} \frac{\sum _{i=1}^m \mathrm {Var}[\left( ({\varvec{A}}_\sigma ^n)^{-1} {\varvec{n}}\right) _i]}{\sum _{i=1}^m\mathrm {Var}[ \left( {\varvec{n}}\right) _i ]} \le 1 - \frac{1}{\kappa } + \frac{1}{\kappa m}\sum _{j=0}^{m}\frac{1}{[1+4^n\sigma \sin ^{2n}(\pi j/m)]^2}. \end{aligned}$$
(11)

The proof of Theorem 2 will be provided in the appendix.

Table 2 lists the theoretical upper bound on the ratio of the variances after and before LS for an m-dimensional standard normal vector \({\varvec{n}}\sim \mathcal {N}(\mathbf {0}, {\varvec{I}})\). In practice, high-order smoothing reduces the variance more significantly.

Table 2 Theoretical upper bound of \(\sum _{i=1}^m \mathrm {Var}[\left( ({\varvec{A}}_\sigma ^n)^{-1} {\varvec{n}}\right) _i]/\sum _{i=1}^m\mathrm {Var}[ \left( {\varvec{n}}\right) _i ]\) when \(\mathbf {n}\) is an m-dimensional standard normal vector with \(m\ge 10000\)
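The right-hand side of (11), which for \(\kappa = 1\) (i.e., \(\Sigma = {\varvec{I}}\)) gives the entries of Table 2, can be evaluated directly; a short sketch of ours:

import math

def variance_ratio_bound(sigma, m, n=1, kappa=1.0):
    """Right-hand side of (11); the sum runs over j = 0, ..., m as displayed."""
    s = sum(1.0 / (1.0 + 4.0 ** n * sigma * math.sin(math.pi * j / m) ** (2 * n)) ** 2
            for j in range(m + 1))
    return 1.0 - 1.0 / kappa + s / (kappa * m)

print(variance_ratio_bound(sigma=1.0, m=10000, n=1))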

Moreover, LS preserves the mean (Proposition 2) and decreases the largest component and increases the smallest component (Proposition 1) of any vector.

Proposition 1

For any vector \({\varvec{g}}\in \mathbb {R}^m\), \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\), let \(j_{\max } = \arg \max _i d_i\) and \(j_{\min } = \arg \min _i d_i \). We have \(\max _i d_i = d_{j_{\max }} \le g_{j_{\max }} \le \max _i g_i\) and \(\min _i d_i = d_{j_{\min }} \ge g_{j_{\min }} \ge \min _i g_i\).

Proof

Since \({\varvec{g}}= {\varvec{A}}_{\sigma }{\varvec{d}}\), it holds that

$$\begin{aligned} g_{j_{\max }} = d_{j_{\max }} + \sigma ( 2 d_{j_{\max }} - d_{j_{\max }-1} - d_{j_{\max }+1}), \end{aligned}$$

where periodicity of the index is used if necessary. Since \( 2 d_{j_{\max }} - d_{j_{\max }-1} - d_{j_{\max }+1}\ge 0\), we have \(\max _i d_i = d_{j_{\max }} \le g_{j_{\max }} \le \max _i g_i\). Similarly, we can show \(\min _i d_i = d_{j_{\min }} \ge g_{j_{\min }} \ge \min _i g_i\). \(\square \)

Proposition 2

The operator \({\varvec{A}}_\sigma ^{-1}\) preserves the sum of components. For any \({\varvec{g}}\in \mathbb {R}^m\) and \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\), we have \(\sum _j d_j = \sum _j g_j\), or equivalently, \(\mathbf {1}^\top {\varvec{d}}= \mathbf {1}^\top {\varvec{g}}\).

Proof

Since \({\varvec{g}}= {\varvec{A}}_\sigma {\varvec{d}}\),

$$\begin{aligned} \sum _i g_i = \mathbf {1}^\top \mathbf {g} = \mathbf {1}^\top ({\varvec{I}}+ \sigma {\varvec{D}}_+^\top {\varvec{D}}_+){\varvec{d}}= \mathbf {1}^\top {\varvec{d}}= \sum _i d_i, \end{aligned}$$

where we used \({\varvec{D}}_+\mathbf {1} = \mathbf {0}\). \(\square \)
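These two propositions are easy to check numerically on a random vector; below is a self-contained sketch of ours, using the same FFT solve as in Sect. 2:

import torch

m, sigma = 1000, 1.0
g = torch.randn(m)
v = torch.zeros(m); v[0], v[1], v[-1] = -2.0, 1.0, 1.0
d = torch.fft.ifft(torch.fft.fft(g) / (1.0 - sigma * torch.fft.fft(v))).real

print(torch.allclose(d.sum(), g.sum(), atol=1e-4))          # sum (hence mean) preserved, Proposition 2
print(bool(d.max() <= g.max()), bool(d.min() >= g.min()))   # extremes pulled inward, Proposition 1
print(bool(d.var() < g.var()))                              # and the variance shrinks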

4.2 Reduce the optimality gap

A direct benefit of variance reduction is that it reduces the optimality gap in SGD when constant step size is applied.

Proposition 3

Suppose f is convex with a global minimizer \({\varvec{w}}^*\), and \(f^* = f({\varvec{w}}^*)\). Consider the following iteration with constant learning rate \(\eta >0\)

$$\begin{aligned} {\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta ({\varvec{A}}_\sigma ^n)^{-1} {\varvec{g}}^k \end{aligned}$$

where \({\varvec{g}}^k\) is the sampled gradient in the k-th iteration at \({\varvec{w}}^k\) satisfying \(\mathbb {E}[{\varvec{g}}^k] = \nabla f({\varvec{w}}^k)\). Denote \(G_{{\varvec{A}}_\sigma ^n}: = \frac{1}{K}\sum _{k=0}^{K-1}{\mathbb {E}}\Vert {\varvec{g}}^k\Vert ^2_{({\varvec{A}}_\sigma ^n)^{-1}}\) and \({\overline{{\varvec{w}}}}^K := \sum _{k=0}^{K-1} {\varvec{w}}^k /K\) the ergodic average of iterates. Then the optimality gap after K-th iteration is

$$\begin{aligned} \mathbb {E}[f({\overline{{\varvec{w}}}}^K)] - f^* \le {\frac{1}{2\eta K} \mathbb {E}[\Vert {\varvec{w}}^{0} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2]} + \frac{\eta G_{{\varvec{A}}_\sigma ^n}}{2}. \end{aligned}$$

Proof

Since f is convex, we have

$$\begin{aligned} \langle \nabla f({\varvec{w}}^k), {\varvec{w}}^{k} - {\varvec{w}}^* \rangle \ge f({\varvec{w}}^k) - f^*. \end{aligned}$$
(12)

Furthermore,

$$\begin{aligned}&\; \mathbb {E}[\Vert {\varvec{w}}^{k+1} -{\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}^2] = \mathbb {E}[\Vert {\varvec{w}}^{k} - \eta ({\varvec{A}}_\sigma ^n)^{-1}{\varvec{g}}^k -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] \\&\quad = \; \mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert ^2_{{\varvec{A}}_\sigma ^n}] - 2\eta \mathbb {E}[\langle {\varvec{g}}^k, {\varvec{w}}^k - {\varvec{w}}^* \rangle ]+ \eta ^2\mathbb {E}[\Vert ({\varvec{A}}_\sigma ^n)^{-1}{\varvec{g}}^{{k}}\Vert _{{\varvec{A}}_\sigma ^n}^2] \\&\quad \le \; \mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] - 2\eta \mathbb {E}[\langle \nabla f({\varvec{w}}^k), {\varvec{w}}^k - {\varvec{w}}^* \rangle ]+ \eta ^2{\mathbb {E}} \Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 \\&\quad \le \; \mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] - 2\eta (\mathbb {E}[f({\varvec{w}}^k)]-f^*) + \eta ^2 {\mathbb {E}}\Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2, \end{aligned}$$

where the last inequality is due to (12). We rearrange the terms and arrive at

$$\begin{aligned} \mathbb {E}[ f({\varvec{w}}^k)]-f^*\le&\, \frac{1}{2\eta }(\mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] - \mathbb {E}[\Vert {\varvec{w}}^{k+1} -{\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}^2]) + \frac{\eta {\mathbb {E}} \Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 }{2}. \end{aligned}$$

Summing over k from 0 to \(K-1\) and averaging and using the convexity of f, we have

$$\begin{aligned} \mathbb {E}[f({\overline{{\varvec{w}}}}^K)] - f^*&\le \; \frac{\sum _{k=0}^{K-1} \mathbb {E}[f({\varvec{w}}^k)]}{K} - f^*\\&\le \frac{1}{2\eta K} \mathbb {E}[\Vert {\varvec{w}}^{0} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] + \frac{\sum _{k=0}^{K-1}{\mathbb {E}}\Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 }{2K} \eta . \end{aligned}$$

\(\square \)

Remark 4

Since \(G_{{\varvec{A}}_\sigma ^n}\) is smaller than the corresponding value without LS, it shows that the optimality gap is reduced when LS is used with a constant step size. In practice, this is also true for the stage-wise step size since it is a constant in each stage of the training phase.

4.2.1 Optimization for quadratic function with noise corrupted gradient

In this part, we empirically show the advantages of LS-(S)GD and its generalized schemes for convex optimization problems with noisy gradients. Consider searching for the minimizer \({\varvec{x}}^*\) of the quadratic function \(f({\varvec{x}})\) defined in (13).

$$\begin{aligned} f(x_1, x_2, \cdots , x_{100}) = \sum _{i=1}^{50}x_{2i-1}^2 + \sum _{i=1}^{50} \frac{x_{2i}^2}{10^2}. \end{aligned}$$
(13)

Here, we consider the gradient with Gaussian noise injection, i.e., at any given point \({\varvec{x}}\), we have

$$\begin{aligned} {\tilde{\nabla }}_\epsilon f({\varvec{x}}) := \nabla f({\varvec{x}}) + \epsilon \mathcal {N}(\mathbf {0}, {\varvec{I}}), \end{aligned}$$

where the scalar \(\epsilon \) controls the noise level and \(\mathcal {N}(\mathbf {0}, {\varvec{I}})\) is a Gaussian noise vector with zero mean and unit variance in each coordinate. The corresponding numerical schemes can be formulated as

$$\begin{aligned} {\varvec{x}}^{k+1} = {\varvec{x}}^k - \eta _k ({\varvec{A}}_\sigma ^n)^{-1} {\tilde{\nabla }}_\epsilon f({\varvec{x}}^k), \end{aligned}$$
(14)

where \(\sigma \) is the smoothing parameter, selected to be 10.0 here to remove the intense noise. We take diminishing step sizes with initial values 0.1 for SGD/smoothed SGD, and 0.9 and 1.8 for GD/smoothed GD, respectively. Without noise, the smoothing allows us to take larger step sizes; rounding to the first digit, 0.9 and 1.8 are the largest suitable step sizes for GD and its smoothed version here. We study both a constant learning rate and an exponentially decaying learning rate, i.e., after every 1000 iterations the learning rate is divided by 10. We apply the schemes corresponding to \(n=0, 1, 2\) in (14) to the problem (13), with the initial point \({\varvec{x}}^0=(1, 1, \cdots , 1)\).
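A condensed sketch (ours) of this experiment with first-order smoothing and a constant step size; the exact schedules described above may differ:

import torch

m, sigma, eps, lr = 100, 10.0, 1.0, 0.9
coef = torch.tensor([1.0 if i % 2 == 0 else 0.01 for i in range(m)])   # coefficients 1 and 1/10^2, as in (13)

v = torch.zeros(m); v[0], v[1], v[-1] = -2.0, 1.0, 1.0
denom = 1.0 - sigma * torch.fft.fft(v)        # n = 1 smoothing; replace v by v_n for higher order

x = torch.ones(m)                             # initial point (1, 1, ..., 1)
for k in range(2000):
    g = 2.0 * coef * x + eps * torch.randn(m)                 # noisy gradient of (13)
    x = x - lr * torch.fft.ifft(torch.fft.fft(g) / denom).real
print(float((coef * x * x).sum()))            # optimality gap f(x) - f^*, with f^* = 0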

Figure 1 shows the iteration versus the optimality gap when a constant learning rate is used. In the noise-free case, all three schemes converge linearly. When there is noise, our smoothed gradient helps to reduce the optimality gap and converges faster after a few iterations.

Fig. 1 Iterations versus optimality gap for GD and smoothed GD with order 1 and order 2 smoothing for the problem in (13). Constant step size is used

The exponentially decaying learning rate helps our smoothed SGD to reach a point with a smaller optimality gap, and the higher-order smoothing further reduces the optimality gap, as shown in Fig. 2. This is due to the noise removal properties of the smoothing operators.

Fig. 2 Iterations versus optimality gap for GD and smoothed GD with order 1 and 2 smoothing for the problem in (13). Exponentially decaying step size is utilized here

4.2.2 Find the center of multiple points

Consider finding the center of a given set of 5K points \(\{\mathbf {x}_i\in \mathbb {R}^{50}\}_{i=1}^{5000}\). It can be formulated as the following finite-sum optimization

$$\begin{aligned} \min _{\varvec{x}}F({\varvec{x}}) := \frac{1}{N}\sum _{i=1}^N f_i({\varvec{x}}) = \frac{1}{N}\sum _{i=1}^N \Vert {\varvec{x}}_i - {\varvec{x}}\Vert ^2. \end{aligned}$$
(15)

We solve this optimization problem by running either SGD or LS-SGD for 20K iterations starting from the same random initial point with batch size 20. The initial step size is set to 1.0 and 1.2 for SGD and LS-SGD, respectively, and decays by a factor of 1.1 after every 10 iterations. As the learning rate decays, the variance of the stochastic gradient decays [46]; thus, we decay \(\sigma \) by a factor of 10 after every 1K iterations. Figure 3a plots a 2D cross section of the trajectories of SGD and LS-SGD, and it shows that the trajectory of SGD is noisier than that of LS-SGD. Figure 3b plots the iteration versus loss for both SGD and LS-SGD averaged over 3 independent runs. LS-SGD converges faster than SGD and has a smaller optimality gap than SGD. This numerical result verifies our theoretical results on the optimality gap (Proposition 3).

Fig. 3 Left: 2D trajectories of SGD and LS-SGD. Right: Iteration versus loss for SGD and LS-SGD

4.2.3 Multi-class logistic regression

Consider applying the proposed optimization schemes to train a multi-class logistic regression model. We run 200 epochs of SGD and of the different-order smoothing algorithms to maximize the likelihood of multi-class logistic regression with batch size 100, and we apply an exponentially decaying learning rate with initial value 0.5, divided by 10 after every 50 epochs. We train the model with only \(10\%\) randomly selected MNIST training data and test the trained model on the entire set of test images. We further compare with SVRG under the same setting. Figure 4 shows the histograms of the generalization accuracy of the model trained by SGD (a); SVRG (b); LS-SGD (order 1) (c); LS-SGD (order 2) (d). It is seen that SVRG somewhat improves the generalization, with higher average accuracy. However, the first- and second-order LS-SGD type algorithms lift the average generalization accuracy by more than \(1\%\) and remarkably reduce the variance of the generalization accuracy over 100 independent trials.

4.3 Iteration versus loss

In this part, we show the evolution of the loss in training the multi-class logistic regression model by SGD, SVRG, and LS-SGD with first- and second-order smoothing, respectively, as illustrated in Fig. 5. At each iteration, among 100 independent experiments, SGD has the largest variance, and SGD with the first-order smoothed gradient significantly reduces the variance of the loss across experiments. The second-order smoothing further reduces the variance. The variance of the loss at each iteration among the 100 experiments is minimized when SVRG is used to train the multi-class logistic model. However, the generalization performance of the model trained by SVRG is not as good as that of the ones trained by LS-SGD or higher-order smoothed gradient descent (Fig. 4b).

Fig. 4 Histogram of testing accuracy over 100 independent experiments of the multi-class logistic regression model trained on randomly selected \(10\%\) MNIST data by different algorithms

Fig. 5 Iterations versus loss for SGD, SVRG, and LS-SGD with order 1 and order 2 gradient smoothing for training the multi-class logistic regression model

4.4 Variance reduction in stochastic gradient

We numerically verify the efficiency of variance reduction in this part. We simplify the problem by applying multi-class logistic regression only to the digits 1 and 2 of the MNIST training data. In order to compute the variance of the (LS)-stochastic gradients, we first compute the descent path of (LS)-GD by applying full batch (LS)-GD with learning rate 0.5 starting from the same random initialization. We record the full batch (LS)-gradient at each point along the descent path. Then we compute the (LS)-stochastic gradients at each point along the path by using different batch sizes and smoothing parameters \(\sigma \). In computing the (LS)-stochastic gradients, we run 100 independent experiments. We then compute the variance of the (LS)-stochastic gradient among these 100 experiments, regarding the full batch (LS)-gradient as the mean, at each point along the full batch (LS)-GD descent path. For each pair of batch size and \(\sigma \), we report the maximum variance over all the coordinates of the gradient and all the points along the descent path. We list the variance results in Table 3 (note that the case \(\sigma =0\) corresponds to SGD). These results show that, compared to SGD, LS-SGD with \(\sigma =3\) can reduce the maximum variance by a factor of roughly 100 for different batch sizes. It is worth noting that the high-order smoothing reduces more variance than the lower-order smoothing; this might be due to the fact that the noise of SGD is not Gaussian.

Table 3 The maximum variance of the stochastic gradient generated by LS-SGD with different \(\sigma \) and batch size. \(\sigma =0\) recovers the SGD

5 Numerical results on avoiding local minima and speeding up convergence

We first show that LS-GD can bypass sharp minima and reach the global minimum. We consider the following function, in which we 'drill' narrow holes into a smooth convex function,

$$\begin{aligned} f(x, y, z) = -4e^{-\left( (x-\pi )^2+(y-\pi )^2+(z-\pi )^2\right) } - 4\sum _{i}\cos (x)\cos (y)e^{-\beta \left( (x-r\sin (\frac{i}{2})-\pi )^2+(y-r\cos (\frac{i}{2})-\pi )^2\right) }, \end{aligned}$$
(16)

where the summation is taken over the index set \(\{i\in \mathbb {N}\,|\; 0 \le i < 4\pi \}\), and r and \(\beta \) are parameters that determine the location and narrowness of the local minima; they are set to 1 and \(\frac{1}{\sqrt{500}}\), respectively. We run GD and LS-GD starting from a random point in the neighborhoods of the narrow minima, i.e., \((x_0, y_0, z_0)\in \{\bigcup _i U_\delta (r\sin (\frac{i}{2})+\pi , r\cos (\frac{i}{2})+\pi , \pi )\,|\; 0\le i<4\pi , i\in \mathbb {N}\}\), where \(U_\delta (P)\) is the neighborhood of the point P with radius \(\delta \). Our experiments (Fig. 6) show that, if \(\delta \le 0.2\), GD converges to a narrow local minimum, while LS-GD converges to the wider global minimum.

Fig. 6 Demo of GD and LS-GD. a Depicts the slice of the function (16) with \(z=2.34\); b shows the paths of GD (red) and LS-GD (black). We take the step size to be 0.02 for both GD and LS-GD. \(\sigma =1.0\) is utilized for LS-GD

Next, let us compare LS-GD with some popular optimization methods on the benchmark 2D Rosenbrock function, which is non-convex. The global minimum is inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. The function is defined by

$$\begin{aligned} f(x, y) = (a-x)^2 + b(y-x^2)^2, \end{aligned}$$
(17)

It has a global minimum at \((x, y) = (a, a^2)\); we set \(a=1\) and \(b=100\) in the experiments.

Starting from the initial point \((-3, -4)\), we run 2K iterations of the following optimizers: GD, GD with Nesterov momentum [31], Adam [21], RMSProp [44], and LS-GD (\(\sigma =0.5\)). The step size used for all these methods is \(3\times 10^{-3}\). Figure 7 plots the iteration versus objective value, and it shows that GD together with Nesterov momentum converges faster than all the other algorithms; the second best algorithm is LS-GD. Meanwhile, Nesterov momentum can be used to speed up LS-GD, and we will show this numerically in training DNNs in Sect. 6.

Fig. 7 Iteration versus loss of different optimization algorithms in optimizing the Rosenbrock function

Furthermore, we show that LS-GD can be further accelerated by using Nesterov momentum. As shown in Fig. 8, LS-GD together with Nesterov momentum converges much faster than GD with momentum, especially for the high-dimensional Rosenbrock function.

Fig. 8 Iteration versus objective value for GD with Nesterov momentum and LS-GD with Nesterov momentum

6 Application to deep learning

6.1 Train neural nets with small batch size

Many advanced artificial intelligence tasks require training neural nets with extremely small batch sizes. The milestone technique for this is group normalization [47]. In this section, we show that LS-SGD successfully trains DNNs with an extremely small batch size. We consider LeNet-5 [23] for MNIST classification. Our network architecture is as follows:

$$\begin{aligned} \text{ LeNet-5: }\ \mathrm{input}_{28 \times 28} \rightarrow \mathrm{conv}_{20, 5, 2} \rightarrow \mathrm{conv}_{50, 5, 2} \rightarrow \mathrm{fc}_{512} \rightarrow \mathrm{softmax}. \end{aligned}$$

The notation \(\mathrm{conv}_{c, k, m}\) denotes a 2D convolutional layer with c output channels, each of which is the sum of a channel-wise convolution operation on the input using a learnable kernel of size \(k \times k\); it is followed by a ReLU nonlinearity and max pooling with stride m. \(\mathrm{fc}_{512}\) is an affine transformation that maps the input to a vector of dimension 512. Finally, the tensors are activated by a multi-class logistic function. The MNIST data are first passed to the layer \(\mathrm{input}_{28 \times 28}\) and further processed by this hierarchical structure. We run 100 epochs of both SGD and LS-SGD with initial learning rate 0.01, divided by 5 after 50 epochs, and use a weight decay of 0.0001 and momentum of 0.9. Figure 9a plots the generalization accuracy on the test set of LeNet-5 trained with different batch sizes. For each batch size, LS-SGD with \(\sigma =1.0\) keeps the test accuracy above \(99.4\%\), whereas SGD reduces the accuracy to \(97\%\) when batch size 4 is used, and the classification becomes just a random guess when the model is trained by SGD with batch size 2. A small batch size leads to large noise in the gradient, which may push the noisy gradient away from a descent direction; Laplacian smoothing rescues this by decreasing the noise.
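A PyTorch sketch of this LeNet-5 variant; the padding choices and the final 10-way linear layer (which we fold into the "softmax" stage) are our assumptions:

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Sketch of the LeNet-5 variant described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # conv_{20,5,2}
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # conv_{50,5,2}
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 4 * 4, 512), nn.ReLU(),   # fc_512 (28x28 input, no padding)
            nn.Linear(512, num_classes),             # logits; the softmax is folded into the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))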

Fig. 9 a Testing accuracy of LeNet5 trained by SGD/LS-SGD on MNIST with various batch sizes. b The evolution of the pre-activated ResNet56's training and generalization accuracy by SGD and LS-SGD (start from the 20-th epoch)

6.2 Improve generalization accuracy

On Cifar10 [22], we compare the performance of LS-SGD and SGD on ResNet, with the pre-activated ResNet56 as an illustration. We take the same training strategy as that used in [17], except that we run 200 epochs with the learning rate decaying by a factor of 5 after every 40 epochs. For ResNet, instead of applying LS-SGD for all epochs, we only use LS-SGD in the first 40 epochs, and the remaining training is carried out by SGD. (This saves the extra computational cost due to LS, and we noticed that the performance is similar to the case when LS is used for the whole training process.) The parameter \(\sigma \) is set to 1.0. Figure 9b depicts one path of the training and generalization accuracy of the neural nets trained by SGD and LS-SGD, respectively. It is seen that even though the training accuracy obtained by SGD is higher than that of LS-SGD, its generalization is inferior to that of LS-SGD. We repeat this experiment 25 times; the histograms of the corresponding accuracy are shown in Fig. 10.

Fig. 10 The histogram of the generalization accuracy of the pre-activated ResNet56 on Cifar10 trained by SGD and LS-SGD over 25 independent experiments

6.3 Training Wasserstein GAN

Generative adversarial networks (GANs) [15] are notoriously delicate and unstable to train [4]. In [27], Wasserstein GANs (WGANs) were introduced to combat the instability in training GANs. In addition to being more robust with respect to training parameters and network architecture, WGANs provide a reliable estimate of the Earth Mover (EM) metric, which correlates well with the quality of the generated samples. Nonetheless, WGAN training becomes unstable with a large learning rate or when used with a momentum-based optimizer [27]. In this section, we demonstrate that the gradient smoothing technique in this paper alleviates the instability in the training and improves the quality of generated samples. Since WGANs with weight clipping are typically trained with RMSProp [44], we propose replacing the gradient g by a smoothed version \(g_\sigma = {\varvec{A}}_\sigma ^{-1}g\) and also updating the running averages using \(g_\sigma \) instead of g. We name this algorithm LS-RMSProp.

To accentuate the instability in training and demonstrate the effects of gradient smoothing, we deliberately use a large learning rate for training the generator. We compare regular RMSProp with LS-RMSProp. The learning rate for the critic is kept small, and the critic is trained approximately to convergence so that the critic loss is still an effective approximation to the Wasserstein distance. To control the number of unknowns in the experiment and make a meaningful comparison using the critic loss, we use the classical RMSProp for the critic and only apply LS-RMSProp to the generator.
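A sketch (ours) of one LS-RMSProp step for the generator; the hyperparameter defaults are only illustrative, and in practice the smoothing is applied layer-wise as in LS-SGD:

import torch

def ls_rmsprop_step(params, state, lr=5e-3, sigma=1.0, alpha=0.99, eps=1e-8):
    """Smooth each layer's gradient with A_sigma^{-1}, then use the smoothed gradient
    g_sigma both for the running average of squared gradients and for the update."""
    with torch.no_grad():
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad.reshape(-1)
            if g.numel() >= 3:
                v = torch.zeros_like(g); v[0], v[1], v[-1] = -2.0, 1.0, 1.0
                g = torch.fft.ifft(torch.fft.fft(g) / (1.0 - sigma * torch.fft.fft(v))).real
            g = g.reshape(p.shape)
            sq = state.setdefault(i, torch.zeros_like(p))      # running average of g_sigma^2
            sq.mul_(alpha).addcmul_(g, g, value=1.0 - alpha)
            p.addcdiv_(g, sq.sqrt().add_(eps), value=-lr)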

Fig. 11 Critic loss with learning rate \(lrD = 0.0001\), \(lrG = 0.005\) for RMSProp (left) and LS-RMSProp (right), trained for 20K iterations. We apply a mean filter of window size 13 for better visualization. The loss from LS-RMSProp is visibly less noisy

Fig. 12 Samples from WGANs trained with RMSProp (a, c) and LS-RMSProp (b, d). The learning rate is set to \(lrD = 0.0001\), \(lrG = 0.005\) for both RMSProp and LS-RMSProp in a and b. And \(lrD = 0.0001\), \(lrG = 0.0001\) are used for both RMSProp and LS-RMSProp in c and d. The critic is trained for 5 iterations per step of the generator and 200 iterations per every 500 steps of the generator

We train the WGANs on the MNIST dataset using the DCGAN architecture [35] for both the critic and the generator. In Fig. 11 (left), we observe that the loss for RMSProp trained with a large learning rate has multiple sharp spikes, indicating instability in the training process. The samples generated are also lower in quality, containing noisy spots as shown in Fig. 12a. In contrast, the curve of the training loss for LS-RMSProp is smoother and exhibits fewer spikes. The generated samples, as shown in Fig. 12b, are also of better quality and visibly less noisy; the generated characters in Fig. 12b are more realistic than the ones shown in Fig. 12a. The effects are less pronounced with a small learning rate, but still result in a modest improvement in sample quality, as shown in Fig. 12c, d. We also apply LS-RMSProp for training the critic, but do not see a clear improvement in quality. This may be because the critic is already trained near optimality during each iteration and does not benefit much from gradient smoothing.

6.4 Deep reinforcement learning

Deep reinforcement learning (DRL) has been applied to playing games, including Cartpole [9], Atari [30], and Go [29, 42]. DNNs play a vital role in approximating the Q-function or the policy function. We apply the Laplacian smoothed gradient to train the policy function to play the Cartpole game, using the standard policy gradient procedure [9]. We use the following network to approximate the policy function:

$$\begin{aligned} \mathrm{input}_4 \rightarrow \mathrm{fc}_{20} \rightarrow \mathrm{relu} \rightarrow \mathrm{fc}_2 \rightarrow \mathrm{softmax}. \end{aligned}$$

The network is trained by RMSProp and by LS-RMSProp with \(\sigma =1.0\), respectively. The learning rate and other related parameters are set to the PyTorch defaults. Training is stopped once the average duration of 5 consecutive episodes exceeds 490; in each training episode, the maximum number of steps is set to 500. The left and right panels of Fig. 13 depict the training procedures using RMSProp and LS-RMSProp, respectively. We see that the Laplacian smoothed gradient takes fewer episodes to reach the stopping criterion. Moreover, we run the above experiment 5 times independently and apply the trained models to play Cartpole. The game lasts more than 1000 steps for all 5 models trained by LS-RMSProp, while only 3 of the models trained by vanilla RMSProp last more than 1000 steps.
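The policy network above, as a PyTorch sketch (layer names are ours):

import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(4, 20),     # input_4 -> fc_20
    nn.ReLU(),
    nn.Linear(20, 2),     # fc_2: logits for the two Cartpole actions
    nn.Softmax(dim=-1),   # action probabilities
)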

Fig. 13 Durations of the Cartpole game in the training procedure. Left and right are the training procedures by RMSProp and LS-RMSProp with \(\sigma =1.0\), respectively

7 Convergence analysis

Note that the LS matrix \({\varvec{A}}_\sigma ^{-1}\) is positive definite, and its largest and smallest eigenvalues are 1 and \(\frac{1}{1+4\sigma }\), respectively. It is straightforward to show that all the convergence results for (S)GD still hold for LS-(S)GD. In this section, we show some additional convergence results for LS-(S)GD with a focus on LS-GD; the corresponding results for LS-SGD follow in a similar way.

Proposition 4

Consider the algorithm \({\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta _k ({\varvec{A}}_\sigma ^n)^{-1} \nabla f({\varvec{w}}^k)\). Suppose f is L-Lipschitz smooth and \(0< {\tilde{\eta }} \le \eta _k \le {\bar{\eta }}< \frac{2}{L}\). Then \(\Vert \nabla f({\varvec{w}}^k)\Vert \rightarrow 0\) as \(k\rightarrow \infty \). Moreover, if the Hessian \(\nabla ^2 f\) of f is continuous and positive definite with \({\varvec{w}}^*\) being the minimizer of f, and \({\bar{\eta }}\Vert \nabla ^2 f\Vert <1\), then \(\Vert {\varvec{w}}^k-{\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}\rightarrow 0\) as \(k\rightarrow \infty \), and the convergence is linear.

Proof

By the Lipschitz continuity of \(\nabla f\) and the descent lemma [5], we have

$$\begin{aligned} f({\varvec{w}}^{k+1})&\; = f({\varvec{w}}^k - \eta _k({\varvec{A}}_\sigma ^n)^{-1}\nabla f({\varvec{w}}^k)) \\&\; \le f({\varvec{w}}^k) - \eta _k \langle \nabla f({\varvec{w}}^k), ({\varvec{A}}_\sigma ^n)^{-1}\nabla f({\varvec{w}}^k))\rangle + \frac{\eta ^2_k L}{2} \Vert ({\varvec{A}}_\sigma ^n)^{-1} \nabla f({\varvec{w}}^k)\Vert ^2 \\&\; \le f({\varvec{w}}^k) - \eta _k \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 + \frac{\eta ^2_k L}{2} \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 \\&\; \le f({\varvec{w}}^k) - {\tilde{\eta }}\left( 1-\frac{{\bar{\eta }} L}{2}\right) \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2. \end{aligned}$$

Summing the above inequality over k, we have

$$\begin{aligned} {\tilde{\eta }}\left( 1-\frac{{\bar{\eta }} L}{2}\right) \sum _{k=0}^{\infty } \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 \le f({\varvec{w}}^0) -\lim _{k\rightarrow \infty } f({\varvec{w}}^k)< \infty . \end{aligned}$$

Therefore, \( \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2\rightarrow 0\), and thus, \( \Vert \nabla f({\varvec{w}}^k)\Vert \rightarrow 0\).

For the second claim, we have

$$\begin{aligned}&{\varvec{w}}^{k+1} -{\varvec{w}}^* \\&\quad = \; {\varvec{w}}^k - {\varvec{w}}^* -\eta _k ({\varvec{A}}_\sigma ^n)^{-1}(\nabla f({\varvec{w}}^k) - \nabla f({\varvec{w}}^*)) \\&\quad = \; {\varvec{w}}^k - {\varvec{w}}^* -\eta _k ({\varvec{A}}_\sigma ^n)^{-1} \left( \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\cdot ({\varvec{w}}^k - {\varvec{w}}^*) \,\mathrm {d}\tau \right) \\&\quad = \; {\varvec{w}}^k - {\varvec{w}}^* -\eta _k ({\varvec{A}}_\sigma ^n)^{-1} \left( \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\,\mathrm {d}\tau \cdot ({\varvec{w}}^k - {\varvec{w}}^*)\right) \\&\quad = \; ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \left( {\varvec{I}}- \eta _k ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\,\mathrm {d}\tau \, ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}}\right) \\&\qquad \times ({\varvec{A}}_\sigma ^n)^{\frac{1}{2}} ({\varvec{w}}^k - {\varvec{w}}^*) \end{aligned}$$

Therefore,

$$\begin{aligned}&\Vert {\varvec{w}}^{k+1} - {\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n} \\&\quad \le \left\| {\varvec{I}}- \eta _k ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\,\mathrm {d}\tau \, ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \right\| \Vert {\varvec{w}}^k - {\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}. \end{aligned}$$

So if \(\eta _k \Vert \nabla ^2 f\Vert \le \frac{1}{\Vert ({\varvec{A}}_\sigma ^n)^{-1}\Vert } = 1\), the result follows. \(\square \)

Remark 5

The convergence result in Proposition 4 is also called \(H_\sigma ^n\)-convergence. This is because \(\langle {\varvec{u}}, {\varvec{A}}_\sigma ^n {\varvec{u}}\rangle = \Vert {\varvec{u}}\Vert ^2 + \sigma \Vert {\varvec{D}}_+^n {\varvec{u}}\Vert ^2 = \Vert {\varvec{u}}\Vert ^2_{H_\sigma ^n}\).

8 Discussion and conclusion

8.1 Some more properties of Laplacian smoothing

In Theorem 1, we established a high probability estimate of the LS operator's effect on the \(\ell _2\) norm of a given vector. An \(\ell _1\)-type high probability estimate can be established in the same way. These estimates are helpful for developing privacy-preserving optimization algorithms that train ML models with improved utility without sacrificing the privacy guarantee [45].

Regarding the \(\ell _1\)/\(\ell _2\) estimates of the LS operator, we further have the following results.

Proposition 8

Given vectors \({\varvec{g}}\) and \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\), for any \(p \in \mathbb {N}\), it holds that \( \Vert {\varvec{D}}_+^p{\varvec{d}}\Vert _1 \le \Vert {\varvec{D}}_+^p{\varvec{g}}\Vert _1. \) The inequality is strict unless \({\varvec{D}}_+^p {\varvec{g}}\) is a constant vector.

Proof

Observe that \({\varvec{A}}_\sigma \) and \({\varvec{D}}_+\) commute; therefore, for any \(p\in \mathbb {N}\), \({\varvec{A}}_\sigma ({\varvec{D}}_+^p{\varvec{d}}) = {\varvec{D}}_+^p{\varvec{g}}\). Thus, we have

$$\begin{aligned} (1+2\sigma )({\varvec{D}}_+^p {\varvec{d}})_i = ({\varvec{D}}_+^p {\varvec{g}})_i + \sigma ({\varvec{D}}_+^p {\varvec{d}})_{i+1} + \sigma ({\varvec{D}}_+^p{\varvec{d}})_{i-1}. \end{aligned}$$

So

$$\begin{aligned} (1+2\sigma )|({\varvec{D}}_+^p {\varvec{d}})_i| \le |({\varvec{D}}_+^p {\varvec{g}})_i| + \sigma |({\varvec{D}}_+^p {\varvec{d}})_{i+1}| + \sigma |({\varvec{D}}_+^p{\varvec{d}})_{i-1}|. \end{aligned}$$

The inequality is strict if there are sign changes among the \(({\varvec{D}}_+^p {\varvec{d}})_{i-1}\), \(({\varvec{D}}_+^p {\varvec{d}})_{i}\), \(({\varvec{D}}_+^p {\varvec{d}})_{i+1}\). Summing over i and using periodicity, we have

$$\begin{aligned} (1+2\sigma )\sum _{i=1}^m|({\varvec{D}}_+^p {\varvec{d}})_i| \le \sum _{i=1}^m|({\varvec{D}}_+^p {\varvec{g}})_i| + 2\sigma \sum _{i=1}^m|({\varvec{D}}_+^p {\varvec{d}})_{i}|, \end{aligned}$$

and the result follows. The inequality is strict unless \({\varvec{D}}_+^p {\varvec{g}}\) is a constant vector. \(\square \)

Proposition 5

Given any vector \({\varvec{g}}\in \mathbb {R}^m\) and \({\varvec{d}}=({\varvec{A}}_\sigma ^n)^{-1} {\varvec{g}}\), then

$$\begin{aligned} \Vert {\varvec{g}}\Vert ^2 = \Vert {\varvec{d}}\Vert ^2 + 2\sigma \Vert {\varvec{D}}_+^n {\varvec{d}}\Vert ^2 + \sigma ^2 \Vert {\varvec{L}}^n {\varvec{d}}\Vert ^2, \end{aligned}$$
(18)

hence the variance of \({\varvec{d}}\) is much less than that of \({\varvec{g}}\).

Proof

Observe that \({\varvec{g}}= {\varvec{A}}_\sigma ^n {\varvec{d}}= {\varvec{d}}+ (-1)^n\sigma {\varvec{L}}^n d\). Therefore,

$$\begin{aligned} \Vert {\varvec{g}}\Vert ^2= & {} \left\langle {\varvec{d}}+ (-1)^n\sigma {\varvec{L}}^n {\varvec{d}}, {\varvec{d}}+ (-1)^n\sigma {\varvec{L}}^n {\varvec{d}}\right\rangle \nonumber \\= & {} \Vert {\varvec{d}}\Vert ^2 + 2 (-1)^n \sigma \langle {\varvec{d}}, {\varvec{L}}^n {\varvec{d}}\rangle + \sigma ^2 \Vert {\varvec{L}}^n {\varvec{d}}\Vert ^2. \end{aligned}$$
(19)

Next, note that \({\varvec{D}}_-\) and \({\varvec{D}}_+\) commute; thus,

$$\begin{aligned} {\varvec{L}}^n = \underbrace{({\varvec{D}}_-{\varvec{D}}_+)\cdots ({\varvec{D}}_-{\varvec{D}}_+)}_{n} = \underbrace{{\varvec{D}}_-\cdots {\varvec{D}}_-}_{n} \underbrace{{\varvec{D}}_+\cdots {\varvec{D}}_+}_{n} = {\varvec{D}}_-^n {\varvec{D}}_+^n. \end{aligned}$$
(20)

Now, we have

$$\begin{aligned} \langle {\varvec{d}}, {\varvec{L}}^n {\varvec{d}}\rangle = \langle {\varvec{d}}, {\varvec{D}}_-^n {\varvec{D}}_+^n d \rangle = \langle ({\varvec{D}}_-^n)^T{\varvec{d}}, {\varvec{D}}_+^n {\varvec{d}}\rangle = \langle (-1)^n {\varvec{D}}_+^n {\varvec{d}}, {\varvec{D}}_+^n {\varvec{d}}\rangle = (-1)^n \Vert {\varvec{D}}_+^n {\varvec{d}}\Vert ^2,\nonumber \\ \end{aligned}$$
(21)

where we used (20) in the first equality and \({\varvec{D}}_-=-{\varvec{D}}_+^T\) in the second to last equality.

Substituting (21) into (19) yields (18). \(\square \)

8.2 Conclusion

In this paper, we proposed Laplacian smoothing gradient descent and its high-order generalizations. This simple modification dramatically reduces the variance and optimality gap in stochastic gradient descent, allows us to take a larger step size, and helps to find better minima. Extensive numerical examples ranging from toy cases and shallow and deep neural nets to generative adversarial networks and deep reinforcement learning all demonstrate the advantage of the proposed smoothed gradient.