1 Introduction

Stochastic gradient descent (SGD) [37] has been the workhorse for solving large-scale machine learning (ML) problems. It gives rise to a family of algorithms that enable efficient training of many ML models, including deep neural nets (DNNs). SGD utilizes training data very efficiently at the beginning of the training phase, as it converges much faster than GD and L-BFGS during this period [8, 16]. Moreover, the variance of SGD can help gradient-based optimization algorithms circumvent local minima and saddle points and reach minima that generalize well [18, 38]. However, the variance of SGD also slows down the convergence after the first few training epochs. To account for the effect of SGD's variance and to ensure convergence, a decaying step size has to be applied, which is one of the major bottlenecks for the fast convergence of SGD [7, 40, 41]. Moreover, in training many ML models, a stage-wise learning rate schedule is typically used in practice [38, 39]. In this scenario, the variance of SGD usually leads to a large optimality gap.

Motivated by numerical evidence of these bottlenecks of SGD in deep learning, we answer the following question in this work: Can we improve SGD such that the variance of the stochastic gradient is reduced on-the-fly with negligible extra computational and memory overhead, and a larger step size is allowed to train ML models?

We answer the above question affirmatively by applying the discrete one-dimensional Laplacian smoothing (LS) operator to smooth the stochastic gradient vector on-the-fly. The LS operation can be performed efficiently by using the fast Fourier transform (FFT).

1.1 Our contribution

In this paper, we propose a new modification to stochastic gradient-based algorithms, which at its core uses the LS operator to reduce the variance of the stochastic gradient vector on-the-fly. The (stochastic) gradient smoothing can be done by multiplying the gradient by the inverse of the following circulant convolution matrix

$$\begin{aligned} {\varvec{A}}_\sigma := \begin{bmatrix} 1+2\sigma & -\sigma & 0 & \dots & 0 & -\sigma \\ -\sigma & 1+2\sigma & -\sigma & \dots & 0 & 0 \\ 0 & -\sigma & 1+2\sigma & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -\sigma & 0 & 0 & \dots & -\sigma & 1+2\sigma \end{bmatrix} \end{aligned}$$
(1)

for some constant \(\sigma \ge 0\). In fact, we can write \({\varvec{A}}_\sigma = {\varvec{I}}-\sigma {\varvec{L}}\), where \({\varvec{I}}\) is the identity matrix and \({\varvec{L}}\) is the discrete one-dimensional Laplacian, which acts on the indices. If we define the (periodic) forward finite difference matrix as

$$\begin{aligned} {\varvec{D}}_+= \begin{bmatrix} -1 & 1 & 0 & \dots & 0 & 0 \\ 0 & -1 & 1 & \dots & 0 & 0 \\ 0 & 0 & -1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 0 & 0 & \dots & 0 & -1 \end{bmatrix}, \end{aligned}$$

then, we have \({\varvec{A}}_\sigma = {\varvec{I}}- \sigma {\varvec{D}}_-{\varvec{D}}_+\), where \({\varvec{D}}_- = -{\varvec{D}}_+^\top \) is the backward finite difference.
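As a quick sanity check, the identity \({\varvec{A}}_\sigma = {\varvec{I}}- \sigma {\varvec{D}}_-{\varvec{D}}_+\) can be verified numerically for a small m; the following PyTorch sketch (ours, with function names that are not part of the paper) builds both sides explicitly:

import torch

def build_A(m, sigma):
    """Build A_sigma = I - sigma * D_- * D_+ with periodic (circulant) differences."""
    I = torch.eye(m)
    D_plus = -I + torch.roll(I, shifts=1, dims=1)   # (D_+ w)_i = w_{i+1} - w_i, periodic
    D_minus = -D_plus.T                             # backward difference
    return I - sigma * (D_minus @ D_plus)

A = build_A(6, sigma=1.0)
# Expected circulant matrix from (1): 1 + 2*sigma on the diagonal, -sigma on the two off-diagonals.
expected = 3.0 * torch.eye(6) - torch.roll(torch.eye(6), 1, dims=1) - torch.roll(torch.eye(6), -1, dims=1)
print(torch.allclose(A, expected))  # True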

We summarize the benefits of this simple LS operation below:

  • It reduces the variance of the stochastic gradient on-the-fly and reduces the optimality gap when a constant step size is used.

  • It allows us to take a larger step size than the standard (S)GD.

  • It is applicable to training many ML models, including DNNs, with better generalization.

  • Numerically, it converges faster for strongly convex quadratic objective functions that have a large condition number.

  • It is more robust to the noisy gradient.

Moreover, as a straightforward extension, we generalize the LS to high-order smoothing operators, e.g., biharmonic smoothing.

1.2 Related work

There has been an extensive volume of research over the past decades on designing algorithms to speed up convergence. These include using momentum and other heavy-ball methods, reducing the variance of the stochastic gradient, and adapting the learning rate.

The first type of idea to accelerate the convergence of GD and SGD is to apply momentum. Around local optima, the loss surface can be much steeper in one dimension than in another [43], whence (S)GD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum was proposed to accelerate (S)GD in the relevant direction and dampen the oscillations [34]. The Nesterov accelerated gradient (NAG) was introduced to slow down the progress before the surface curves up, and it provably converges faster in specific scenarios [31].

Due to the bottleneck caused by the variance of the stochastic gradient, a natural idea is to reduce this variance. There are several principles for developing variance reduction algorithms [8]: dynamic sample size methods; gradient aggregation, where control variate-type techniques are widely used and representative works include SAGA [11], SCSG [24], and SVRG [19]; and iterative averaging methods.

Another category of work tries to speed up the convergence of GD and SGD by using an adaptive step size, which makes use of the historical gradients to adapt the step size. RMSProp [44] and Adagrad [13] adapt the learning rate to the parameters, performing smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features and larger updates (i.e., higher learning rates) for parameters associated with infrequent features. Both RMSProp and Adagrad make the learning rate depend on the historical gradients. Adadelta [48] extends the idea of RMSProp and Adagrad; instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to some fixed size w. Adam [21] and AdaMax [21] behave like a heavy ball with friction; they compute decaying averages of past gradients and past squared gradients to adapt the learning rate. AMSGrad [36] fixes the issue that Adam may fail to converge to an optimal solution. Adam can be viewed as a combination of RMSProp and momentum: RMSProp contributes the exponentially decaying average of past squared gradients, while momentum accounts for the exponentially decaying average of past gradients. Since NAG is superior to vanilla momentum, Dozat [12] proposed NAdam, which combines the ideas of Adam and NAG.

1.3 Notations

Throughout this paper, we use boldface upper-case letters \({\varvec{A}}\), \({\varvec{B}}\) to denote matrices and boldface lower-case letters \({\varvec{w}}\), \({\varvec{u}}\) to denote vectors. We use \(\Vert \cdot \Vert \) to denote the \(\ell _2\)-norm of a vector and the spectral norm of a matrix. We use \(\lambda _{max}({\varvec{A}})\), \(\lambda _{min}({\varvec{A}})\), and \(\lambda _i({\varvec{A}})\) to denote the largest, smallest, and i-th largest eigenvalues, respectively, of a given Hermitian matrix \({\varvec{A}}\). For a function \(f: \mathbb {R}^n\rightarrow \mathbb {R}\), we use \(\nabla f\) and \(\nabla ^2 f\) to denote its gradient and Hessian, and \(f^*\) to denote a local minimum value of f. For a positive definite matrix \({\varvec{A}}\), we define the induced vector norm as \(\Vert {\varvec{w}}\Vert _{\varvec{A}}:=\sqrt{\langle {\varvec{w}}, {\varvec{A}}{\varvec{w}}\rangle }\). The index set \(\{1, 2, \cdots , n\}\) is denoted by [n].

1.4 Organization

We organize this paper as follows: In Sect. 2, we introduce the LS-(S)GD algorithm and the FFT-based fast solver. In Sect. 3, we show that LS-(S)GD allows us to take a larger step size than (S)GD. In Sect. 4, we show that LS reduces the variance of SGD both empirically and theoretically. We show that LS-GD can avoid some local minima and speed up convergence numerically in Sect. 5. In Sect. 6, we show the benefit of LS in deep learning, including training LeNet [23], ResNet [17], Wasserstein generative adversarial nets (WGANs) [27], and a deep reinforcement learning (DRL) model. The convergence analysis for LS-(S)GD is provided in Sect. 7, and most of the technical proofs are deferred to the appendix.

2 Laplacian smoothing (stochastic) gradient descent

We present our algorithm for SGD in the finite-sum setting. The GD and other settings follow straightforwardly. Consider the following finite-sum optimization

$$\begin{aligned} \min _{{\varvec{w}}} F({\varvec{w}}) :=\frac{1}{n}\sum _{i=1}^nf_i({\varvec{w}}), \end{aligned}$$
(2)

where \(f_i({\varvec{w}}) \doteq f({\varvec{w}}, {\varvec{x}}_i, y_i)\) is the loss of a given ML model on the training data \(\{{\varvec{x}}_i, y_i\}\). This finite-sum formalism is an abstraction of training the many ML models mentioned above. To solve the optimization problem (2), starting from some initial guess \({\varvec{w}}^0\), the \((k+1)\)-th iteration of SGD reads

$$\begin{aligned} {\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta _k \nabla f_{i_k}({\varvec{w}}^k), \end{aligned}$$
(3)

where \(\eta _k\) is the step size and \(i_k\) is drawn uniformly at random, with replacement, from [n].

We propose to replace the stochastic gradient \(\nabla f_{i_k}({\varvec{w}}^k)\) by the Laplacian smoothed surrogate, and we call the resulting algorithm LS-SGD, which is written as

$$\begin{aligned} {\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta _k {\varvec{A}}_\sigma ^{-1} \nabla f_{i_k}({\varvec{w}}^k). \end{aligned}$$
(4)

Intuitively, compared to the standard (S)GD, this scheme smooths the gradient on-the-fly by an elliptic smoothing operator while preserving the mean of the entries of the gradient. We adopt the fast Fourier transform (FFT) to compute \({\varvec{A}}_\sigma ^{-1}\nabla f({\varvec{w}}^{k})\), which is available in both PyTorch [33] and TensorFlow [2]. Given a vector \({\varvec{g}}\), a smoothed vector \({\varvec{d}}\) can be obtained by computing \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\). This is equivalent to \({\varvec{g}}={\varvec{d}}-\sigma {\varvec{v}}*{\varvec{d}}\), where \({\varvec{v}}= [-2, 1, 0, \cdots , 0, 1]^\top \) and \(*\) is the convolution operator. Therefore,

$$\begin{aligned} {\varvec{d}}= \mathrm{ifft}\left( \frac{\mathrm{fft}({\varvec{g}})}{\mathbf {1} -\sigma \cdot \mathrm{fft}({\varvec{v}})}\right) , \end{aligned}$$

where the division is component-wise (here, \(\mathrm{fft}\) and \(\mathrm{ifft}\) denote the FFT and inverse FFT, respectively). Hence, the gradient smoothing can be done in quasilinear time; this additional cost is comparable to that of a one-step update on the weight vector \({\varvec{w}}\). For many machine learning models, we may need to concatenate the parameters into a vector. This reshaping might lead to some ambiguity; nevertheless, based on our tests, both row-major and column-major reshaping work for the LS-GD algorithm. Moreover, in deep learning, the weights in different layers might have different physical meanings. In these cases, we instead perform layer-wise gradient smoothing. We summarize LS-SGD for solving the finite-sum optimization (2) in Algorithm 1.

Algorithm 1 Laplacian smoothing stochastic gradient descent (LS-SGD)
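For concreteness, here is a minimal PyTorch sketch of the FFT-based smoothing step and one LS-SGD update; the function names and the layer-wise loop are ours, a sketch of Algorithm 1 rather than the authors' reference implementation:

import torch

def laplacian_smooth(g, sigma):
    """Return d = A_sigma^{-1} g for a 1D tensor g, via the FFT formula above."""
    m = g.numel()
    v = torch.zeros(m, dtype=g.dtype, device=g.device)
    v[0], v[1], v[-1] = -2.0, 1.0, 1.0                 # kernel of the periodic 1D Laplacian
    denom = 1.0 - sigma * torch.fft.fft(v)             # eigenvalues of A_sigma (all >= 1)
    return torch.fft.ifft(torch.fft.fft(g) / denom).real

def ls_sgd_step(params, lr, sigma):
    """One LS-SGD update (4), applied layer-wise to the gradients stored in p.grad."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad.reshape(-1)   # flatten; row- or column-major order both work per the text above
            d = laplacian_smooth(g, sigma) if g.numel() >= 3 else g
            p -= lr * d.reshape(p.shape)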

Remark 1

The Sobolev gradient [20], used in image processing and elsewhere, applies a multi-dimensional Laplacian operator that acts on \({\varvec{w}}\) itself; it differs from the one-dimensional discrete Laplacian employed in our LS-GD scheme, which acts on the indices of \({\varvec{w}}\).

It is worth noting that LS is complementary to momentum and to adaptive learning rate algorithms such as Adagrad. It can be combined with these acceleration techniques to speed up the convergence. We will show the performance of these algorithms in Sect. 6.

2.1 Generalized smoothing gradient descent

We can generalize \({\varvec{A}}_\sigma \) to the n-th-order discrete hyper-diffusion operator as follows

$$\begin{aligned} {\varvec{I}}+ (-1)^n \sigma {\varvec{L}}^n \doteq {\varvec{A}}_\sigma ^n. \end{aligned}$$

Each row of the discrete Laplacian operator \({\varvec{L}}\) consists of an appropriate arrangement of weights in central finite difference approximation to the second-order derivative. Similarly, each row of \({\varvec{L}}^n\) is an arrangement of the weights in the central finite difference approximation to the 2n-th-order derivative.

Remark 2

The n-th-order smoothing operator \({\varvec{I}}+ (-1)^n \sigma {\varvec{L}}^n\) can only be applied to the problem with dimension at least \(2n+1\). Otherwise, we need to add dummy variables.

Again, we apply FFT to compute the smoothed gradient vector. For a given gradient vector \({\varvec{g}}\), the smoothed surrogate, \(({\varvec{A}}_\sigma ^n)^{-1}{\varvec{g}}\doteq {\varvec{d}}\), can be obtained by solving \({\varvec{g}}={\varvec{d}}+(-1)^n\sigma {\varvec{v}}_n*{\varvec{d}}\), where \({\varvec{v}}_n = (c^n_{n+1}, c^n_{n+2}, \cdots , c^n_{2n+1}, 0, \cdots , 0, c^n_1,\) \( c^n_2, \cdots , c^n_{n-1}, c^n_n)\) is a vector of the same dimension as the gradient to be smoothed. And the coefficient vector \({\varvec{c}}^n = (c^n_1, c^n_2, \cdots , c^n_{2n+1})\) can be obtained recursively by the following formula:

$$\begin{aligned} {\varvec{c}}^1 = (1, -2, 1), \quad c^n_i = {\left\{ \begin{array}{ll} 1 & \quad i = 1, 2n+1,\\ -2c^{n-1}_1+c^{n-1}_2 & \quad i = 2, 2n,\\ c^{n-1}_{i-2}-2c^{n-1}_{i-1} + c^{n-1}_{i} & \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
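A short Python sketch (ours) that builds \({\varvec{c}}^n\) by repeatedly convolving with (1, -2, 1), which reproduces the recursion above, and places it into the circular kernel \({\varvec{v}}_n\):

def stencil(n):
    """c^n: central finite difference weights of the 2n-th derivative,
    obtained by convolving (1, -2, 1) with itself n times."""
    c = [1.0, -2.0, 1.0]
    for _ in range(n - 1):
        out = [0.0] * (len(c) + 2)
        for i, ci in enumerate(c):          # full convolution with (1, -2, 1)
            out[i] += ci
            out[i + 1] += -2.0 * ci
            out[i + 2] += ci
        c = out
    return c                                # e.g. stencil(2) == [1, -4, 6, -4, 1]

def kernel(n, m):
    """v_n of length m: c^n placed circularly with its central weight at index 0."""
    c, v = stencil(n), [0.0] * m
    for k, ck in enumerate(c):
        v[(k - n) % m] += ck
    return v                                # kernel(1, m) == [-2, 1, 0, ..., 0, 1]

The smoothed gradient is then obtained, as before, by dividing in Fourier space: \({\varvec{d}}= \mathrm{ifft}\big (\mathrm{fft}({\varvec{g}})/(\mathbf {1} + (-1)^n\sigma \,\mathrm{fft}({\varvec{v}}_n))\big )\).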

Remark 3

The computational complexities for different order smoothing schemes are the same when the FFT is utilized for computing the surrogate gradient.

3 The choice of step size

In this section, we discuss the choice of step size for LS-(S)GD, with a theoretical focus on LS-GD applied to functions with L-Lipschitz gradient.

Definition 1

We say the function F has L-Lipschitz gradient, if for any \({\varvec{w}}, {\varvec{u}}\in \mathbb {R}^m\), we have \(\Vert \nabla F({\varvec{w}}) - \nabla F({\varvec{u}})\Vert \le L\Vert {\varvec{w}}- {\varvec{u}}\Vert \).

For functions with L-Lipschitz gradient, it is known that the largest suitable step size for GD is \(\eta _{max}^{GD} = 1/L\) [32]. In the following, we establish an \(\ell _2\) estimate of the square root of the LS operator applied to an arbitrary vector. Based on this estimate, we show that LS-GD can take a larger step size than GD.

To determine the largest suitable step size for LS-GD, we first perform a change of variables in the LS-GD iteration (4) by letting \({\varvec{y}}^k = {\varvec{H}}^{-1}_\sigma {\varvec{w}}^k\), where \({\varvec{H}}_\sigma ={\varvec{A}}_\sigma ^{-1/2}\); then LS-GD can be written as

$$\begin{aligned} {\varvec{y}}^{k+1} = {\varvec{y}}^k - \eta _k {\varvec{H}}_\sigma \nabla F({\varvec{H}}_\sigma {\varvec{y}}^k), \end{aligned}$$
(5)

which is actually the GD for solving the following minimization problem

$$\begin{aligned} \min _{\varvec{y}}F({\varvec{H}}_\sigma {\varvec{y}}) := \min _{{\varvec{y}}} G({\varvec{y}}). \end{aligned}$$
(6)

Therefore, determining the largest suitable step size for LS-GD is equivalent to finding the largest suitable step size of GD for \(\min _{{\varvec{w}}} G({\varvec{w}})\) (here we use \({\varvec{w}}\) instead of \({\varvec{y}}\) for ease of notation).

3.1 \(\ell _2\) estimates

First, for any \({\varvec{w}}, {\varvec{u}}\in {\mathbb {R}}^m\), suppose the function F has L-Lipschitz gradient, i.e.,

$$\begin{aligned} \Vert \nabla F({\varvec{w}}) - \nabla F({\varvec{u}})\Vert \le L\Vert {\varvec{w}}-{\varvec{u}}\Vert . \end{aligned}$$

Then

$$\begin{aligned} \Vert \nabla G({\varvec{w}}) - \nabla G({\varvec{u}})\Vert&= \Vert {\varvec{H}}_\sigma (\nabla F({\varvec{w}}) - \nabla F({\varvec{u}}))\Vert \\&\le \Vert {\varvec{H}}_\sigma \Vert \Vert \nabla F({\varvec{w}})-\nabla F({\varvec{u}})\Vert \\&\le \Vert \nabla F({\varvec{w}})-\nabla F({\varvec{u}})\Vert , \end{aligned}$$

where the last inequality holds because all eigenvalues of \({\varvec{H}}_\sigma \) are at most one. This estimate indicates that we can use a step size of at least 1/L in GD for searching the optimum of the function G, i.e., we can use a step size of at least 1/L in LS-GD for searching the optimum of the function F.

Next, we establish a high probability estimation for taking a larger step size when using LS-GD. Without any prior knowledge about \({\varvec{v}}:=\nabla F({\varvec{w}}) - \nabla F({\varvec{u}})\), let us assume it is sampled uniformly from a ball in \(\mathbb {R}^m\) centered at the origin. Without loss of generality, we assume the radius of this ball is one. Under the above ansatz, we have the following result

Theorem 1

(\(\ell _2\)-estimate) Let \(\sigma >0\), and

$$\begin{aligned} \beta = \frac{1}{m}\sum _{i=1}^m \frac{1}{|1+2\sigma -\sigma z_i - \sigma \overline{z_i}| }, \end{aligned}$$

where \(z_1\), \(\cdots \), \(z_m\) are the m roots of unity. Let \({\varvec{v}}\) be uniformly distributed in the unit ball of the m-dimensional \(\ell _2\) space. Then for any \(\alpha > \frac{\sqrt{\beta }}{1-\frac{\pi }{\sqrt{m}}}\), we have

$$\begin{aligned} \mathbb {P}\left( \Vert {\varvec{H}}_\sigma {\varvec{v}}\Vert \ge \alpha \Vert {\varvec{v}}\Vert \right) \le 2\exp {\left( -\frac{2}{\pi ^2}m\left( \frac{\alpha -\alpha \frac{\pi }{\sqrt{m}} -\sqrt{\beta } }{\alpha +1} \right) ^2 \right) }. \end{aligned}$$
(7)

The proof of this theorem is provided in the appendix. For high-dimensional ML problems, e.g., training DNNs, m can be as large as tens of millions, so that the right-hand side of (7) is essentially zero, i.e., \(\Vert {\varvec{H}}_\sigma {\varvec{v}}\Vert < \alpha \Vert {\varvec{v}}\Vert \) holds with probability almost one. The closed form of \(\beta \) is given in Lemma 1.

Lemma 1

If \(z_1,\ldots ,z_m\) denote the m roots of unity, then

$$\begin{aligned} \beta = \frac{1}{m}\sum _{j=1}^{m} \frac{1}{|1+2\sigma -\sigma z_j - \sigma \bar{z_j}| } = \frac{1+\omega ^m}{(1-\omega ^m)\sqrt{1+4\sigma }} \rightarrow _{m\rightarrow \infty } \frac{1}{\sqrt{1+4\sigma }}, \end{aligned}$$
(8)

where \(\omega = \frac{2\sigma +1 - \sqrt{1+4\sigma }}{2\sigma } < 1.\)

The proof of the above lemma needs some tools from complex and harmonic analysis, which is provided in the appendix. Table 1 lists some typical values for different \(\sigma \) and m.

Table 1 The values of \(\beta \) corresponding to some \(\sigma \) and m

Based on Theorem 1, with high probability LS-GD can take a step size as large as \(\frac{1}{\sqrt{\beta } L}\) for high-dimensional functions with L-Lipschitz gradient. We will verify this numerically later.
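The closed form in Lemma 1 (and the entries of Table 1) can be reproduced with a few lines of Python; the sketch below is ours and uses the identity \(|1+2\sigma -\sigma z_j-\sigma \bar{z}_j| = 1+4\sigma \sin ^2(\pi j/m)\) for the m-th roots of unity:

import math

def beta_direct(sigma, m):
    # beta = (1/m) * sum_j 1/|1 + 2*sigma - sigma*z_j - sigma*conj(z_j)|
    #      = (1/m) * sum_j 1/(1 + 4*sigma*sin^2(pi*j/m))
    return sum(1.0 / (1.0 + 4.0 * sigma * math.sin(math.pi * j / m) ** 2)
               for j in range(m)) / m

def beta_closed_form(sigma, m):
    # Lemma 1
    w = (2.0 * sigma + 1.0 - math.sqrt(1.0 + 4.0 * sigma)) / (2.0 * sigma)
    return (1.0 + w ** m) / ((1.0 - w ** m) * math.sqrt(1.0 + 4.0 * sigma))

print(beta_direct(1.0, 1000), beta_closed_form(1.0, 1000))   # both approach 1/sqrt(5) ~ 0.447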

4 Variance reduction

The variance of SGD is one of the major bottlenecks that slow down the theoretically guaranteed convergence rate in training ML models. Most of the existing variance reduction algorithms require either the full batch gradient or storage of the stochastic gradient for each data point, which makes them difficult to use for training high-capacity DNNs. LS is an alternative approach to reduce the variance of the stochastic gradient with negligible extra computational time and memory cost. In this section, we rigorously show that LS reduces the variance of the stochastic gradient and reduces the optimality gap under a Gaussian noise assumption. Moreover, we numerically verify our theoretical results on both a quadratic function and a simple finite-sum optimization problem.

4.1 Gaussian noise scenario

The stochastic gradient \(\nabla f_{i_k}\), for any \(i_k \in [n]\), is an unbiased estimate of \(\nabla F\); many existing works model the deviation of the stochastic gradient from the full batch gradient \(\nabla F\) as Gaussian noise \(\mathcal {N}(\mathbf {0}, \Sigma )\), where \(\Sigma \) is the covariance matrix [28]. Therefore, suppressing the variable \({\varvec{w}}\) for simplicity of notation, we can relate the gradient and the stochastic gradient as

$$\begin{aligned} \nabla f_{i_k} = \nabla F + {\varvec{n}}, \end{aligned}$$
(9)

where \({\varvec{n}}\sim \mathcal {N}(\mathbf {0}, \Sigma )\). Thus, for LS stochastic gradient, we have

$$\begin{aligned} {\varvec{A}}_\sigma ^{-1}\nabla f_{i_k} = {\varvec{A}}_\sigma ^{-1}\left( \nabla F + {\varvec{n}}\right) . \end{aligned}$$
(10)

The variances of the stochastic gradient and the LS stochastic gradient are essentially the variances of \({\varvec{n}}\) and \({\varvec{A}}_\sigma ^{-1}{\varvec{n}}\), respectively. The following theorem quantifies the ratio between the variances of \({\varvec{A}}_\sigma ^{-1}{\varvec{n}}\) and \({\varvec{n}}\).

Theorem 2

Let \(\kappa \) denote the condition number of \(\Sigma \). Then, for m-dimensional Gaussian random vector \({\varvec{n}}\sim \mathcal {N}(\mathbf {0}, \Sigma )\), we have

$$\begin{aligned} \frac{\sum _{i=1}^m \mathrm {Var}[\left( ({\varvec{A}}_\sigma ^n)^{-1} {\varvec{n}}\right) _i]}{\sum _{i=1}^m\mathrm {Var}[ \left( {\varvec{n}}\right) _i ]} \le 1 - \frac{1}{\kappa } + \frac{1}{\kappa m}\sum _{j=0}^{m}\frac{1}{[1+4^n\sigma \sin ^{2n}(\pi j/m)]^2}. \end{aligned}$$
(11)

The proof of Theorem 2 will be provided in the appendix.

Table 2 lists the theoretical upper bound on the ratio of the variances after and before LS for an m-dimensional standard normal vector \({\varvec{n}}\sim \mathcal {N}(\mathbf {0}, {\varvec{I}})\). In practice, high-order smoothing reduces the variance more significantly.

Table 2 Theoretical upper bound of \(\sum _{i=1}^m \mathrm {Var}[\left( ({\varvec{A}}_\sigma ^n)^{-1} {\varvec{n}}\right) _i]/\sum _{i=1}^m\mathrm {Var}[ \left( {\varvec{n}}\right) _i ]\) when \(\mathbf {n}\) is an m-dimensional standard normal vector with \(m\ge 10000\)
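The right-hand side of (11), which for \(\kappa = 1\) (i.e., \(\Sigma = {\varvec{I}}\)) gives the entries of Table 2, can be evaluated directly; a short sketch of ours:

import math

def variance_ratio_bound(sigma, m, n=1, kappa=1.0):
    """Right-hand side of (11); the sum runs over j = 0, ..., m as displayed."""
    s = sum(1.0 / (1.0 + 4.0 ** n * sigma * math.sin(math.pi * j / m) ** (2 * n)) ** 2
            for j in range(m + 1))
    return 1.0 - 1.0 / kappa + s / (kappa * m)

print(variance_ratio_bound(sigma=1.0, m=10000, n=1))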

Moreover, LS preserves the mean (Proposition 2) and decreases the largest component and increases the smallest component (Proposition 1) of any vector.

Proposition 1

For any vector \({\varvec{g}}\in \mathbb {R}^m\), \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\), let \(j_{\max } = \arg \max _i d_i\) and \(j_{\min } = \arg \min _i d_i \). We have \(\max _i d_i = d_{j_{\max }} \le g_{j_{\max }} \le \max _i g_i\) and \(\min _i d_i = d_{j_{\min }} \ge g_{j_{\min }} \ge \min _i g_i\).

Proof

Since \({\varvec{g}}= {\varvec{A}}_{\sigma }{\varvec{d}}\), it holds that

$$\begin{aligned} g_{j_{\max }} = d_{j_{\max }} + \sigma ( 2 d_{j_{\max }} - d_{j_{\max }-1} - d_{j_{\max }+1}), \end{aligned}$$

where periodicity of the index is used if necessary. Since \( 2 d_{j_{\max }} - d_{j_{\max }-1} - d_{j_{\max }+1}\ge 0\), we have \(\max _i d_i = d_{j_{\max }} \le g_{j_{\max }} \le \max _i g_i\). Similarly, we can show \(\min _i d_i = d_{j_{\min }} \ge g_{j_{\min }} \ge \min _i g_i\). \(\square \)

Proposition 2

The operator \({\varvec{A}}_\sigma ^{-1}\) preserves the sum of components. For any \({\varvec{g}}\in \mathbb {R}^m\) and \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\), we have \(\sum _j d_j = \sum _j g_j\), or equivalently, \(\mathbf {1}^\top {\varvec{d}}= \mathbf {1}^\top {\varvec{g}}\).

Proof

Since \({\varvec{g}}= {\varvec{A}}_\sigma {\varvec{d}}\),

$$\begin{aligned} \sum _i g_i = \mathbf {1}^\top \mathbf {g} = \mathbf {1}^\top ({\varvec{I}}+ \sigma {\varvec{D}}_+^\top {\varvec{D}}_+){\varvec{d}}= \mathbf {1}^\top {\varvec{d}}= \sum _i d_i, \end{aligned}$$

where we used \({\varvec{D}}_+\mathbf {1} = \mathbf {0}\). \(\square \)
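These two propositions are easy to check numerically on a random vector; below is a self-contained sketch of ours, using the same FFT solve as in Sect. 2:

import torch

m, sigma = 1000, 1.0
g = torch.randn(m)
v = torch.zeros(m); v[0], v[1], v[-1] = -2.0, 1.0, 1.0
d = torch.fft.ifft(torch.fft.fft(g) / (1.0 - sigma * torch.fft.fft(v))).real

print(torch.allclose(d.sum(), g.sum(), atol=1e-4))          # sum (hence mean) preserved, Proposition 2
print(bool(d.max() <= g.max()), bool(d.min() >= g.min()))   # extremes pulled inward, Proposition 1
print(bool(d.var() < g.var()))                              # and the variance shrinks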

4.2 Reduce the optimality gap

A direct benefit of variance reduction is that it reduces the optimality gap in SGD when constant step size is applied.

Proposition 3

Suppose f is convex with a global minimizer \({\varvec{w}}^*\), and \(f^* = f({\varvec{w}}^*)\). Consider the following iteration with constant learning rate \(\eta >0\)

$$\begin{aligned} {\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta ({\varvec{A}}_\sigma ^n)^{-1} {\varvec{g}}^k \end{aligned}$$

where \({\varvec{g}}^k\) is the sampled gradient in the k-th iteration at \({\varvec{w}}^k\) satisfying \(\mathbb {E}[{\varvec{g}}^k] = \nabla f({\varvec{w}}^k)\). Denote \(G_{{\varvec{A}}_\sigma ^n}: = \frac{1}{K}\sum _{k=0}^{K-1}{\mathbb {E}}\Vert {\varvec{g}}^k\Vert ^2_{({\varvec{A}}_\sigma ^n)^{-1}}\) and \({\overline{{\varvec{w}}}}^K := \sum _{k=0}^{K-1} {\varvec{w}}^k /K\) the ergodic average of iterates. Then the optimality gap after K-th iteration is

$$\begin{aligned} \mathbb {E}[f({\overline{{\varvec{w}}}}^K)] - f^* \le {\frac{1}{2\eta K} \mathbb {E}[\Vert {\varvec{w}}^{0} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2]} + \frac{\eta G_{{\varvec{A}}_\sigma ^n}}{2}. \end{aligned}$$

Proof

Since f is convex, we have

$$\begin{aligned} \langle \nabla f({\varvec{w}}^k), {\varvec{w}}^{k} - {\varvec{w}}^* \rangle \ge f({\varvec{w}}^k) - f^*. \end{aligned}$$
(12)

Furthermore,

$$\begin{aligned}&\; \mathbb {E}[\Vert {\varvec{w}}^{k+1} -{\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}^2] = \mathbb {E}[\Vert {\varvec{w}}^{k} - \eta ({\varvec{A}}_\sigma ^n)^{-1}{\varvec{g}}^k -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] \\&\quad = \; \mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert ^2_{{\varvec{A}}_\sigma ^n}] - 2\eta \mathbb {E}[\langle {\varvec{g}}^k, {\varvec{w}}^k - {\varvec{w}}^* \rangle ]+ \eta ^2\mathbb {E}[\Vert ({\varvec{A}}_\sigma ^n)^{-1}{\varvec{g}}^{{k}}\Vert _{{\varvec{A}}_\sigma ^n}^2] \\&\quad \le \; \mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] - 2\eta \mathbb {E}[\langle \nabla f({\varvec{w}}^k), {\varvec{w}}^k - {\varvec{w}}^* \rangle ]+ \eta ^2{\mathbb {E}} \Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 \\&\quad \le \; \mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] - 2\eta (\mathbb {E}[f({\varvec{w}}^k)]-f^*) + \eta ^2 {\mathbb {E}}\Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2, \end{aligned}$$

where the last inequality is due to (12). We rearrange the terms and arrive at

$$\begin{aligned} \mathbb {E}[ f({\varvec{w}}^k)]-f^*\le&\, \frac{1}{2\eta }(\mathbb {E}[\Vert {\varvec{w}}^{k} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] - \mathbb {E}[\Vert {\varvec{w}}^{k+1} -{\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}^2]) + \frac{\eta {\mathbb {E}} \Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 }{2}. \end{aligned}$$

Summing over k from 0 to \(K-1\) and averaging and using the convexity of f, we have

$$\begin{aligned} \mathbb {E}[f({\overline{{\varvec{w}}}}^K)] - f^*&\le \; \frac{\sum _{k=0}^{K-1} \mathbb {E}[f({\varvec{w}}^k)]}{K} - f^*\\&\le \frac{1}{2\eta K} \mathbb {E}[\Vert {\varvec{w}}^{0} -{\varvec{w}}^* \Vert _{{\varvec{A}}_\sigma ^n}^2] + \frac{\sum _{k=0}^{K-1}{\mathbb {E}}\Vert {\varvec{g}}^k\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 }{2K} \eta . \end{aligned}$$

\(\square \)

Remark 4

Since \(G_{{\varvec{A}}_\sigma ^n}\) is smaller than the corresponding value without LS, it shows that the optimality gap is reduced when LS is used with a constant step size. In practice, this is also true for the stage-wise step size since it is a constant in each stage of the training phase.

4.2.1 Optimization for quadratic function with noise corrupted gradient

In this part, we empirically show the advantages of LS-(S)GD and its generalized schemes for convex optimization problems with noisy gradients. Consider searching for the minimizer \({\varvec{x}}^*\) of the quadratic function \(f({\varvec{x}})\) defined in (13).

$$\begin{aligned} f(x_1, x_2, \cdots , x_{100}) = \sum _{i=1}^{50}x_{2i-1}^2 + \sum _{i=1}^{50} \frac{x_{2i}^2}{10^2}. \end{aligned}$$
(13)

Here, we consider the gradient with Gaussian noise injection, i.e., at any given point \({\varvec{x}}\), we have

$$\begin{aligned} {\tilde{\nabla }}_\epsilon f({\varvec{x}}) := \nabla f({\varvec{x}}) + \epsilon \mathcal {N}(\mathbf {0}, {\varvec{I}}), \end{aligned}$$

where the scalar \(\epsilon \) controls the noise level and \(\mathcal {N}(\mathbf {0}, {\varvec{I}})\) is a Gaussian noise vector with zero mean and unit variance in each coordinate. The corresponding numerical schemes can be formulated as

$$\begin{aligned} {\varvec{x}}^{k+1} = {\varvec{x}}^k - \eta _k ({\varvec{A}}_\sigma ^n)^{-1} {\tilde{\nabla }}_\epsilon f({\varvec{x}}^k), \end{aligned}$$
(14)

where \(\sigma \) is the smoothing parameter, selected to be 10.0 here to remove the intense noise. We take diminishing step sizes with initial values 0.1 for SGD/smoothed SGD, and 0.9 and 1.8 for GD/smoothed GD, respectively. Without noise, the smoothing allows us to take larger step sizes; rounding to the first digit, 0.9 and 1.8 are the largest suitable step sizes for GD and its smoothed version here. We study both a constant learning rate and an exponentially decaying learning rate, i.e., after every 1000 iterations the learning rate is divided by 10. We apply the schemes corresponding to \(n=0, 1, 2\) in (14) to the problem (13), with the initial point \({\varvec{x}}^0=(1, 1, \cdots , 1)\).
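A condensed sketch (ours) of this experiment with first-order smoothing and a constant step size; the exact schedules described above may differ:

import torch

m, sigma, eps, lr = 100, 10.0, 1.0, 0.9
coef = torch.tensor([1.0 if i % 2 == 0 else 0.01 for i in range(m)])   # coefficients 1 and 1/10^2, as in (13)

v = torch.zeros(m); v[0], v[1], v[-1] = -2.0, 1.0, 1.0
denom = 1.0 - sigma * torch.fft.fft(v)        # n = 1 smoothing; replace v by v_n for higher order

x = torch.ones(m)                             # initial point (1, 1, ..., 1)
for k in range(2000):
    g = 2.0 * coef * x + eps * torch.randn(m)                 # noisy gradient of (13)
    x = x - lr * torch.fft.ifft(torch.fft.fft(g) / denom).real
print(float((coef * x * x).sum()))            # optimality gap f(x) - f^*, with f^* = 0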

Figure 1 shows the iteration versus the optimality gap when a constant learning rate is used. In the noise-free case, all three schemes converge linearly. When there is noise, our smoothed gradient helps to reduce the optimality gap and converges faster after a few iterations.

Fig. 1 Iterations versus optimality gap for GD and smoothed GD with order 1 and order 2 smoothing for the problem in (13). Constant step size is used

The exponentially decaying learning rate helps our smoothed SGD to reach a point with a smaller optimality gap, and the higher-order smoothing further reduces the optimality gap, as shown in Fig. 2. This is due to the noise removal properties of the smoothing operators.

Fig. 2 Iterations versus optimality gap for GD and smoothed GD with order 1 and 2 smoothing for the problem in (13). Exponentially decaying step size is utilized here

4.2.2 Find the center of multiple points

Consider finding the center of a given set of 5K points \(\{\mathbf {x}_i\in \mathbb {R}^{50}\}_{i=1}^{5000}\). It can be formulated as the following finite-sum optimization

$$\begin{aligned} \min _{\varvec{x}}F({\varvec{x}}) := \frac{1}{N}\sum _{i=1}^N f_i({\varvec{x}}) = \frac{1}{N}\sum _{i=1}^N \Vert {\varvec{x}}_i - {\varvec{x}}\Vert ^2. \end{aligned}$$
(15)

We solve this optimization problem by running either SGD or LS-SGD for 20K iterations starting from the same random initial point with batch size 20. The initial step size is set to 1.0 and 1.2 for SGD and LS-SGD, respectively, and decays by a factor of 1.1 after every 10 iterations. As the learning rate decays, the variance of the stochastic gradient decays [46]; thus, we decay \(\sigma \) by a factor of 10 after every 1K iterations. Figure 3a plots a 2D cross section of the trajectories of SGD and LS-SGD, and it shows that the trajectory of SGD is noisier than that of LS-SGD. Figure 3b plots the iteration versus loss for both SGD and LS-SGD averaged over 3 independent runs. LS-SGD converges faster than SGD and has a smaller optimality gap than SGD. This numerical result verifies our theoretical results on the optimality gap (Proposition 3).

Fig. 3 Left: 2D trajectories of SGD and LS-SGD. Right: Iteration versus loss for SGD and LS-SGD

4.2.3 Multi-class logistic regression

Consider applying the proposed optimization schemes to train a multi-class logistic regression model. We run 200 epochs of SGD and of the different-order smoothing algorithms to maximize the likelihood of multi-class logistic regression with batch size 100, and we apply an exponentially decaying learning rate with initial value 0.5, divided by 10 after every 50 epochs. We train the model with only \(10\%\) randomly selected MNIST training data and test the trained model on the entire set of test images. We further compare with SVRG under the same setting. Figure 4 shows the histograms of the generalization accuracy of the model trained by SGD (a); SVRG (b); LS-SGD (order 1) (c); LS-SGD (order 2) (d). It is seen that SVRG somewhat improves the generalization, with higher average accuracy. However, the first- and second-order LS-SGD type algorithms lift the average generalization accuracy by more than \(1\%\) and remarkably reduce the variance of the generalization accuracy over 100 independent trials.

4.3 Iteration versus loss

In this part, we show the evolution of the loss in training the multi-class logistic regression model by SGD, SVRG, and LS-SGD with first- and second-order smoothing, respectively, as illustrated in Fig. 5. At each iteration, among 100 independent experiments, SGD has the largest variance, and SGD with the first-order smoothed gradient significantly reduces the variance of the loss across experiments. The second-order smoothing further reduces the variance. The variance of the loss at each iteration among the 100 experiments is minimized when SVRG is used to train the multi-class logistic model. However, the generalization performance of the model trained by SVRG is not as good as that of the ones trained by LS-SGD or higher-order smoothed gradient descent (Fig. 4b).

Fig. 4 Histogram of testing accuracy over 100 independent experiments of the multi-class logistic regression model trained on randomly selected \(10\%\) MNIST data by different algorithms

Fig. 5 Iterations versus loss for SGD, SVRG, and LS-SGD with order 1 and order 2 gradient smoothing for training the multi-class logistic regression model

4.4 Variance reduction in stochastic gradient

We numerically verify the efficiency of variance reduction in this part. We simplify the problem by applying multi-class logistic regression only to the digits 1 and 2 of the MNIST training data. In order to compute the variance of the (LS)-stochastic gradients, we first compute the descent path of (LS)-GD by applying full batch (LS)-GD with learning rate 0.5 starting from the same random initialization. We record the full batch (LS)-gradient at each point along the descent path. Then we compute the (LS)-stochastic gradients at each point along the path by using different batch sizes and smoothing parameters \(\sigma \). In computing the (LS)-stochastic gradients, we run 100 independent experiments. We then compute the variance of the (LS)-stochastic gradient among these 100 experiments, regarding the full batch (LS)-gradient as the mean, at each point along the full batch (LS)-GD descent path. For each pair of batch size and \(\sigma \), we report the maximum variance over all the coordinates of the gradient and all the points along the descent path. We list the variance results in Table 3 (note that the case \(\sigma =0\) corresponds to SGD). These results show that, compared to SGD, LS-SGD with \(\sigma =3\) can reduce the maximum variance by a factor of roughly 100 for different batch sizes. It is worth noting that the high-order smoothing reduces more variance than the lower-order smoothing; this might be due to the fact that the noise of SGD is not Gaussian.

Table 3 The maximum variance of the stochastic gradient generated by LS-SGD with different \(\sigma \) and batch size. \(\sigma =0\) recovers the SGD

5 Numerical results on avoiding local minima and speeding up convergence

We first show that LS-GD can bypass sharp minima and reach the global minimum. We consider the following function, in which we 'drill' narrow holes into a smooth convex function,

$$\begin{aligned} f(x, y, z) = -4e^{-\left( (x-\pi )^2+(y-\pi )^2+(z-\pi )^2\right) } - 4\sum _{i}\cos (x)\cos (y)e^{-\beta \left( (x-r\sin (\frac{i}{2})-\pi )^2+(y-r\cos (\frac{i}{2})-\pi )^2\right) }, \end{aligned}$$
(16)

where the summation is taken over the index set \(\{i\in \mathbb {N}\,|\; 0 \le i < 4\pi \}\), and r and \(\beta \) are parameters that determine the location and narrowness of the local minima; they are set to 1 and \(\frac{1}{\sqrt{500}}\), respectively. We run GD and LS-GD starting from a random point in the neighborhoods of the narrow minima, i.e., \((x_0, y_0, z_0)\in \{\bigcup _i U_\delta (r\sin (\frac{i}{2})+\pi , r\cos (\frac{i}{2})+\pi , \pi )\,|\; 0\le i<4\pi , i\in \mathbb {N}\}\), where \(U_\delta (P)\) is the neighborhood of the point P with radius \(\delta \). Our experiments (Fig. 6) show that, if \(\delta \le 0.2\), GD converges to a narrow local minimum, while LS-GD converges to the wider global minimum.

Fig. 6 Demo of GD and LS-GD. a Depicts the slice of the function (16) with \(z=2.34\); b shows the paths of GD (red) and LS-GD (black). We take the step size to be 0.02 for both GD and LS-GD. \(\sigma =1.0\) is utilized for LS-GD

Next, let us compare LS-GD with some popular optimization methods on the benchmark 2D Rosenbrock function, which is non-convex. The global minimum is inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; converging to the global minimum, however, is difficult. The function is defined by

$$\begin{aligned} f(x, y) = (a-x)^2 + b(y-x^2)^2, \end{aligned}$$
(17)

It has a global minimum at \((x, y) = (a, a^2)\); we set \(a=1\) and \(b=100\) in the experiments.

Starting from the initial point \((-3, -4)\), we run 2K iterations of the following optimizers: GD, GD with Nesterov momentum [31], Adam [21], RMSProp [44], and LS-GD (\(\sigma =0.5\)). The step size used for all these methods is \(3\times 10^{-3}\). Figure 7 plots the iteration versus objective value, and it shows that GD together with Nesterov momentum converges faster than all the other algorithms; the second best algorithm is LS-GD. Meanwhile, Nesterov momentum can be used to speed up LS-GD, and we will show this numerically in training DNNs in Sect. 6.

Fig. 7 Iteration versus loss of different optimization algorithms in optimizing the Rosenbrock function

Furthermore, we show that LS-GD can be further accelerated by using Nesterov momentum. As shown in Fig. 8, LS-GD together with Nesterov momentum converges much faster than GD with momentum, especially for the high-dimensional Rosenbrock function.

Fig. 8 Iteration versus objective value for GD with Nesterov momentum and LS-GD with Nesterov momentum

6 Application to deep learning

6.1 Train neural nets with small batch size

Many advanced artificial intelligence tasks require training neural nets with extremely small batch sizes. The milestone technique for this is group normalization [47]. In this section, we show that LS-SGD successfully trains DNNs with an extremely small batch size. We consider LeNet-5 [23] for MNIST classification. Our network architecture is as follows:

$$\begin{aligned} \text{ LeNet-5: }\ \mathrm{input}_{28 \times 28} \rightarrow \mathrm{conv}_{20, 5, 2} \rightarrow \mathrm{conv}_{50, 5, 2} \rightarrow \mathrm{fc}_{512} \rightarrow \mathrm{softmax}. \end{aligned}$$

The notation \(\mathrm{conv}_{c, k, m}\) denotes a 2D convolutional layer with c output channels, each of which is the sum of a channel-wise convolution operation on the input using a learnable kernel of size \(k \times k\); it is followed by a ReLU nonlinearity and max pooling with stride m. \(\mathrm{fc}_{512}\) is an affine transformation that maps the input to a vector of dimension 512. Finally, the tensors are activated by a multi-class logistic function. The MNIST data are first passed to the layer \(\mathrm{input}_{28 \times 28}\) and further processed by this hierarchical structure. We run 100 epochs of both SGD and LS-SGD with initial learning rate 0.01, divided by 5 after 50 epochs, and use a weight decay of 0.0001 and momentum of 0.9. Figure 9a plots the generalization accuracy on the test set of LeNet-5 trained with different batch sizes. For each batch size, LS-SGD with \(\sigma =1.0\) keeps the test accuracy above \(99.4\%\), whereas SGD reduces the accuracy to \(97\%\) when batch size 4 is used, and the classification becomes just a random guess when the model is trained by SGD with batch size 2. A small batch size leads to large noise in the gradient, which may push the noisy gradient away from a descent direction; Laplacian smoothing rescues this by decreasing the noise.
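A PyTorch sketch of this LeNet-5 variant; the padding choices and the final 10-way linear layer (which we fold into the "softmax" stage) are our assumptions:

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Sketch of the LeNet-5 variant described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # conv_{20,5,2}
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # conv_{50,5,2}
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 4 * 4, 512), nn.ReLU(),   # fc_512 (28x28 input, no padding)
            nn.Linear(512, num_classes),             # logits; the softmax is folded into the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))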

Fig. 9 a Testing accuracy of LeNet5 trained by SGD/LS-SGD on MNIST with various batch sizes. b The evolution of the pre-activated ResNet56's training and generalization accuracy by SGD and LS-SGD (start from the 20-th epoch)

6.2 Improve generalization accuracy

On Cifar10 [22], we compare the performance of LS-SGD and SGD on ResNet, with the pre-activated ResNet56 as an illustration. We take the same training strategy as that used in [17], except that we run 200 epochs with the learning rate decaying by a factor of 5 after every 40 epochs. For ResNet, instead of applying LS-SGD for all epochs, we only use LS-SGD in the first 40 epochs, and the remaining training is carried out by SGD. (This saves the extra computational cost due to LS, and we noticed that the performance is similar to the case when LS is used for the whole training process.) The parameter \(\sigma \) is set to 1.0. Figure 9b depicts one path of the training and generalization accuracy of the neural nets trained by SGD and LS-SGD, respectively. It is seen that even though the training accuracy obtained by SGD is higher than that of LS-SGD, its generalization is inferior to that of LS-SGD. We repeat this experiment 25 times; the histograms of the corresponding accuracy are shown in Fig. 10.

Fig. 10 The histogram of the generalization accuracy of the pre-activated ResNet56 on Cifar10 trained by SGD and LS-SGD over 25 independent experiments

6.3 Training Wasserstein GAN

Generative adversarial networks (GANs) [15] are notoriously delicate and unstable to train [4]. In [27], Wasserstein GANs (WGANs) were introduced to combat the instability in training GANs. In addition to being more robust with respect to training parameters and network architecture, WGANs provide a reliable estimate of the Earth Mover (EM) metric, which correlates well with the quality of the generated samples. Nonetheless, WGAN training becomes unstable with a large learning rate or when used with a momentum-based optimizer [27]. In this section, we demonstrate that the gradient smoothing technique in this paper alleviates the instability in the training and improves the quality of generated samples. Since WGANs with weight clipping are typically trained with RMSProp [44], we propose replacing the gradient g by a smoothed version \(g_\sigma = {\varvec{A}}_\sigma ^{-1}g\) and also updating the running averages using \(g_\sigma \) instead of g. We name this algorithm LS-RMSProp.

To accentuate the instability in training and demonstrate the effects of gradient smoothing, we deliberately use a large learning rate for training the generator. We compare regular RMSProp with LS-RMSProp. The learning rate for the critic is kept small, and the critic is trained approximately to convergence so that the critic loss is still an effective approximation to the Wasserstein distance. To control the number of unknowns in the experiment and make a meaningful comparison using the critic loss, we use the classical RMSProp for the critic and only apply LS-RMSProp to the generator.
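A sketch (ours) of one LS-RMSProp step for the generator; the hyperparameter defaults are only illustrative, and in practice the smoothing is applied layer-wise as in LS-SGD:

import torch

def ls_rmsprop_step(params, state, lr=5e-3, sigma=1.0, alpha=0.99, eps=1e-8):
    """Smooth each layer's gradient with A_sigma^{-1}, then use the smoothed gradient
    g_sigma both for the running average of squared gradients and for the update."""
    with torch.no_grad():
        for i, p in enumerate(params):
            if p.grad is None:
                continue
            g = p.grad.reshape(-1)
            if g.numel() >= 3:
                v = torch.zeros_like(g); v[0], v[1], v[-1] = -2.0, 1.0, 1.0
                g = torch.fft.ifft(torch.fft.fft(g) / (1.0 - sigma * torch.fft.fft(v))).real
            g = g.reshape(p.shape)
            sq = state.setdefault(i, torch.zeros_like(p))      # running average of g_sigma^2
            sq.mul_(alpha).addcmul_(g, g, value=1.0 - alpha)
            p.addcdiv_(g, sq.sqrt().add_(eps), value=-lr)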

Fig. 11 Critic loss with learning rate \(lrD = 0.0001\), \(lrG = 0.005\) for RMSProp (left) and LS-RMSProp (right), trained for 20K iterations. We apply a mean filter of window size 13 for better visualization. The loss from LS-RMSProp is visibly less noisy

Fig. 12 Samples from WGANs trained with RMSProp (a, c) and LS-RMSProp (b, d). The learning rate is set to \(lrD = 0.0001\), \(lrG = 0.005\) for both RMSProp and LS-RMSProp in a and b. And \(lrD = 0.0001\), \(lrG = 0.0001\) are used for both RMSProp and LS-RMSProp in c and d. The critic is trained for 5 iterations per step of the generator and 200 iterations per every 500 steps of the generator

We train the WGANs on the MNIST dataset using the DCGAN architecture [35] for both the critic and the generator. In Fig. 11 (left), we observe that the loss for RMSProp trained with a large learning rate has multiple sharp spikes, indicating instability in the training process. The samples generated are also lower in quality, containing noisy spots as shown in Fig. 12a. In contrast, the curve of the training loss for LS-RMSProp is smoother and exhibits fewer spikes. The generated samples, as shown in Fig. 12b, are also of better quality and visibly less noisy; the generated characters in Fig. 12b are more realistic than the ones shown in Fig. 12a. The effects are less pronounced with a small learning rate, but still result in a modest improvement in sample quality, as shown in Fig. 12c, d. We also apply LS-RMSProp for training the critic, but do not see a clear improvement in quality. This may be because the critic is already trained near optimality during each iteration and does not benefit much from gradient smoothing.

6.4 Deep reinforcement learning

Deep reinforcement learning (DRL) has been applied to playing games, including Cartpole [9], Atari [30], and Go [29, 42]. DNNs play a vital role in approximating the Q-function or the policy function. We apply the Laplacian smoothed gradient to train the policy function to play the Cartpole game, using the standard policy gradient procedure [9]. We use the following network to approximate the policy function:

$$\begin{aligned} \mathrm{input}_4 \rightarrow \mathrm{fc}_{20} \rightarrow \mathrm{relu} \rightarrow \mathrm{fc}_2 \rightarrow \mathrm{softmax}. \end{aligned}$$

The network is trained by RMSProp and by LS-RMSProp with \(\sigma =1.0\), respectively. The learning rate and other related parameters are set to the PyTorch defaults. Training is stopped once the average duration of 5 consecutive episodes exceeds 490; in each training episode, the maximum number of steps is set to 500. The left and right panels of Fig. 13 depict the training procedures using RMSProp and LS-RMSProp, respectively. We see that the Laplacian smoothed gradient takes fewer episodes to reach the stopping criterion. Moreover, we run the above experiment 5 times independently and apply the trained models to play Cartpole. The game lasts more than 1000 steps for all 5 models trained by LS-RMSProp, while only 3 of the models trained by vanilla RMSProp last more than 1000 steps.
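The policy network above, as a PyTorch sketch (layer names are ours):

import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(4, 20),     # input_4 -> fc_20
    nn.ReLU(),
    nn.Linear(20, 2),     # fc_2: logits for the two Cartpole actions
    nn.Softmax(dim=-1),   # action probabilities
)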

Fig. 13 Durations of the Cartpole game in the training procedure. Left and right are the training procedures by RMSProp and LS-RMSProp with \(\sigma =1.0\), respectively

7 Convergence analysis

Note that the LS matrix \({\varvec{A}}_\sigma ^{-1}\) is positive definite, and its largest and smallest eigenvalues are 1 and \(\frac{1}{1+4\sigma }\), respectively. It is straightforward to show that all the convergence results for (S)GD still hold for LS-(S)GD. In this section, we show some additional convergence results for LS-(S)GD with a focus on LS-GD; the corresponding results for LS-SGD follow in a similar way.

Proposition 4

Consider the algorithm \({\varvec{w}}^{k+1} = {\varvec{w}}^k - \eta _k ({\varvec{A}}_\sigma ^n)^{-1} \nabla f({\varvec{w}}^k)\). Suppose f is L-Lipschitz smooth and \(0< {\tilde{\eta }} \le \eta _k \le {\bar{\eta }}< \frac{2}{L}\). Then \(\Vert \nabla f({\varvec{w}}^k)\Vert \rightarrow 0\) as \(k\rightarrow \infty \). Moreover, if the Hessian \(\nabla ^2 f\) of f is continuous and positive definite with \({\varvec{w}}^*\) being the minimizer of f, and \({\bar{\eta }}\Vert \nabla ^2 f\Vert <1\), then \(\Vert {\varvec{w}}^k-{\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}\rightarrow 0\) as \(k\rightarrow \infty \), and the convergence is linear.

Proof

By the Lipschitz continuity of \(\nabla f\) and the descent lemma [5], we have

$$\begin{aligned} f({\varvec{w}}^{k+1})&\; = f({\varvec{w}}^k - \eta _k({\varvec{A}}_\sigma ^n)^{-1}\nabla f({\varvec{w}}^k)) \\&\; \le f({\varvec{w}}^k) - \eta _k \langle \nabla f({\varvec{w}}^k), ({\varvec{A}}_\sigma ^n)^{-1}\nabla f({\varvec{w}}^k))\rangle + \frac{\eta ^2_k L}{2} \Vert ({\varvec{A}}_\sigma ^n)^{-1} \nabla f({\varvec{w}}^k)\Vert ^2 \\&\; \le f({\varvec{w}}^k) - \eta _k \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 + \frac{\eta ^2_k L}{2} \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 \\&\; \le f({\varvec{w}}^k) - {\tilde{\eta }}\left( 1-\frac{{\bar{\eta }} L}{2}\right) \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2. \end{aligned}$$

Summing the above inequality over k, we have

$$\begin{aligned} {\tilde{\eta }}\left( 1-\frac{{\bar{\eta }} L}{2}\right) \sum _{k=0}^{\infty } \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2 \le f({\varvec{w}}^0) -\lim _{k\rightarrow \infty } f({\varvec{w}}^k)< \infty . \end{aligned}$$

Therefore, \( \Vert \nabla f({\varvec{w}}^k)\Vert _{({\varvec{A}}_\sigma ^n)^{-1}}^2\rightarrow 0\), and thus, \( \Vert \nabla f({\varvec{w}}^k)\Vert \rightarrow 0\).

For the second claim, we have

$$\begin{aligned}&{\varvec{w}}^{k+1} -{\varvec{w}}^* \\&\quad = \; {\varvec{w}}^k - {\varvec{w}}^* -\eta _k ({\varvec{A}}_\sigma ^n)^{-1}(\nabla f({\varvec{w}}^k) - \nabla f({\varvec{w}}^*)) \\&\quad = \; {\varvec{w}}^k - {\varvec{w}}^* -\eta _k ({\varvec{A}}_\sigma ^n)^{-1} \left( \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\cdot ({\varvec{w}}^k - {\varvec{w}}^*) \,\mathrm {d}\tau \right) \\&\quad = \; {\varvec{w}}^k - {\varvec{w}}^* -\eta _k ({\varvec{A}}_\sigma ^n)^{-1} \left( \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\,\mathrm {d}\tau \cdot ({\varvec{w}}^k - {\varvec{w}}^*)\right) \\&\quad = \; ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \left( {\varvec{I}}- \eta _k ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\,\mathrm {d}\tau \, ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}}\right) \\&\qquad \times ({\varvec{A}}_\sigma ^n)^{\frac{1}{2}} ({\varvec{w}}^k - {\varvec{w}}^*) \end{aligned}$$

Therefore,

$$\begin{aligned}&\Vert {\varvec{w}}^{k+1} - {\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n} \\&\quad \le \left\| {\varvec{I}}- \eta _k ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \int _0^1 \nabla ^2 f({\varvec{w}}^* + \tau ({\varvec{w}}^{k} - {\varvec{w}}^*))\,\mathrm {d}\tau \, ({\varvec{A}}_\sigma ^n)^{-\frac{1}{2}} \right\| \Vert {\varvec{w}}^k - {\varvec{w}}^*\Vert _{{\varvec{A}}_\sigma ^n}. \end{aligned}$$

So if \(\eta _k \Vert \nabla ^2 f\Vert \le \frac{1}{\Vert ({\varvec{A}}_\sigma ^n)^{-1}\Vert } = 1\), the result follows. \(\square \)

Remark 5

The convergence result in Proposition 4 is also called \(H_\sigma ^n\)-convergence. This is because \(\langle {\varvec{u}}, {\varvec{A}}_\sigma ^n {\varvec{u}}\rangle = \Vert {\varvec{u}}\Vert ^2 + \sigma \Vert {\varvec{D}}_+^n {\varvec{u}}\Vert ^2 = \Vert {\varvec{u}}\Vert ^2_{H_\sigma ^n}\).

8 Discussion and conclusion

8.1 Some more properties of Laplacian smoothing

In Theorem 1, we established a high probability estimate of the LS operator's effect on the \(\ell _2\) norm of a given vector. An \(\ell _1\)-type high probability estimate can be established in the same way. These estimates are helpful for developing privacy-preserving optimization algorithms that train ML models with improved utility without sacrificing the privacy guarantee [45].

Regarding the \(\ell _1\)/\(\ell _2\) estimates of the LS operator, we further have the following results.

Proposition 8

Given vectors \({\varvec{g}}\) and \({\varvec{d}}= {\varvec{A}}_\sigma ^{-1}{\varvec{g}}\), for any \(p \in \mathbb {N}\), it holds that \( \Vert {\varvec{D}}_+^p{\varvec{d}}\Vert _1 \le \Vert {\varvec{D}}_+^p{\varvec{g}}\Vert _1. \) The inequality is strict unless \({\varvec{D}}_+^p {\varvec{g}}\) is a constant vector.

Proof

Observe that \({\varvec{A}}_\sigma \) and \({\varvec{D}}_+\) commute; therefore, for any \(p\in \mathbb {N}\), \({\varvec{A}}_\sigma ({\varvec{D}}_+^p{\varvec{d}}) = {\varvec{D}}_+^p{\varvec{g}}\). Thus, we have

$$\begin{aligned} (1+2\sigma )({\varvec{D}}_+^p {\varvec{d}})_i = ({\varvec{D}}_+^p {\varvec{g}})_i + \sigma ({\varvec{D}}_+^p {\varvec{d}})_{i+1} + \sigma ({\varvec{D}}_+^p{\varvec{d}})_{i-1}. \end{aligned}$$

So

$$\begin{aligned} (1+2\sigma )|({\varvec{D}}_+^p {\varvec{d}})_i| \le |({\varvec{D}}_+^p {\varvec{g}})_i| + \sigma |({\varvec{D}}_+^p {\varvec{d}})_{i+1}| + \sigma |({\varvec{D}}_+^p{\varvec{d}})_{i-1}|. \end{aligned}$$

The inequality is strict if there are sign changes among the \(({\varvec{D}}_+^p {\varvec{d}})_{i-1}\), \(({\varvec{D}}_+^p {\varvec{d}})_{i}\), \(({\varvec{D}}_+^p {\varvec{d}})_{i+1}\). Summing over i and using periodicity, we have

$$\begin{aligned} (1+2\sigma )\sum _{i=1}^m|({\varvec{D}}_+^p {\varvec{d}})_i| \le \sum _{i=1}^m|({\varvec{D}}_+^p {\varvec{g}})_i| + 2\sigma \sum _{i=1}^m|({\varvec{D}}_+^p {\varvec{d}})_{i}|, \end{aligned}$$

and the result follows. The inequality is strict unless \({\varvec{D}}_+^p {\varvec{g}}\) is a constant vector. \(\square \)

Proposition 5

Given any vector \({\varvec{g}}\in \mathbb {R}^m\) and \({\varvec{d}}=({\varvec{A}}_\sigma ^n)^{-1} {\varvec{g}}\), then

$$\begin{aligned} \Vert {\varvec{g}}\Vert ^2 = \Vert {\varvec{d}}\Vert ^2 + 2\sigma \Vert {\varvec{D}}_+^n {\varvec{d}}\Vert ^2 + \sigma ^2 \Vert {\varvec{L}}^n {\varvec{d}}\Vert ^2, \end{aligned}$$
(18)

hence the variance of \({\varvec{d}}\) is much less than that of \({\varvec{g}}\).

Proof

Observe that \({\varvec{g}}= {\varvec{A}}_\sigma ^n {\varvec{d}}= {\varvec{d}}+ (-1)^n\sigma {\varvec{L}}^n d\). Therefore,

$$\begin{aligned} \Vert {\varvec{g}}\Vert ^2= & {} \left\langle {\varvec{d}}+ (-1)^n\sigma {\varvec{L}}^n {\varvec{d}}, {\varvec{d}}+ (-1)^n\sigma {\varvec{L}}^n {\varvec{d}}\right\rangle \nonumber \\= & {} \Vert {\varvec{d}}\Vert ^2 + 2 (-1)^n \sigma \langle {\varvec{d}}, {\varvec{L}}^n {\varvec{d}}\rangle + \sigma ^2 \Vert {\varvec{L}}^n {\varvec{d}}\Vert ^2. \end{aligned}$$
(19)

Next, note that \({\varvec{D}}_-\) and \({\varvec{D}}_+\) commute; thus,

$$\begin{aligned} {\varvec{L}}^n = \underbrace{({\varvec{D}}_-{\varvec{D}}_+)\cdots ({\varvec{D}}_-{\varvec{D}}_+)}_{n} = \underbrace{{\varvec{D}}_-\cdots {\varvec{D}}_-}_{n} \underbrace{{\varvec{D}}_+\cdots {\varvec{D}}_+}_{n} = {\varvec{D}}_-^n {\varvec{D}}_+^n. \end{aligned}$$
(20)

Now, we have

$$\begin{aligned} \langle {\varvec{d}}, {\varvec{L}}^n {\varvec{d}}\rangle = \langle {\varvec{d}}, {\varvec{D}}_-^n {\varvec{D}}_+^n d \rangle = \langle ({\varvec{D}}_-^n)^T{\varvec{d}}, {\varvec{D}}_+^n {\varvec{d}}\rangle = \langle (-1)^n {\varvec{D}}_+^n {\varvec{d}}, {\varvec{D}}_+^n {\varvec{d}}\rangle = (-1)^n \Vert {\varvec{D}}_+^n {\varvec{d}}\Vert ^2,\nonumber \\ \end{aligned}$$
(21)

where we used (20) in the first equality and \({\varvec{D}}_-=-{\varvec{D}}_+^T\) in the second to last equality.

Substituting (21) into (19) yields (18). \(\square \)

8.2 Conclusion

In this paper, we proposed Laplacian smoothing gradient descent and its high-order generalizations. This simple modification dramatically reduces the variance and optimality gap in stochastic gradient descent, allows us to take a larger step size, and helps to find better minima. Extensive numerical examples ranging from toy cases and shallow and deep neural nets to generative adversarial networks and deep reinforcement learning all demonstrate the advantage of the proposed smoothed gradient.