1 Introduction

The gradient method plays an important role in solving large-scale optimization problems. It is widely used in many applications, e.g., machine learning and inverse problems [24]. It is therefore worthwhile to develop efficient stepsizes for gradient methods and to analyze them theoretically.

Consider the unconstrained convex quadratic optimization problem:

$$\begin{aligned} \min _{x\in \mathbf {R}^n}f(x)=b^Tx+\frac{1}{2}x^THx, \end{aligned}$$
(1)

where \(b\in \mathbb {R}^n\) and \(H\in \mathbb {R}^{n\times n}\) is symmetric positive definite. We want to minimize f(x) by means of gradient methods. Let \(x_k\) be the iterate obtained at the kth iteration, and \(g_k=\nabla f(x_k)=Hx_k+b\) be the gradient vector at \(x_k\). The iteration of the gradient method is:

$$\begin{aligned} x_{k+1}=x_k-\alpha _k g_k. \end{aligned}$$
(2)

Different choices of the stepsize \(\alpha _k\) lead to different performance. One particular choice is the exact linesearch stepsize, also called the Cauchy stepsize [2, 3]:

$$\begin{aligned} \alpha _k^{*}={\text {argmin}}_{\alpha >0} f(x_k-\alpha g_k). \end{aligned}$$

It has the closed form

$$\begin{aligned} \alpha _k^{*}=\frac{\Vert g_k\Vert _2^2}{g_k^THg_k}. \end{aligned}$$

With exact linesearch steps, the gradient method becomes the steepest descent method. It converges linearly [23], but suffers from the zigzag phenomenon.

Another choice of stepsize was proposed by Barzilai and Borwein [1]. The BB stepsizes are designed to approximate the inverse of the Hessian matrix and possess a certain quasi-Newton property. This yields two BB stepsizes:

$$\begin{aligned}&\alpha _k^{BB1}=\frac{s_k^Ts_k}{s_k^Ty_k}={\text {argmin}}_{\alpha }\Vert y_k-\alpha ^{-1}s_k\Vert ^2,\\&\alpha _k^{BB2}=\frac{s_k^Ty_k}{y_k^Ty_k}={\text {argmin}}_{\alpha }\Vert s_k-\alpha y_k\Vert ^2, \end{aligned}$$

where \(s_k=x_k-x_{k-1}\), \(y_k=g_k-g_{k-1}\). For 2-dimensional convex quadratic objective functions, Barzilai and Borwein [1] showed that the gradient method (2) with the BB stepsize \(\alpha _k^{BB1}\) converges R-superlinearly and the R-order is \(\sqrt{2}\). Its R-linear convergence rate for general n-dimensional problems is proved in [8].
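
For concreteness, these two classical stepsizes can be evaluated in a few lines. The following NumPy sketch (an illustration with hypothetical helper names, not code from the original paper) computes the Cauchy stepsize and the two BB stepsizes for a random convex quadratic.

```python
import numpy as np

def cauchy_step(H, g):
    """Exact linesearch (Cauchy) stepsize ||g||^2 / (g^T H g)."""
    return g.dot(g) / g.dot(H.dot(g))

def bb_steps(s, y):
    """Barzilai-Borwein stepsizes from s_k = x_k - x_{k-1} and y_k = g_k - g_{k-1}."""
    bb1 = s.dot(s) / s.dot(y)   # argmin_a ||y - a^{-1} s||^2
    bb2 = s.dot(y) / y.dot(y)   # argmin_a ||s - a y||^2
    return bb1, bb2

# Tiny usage example on a random convex quadratic f(x) = b^T x + 0.5 x^T H x.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                 # symmetric positive definite Hessian
b = rng.standard_normal(n)
x0, x1 = rng.standard_normal(n), rng.standard_normal(n)
g0, g1 = H @ x0 + b, H @ x1 + b
print(cauchy_step(H, g1), *bb_steps(x1 - x0, g1 - g0))
```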

This result has inspired many studies on the gradient method [4,5,6, 8, 11, 15, 17,18,19,20, 25]. For example, Dai [4] suggested an alternate step (AS) method, which alternates between the Cauchy stepsize and the BB stepsize:

$$\begin{aligned} \alpha _k^{AS}=\left\{ \begin{array}{ll} \alpha _k^{*},&{}\quad \mod (k,2)=1; \\ \alpha _k^{BB},&{}\quad \mod (k,2)=0. \end{array} \right. \end{aligned}$$

Yuan [21] proposed a new kind of stepsize (Yuan stepsize) which ensures 3-step termination for minimizing 2-dimensional convex quadratic objective functions:

$$\begin{aligned} \alpha _k=\left\{ \begin{array}{ll} \alpha _k^{Y}=2\left( \sqrt{\left( \frac{1}{\alpha _{k-1}^{*}}-\frac{1}{\alpha _k^{*}}\right) ^2+\frac{4\Vert g_k\Vert _2^2}{ (\Vert s_{k-1}\Vert _2)^2}}+\frac{1}{\alpha _{k-1}^{*}}+\frac{1}{\alpha _k^{*}}\right) ^{-1},&{}\quad \mod (k,t)=0;\\ \alpha _k^{*},&{}\quad {\text {otherwise}}. \end{array} \right. \end{aligned}$$
(3)

Later on, Dai and Yuan [7] proposed a variant strategy of Yuan stepsize, which improves the aforementioned method and keeps the finite termination property for 2-dimensional convex quadratic function minimization:

$$\begin{aligned} \alpha _k^{DY}=\left\{ \begin{aligned}&\alpha _k^{*},&\mod (k,4)=1,2;\\&\alpha _k^{YV}=2\left( \sqrt{\left( \frac{1}{\alpha _{k-1}^{*}}-\frac{1}{\alpha _k^{*}}\right) ^2+\frac{4\Vert g_k\Vert _2^2}{ (\alpha _{k-1}^{*}\Vert g_{k-1}\Vert _2)^2}}+\frac{1}{\alpha _{k-1}^{*}}+\frac{1}{\alpha _k^{*}}\right) ^{-1},&\mod (k,4)=0,3. \end{aligned} \right. \end{aligned}$$
(4)

Within the framework of (3), \(\alpha _k^{Y}\) is equivalent to \(\alpha _k^{YV}\), but in the framework of (4) the stepsize \(\alpha _k^{YV}\) differs from \(\alpha _k^{Y}\). The new method (4), referred to as the DY method, is very competitive with the BB method for large-scale problems. Based on numerical experience, Yuan [22] believed that “a good gradient method would use at least one exact linesearch in every few iterations”. Following this idea, a number of gradient methods have been discussed. De Asmundis et al. [10] modified the DY method and proposed the SDC gradient method. Each loop consists of \(h+l\) inner iterations, where the first h iterations (\(h\ge 2\)) use Cauchy stepsizes and the remaining l iterations (\(l\ge 1\)) use a fixed steplength generated from the Yuan step. The stepsize is updated as follows:

$$\begin{aligned} \alpha _k^{SDC }=\left\{ \begin{array}{ll} \alpha _k^{*},&{} \quad \mod (k,h+l)<h;\\ \alpha _k^{YV},&{}\quad \mod (k,h+l)=h; \\ \alpha _{k-1},&{} \quad {\text {otherwise.}} \end{array} \right. \end{aligned}$$
(5)

The work [10] also investigated the asymptotic spectral behaviour of the DY method, and proved that the SDC algorithm converges R-linearly for minimizing convex quadratic functions. With properly chosen parameters h and l, the SDC method outperforms the DY method.

In [12], di Serafino et al. analyzed several gradient methods, including the SDC method, the adaptive BB method [14] and the limited memory steepest descent (LMSD) method [13]. The methods were also extended to general unconstrained problems. It was shown that gradient methods that take groups of short steps followed by several long steps are efficient for strongly convex quadratic problems. A similar analysis for the steepest descent (SD) method was presented in [16].

Inspired by the SDC method, we propose new gradient methods by modifying (5). The modification involves two aspects. On the one hand, the basic framework is improved so that only two exact linesearch steps are required per loop. On the other hand, several new stepsizes are plugged into the new framework. For 2-dimensional problems, we prove that the corresponding gradient methods either terminate in finitely many iterations or converge R-superlinearly. For general n-dimensional problems (\(n>2\)), we prove that all the proposed gradient methods converge R-linearly. The rest of the paper is organized as follows. The new methods are introduced in Sect. 2. The convergence analysis is given in Sect. 3. In Sect. 4, numerical results show that our new gradient algorithms outperform the state of the art.

2 New gradient methods

In this section, new gradient methods are proposed.

2.1 New framework

Consider problem (1). We construct new gradient methods by proposing new stepsizes \(\alpha _k\). First, a cyclic framework that combines Cauchy steps with fixed steplengths is proposed as follows:

$$\begin{aligned} \alpha _k=\left\{ \begin{array}{ll} \alpha _k^{*}, &{}\quad {\text {if}}\,\mod (k,m)<2; \\ \alpha _{k-1}^{F}, &{}\quad {\text {if}}\, \mod (k,m)=2; \\ \alpha _{k-1}, &{}\quad {\text {otherwise}}. \end{array} \right. \end{aligned}$$
(6)

Here the stepsize \(\alpha _{k}^{F}\) admits different choices, which are analyzed in the next subsection. In each loop, we take 2 Cauchy steps followed by \(m-2\) gradient steps with a fixed stepsize, where the fixed stepsize is computed from the 2 Cauchy steps.

This framework differs from the SDC method [10] in two aspects. First, the new framework requires only 2 exact linesearch computations per loop, while (5) requires \(h+1\). Thus the computational cost is reduced compared to SDC. Second, the key stepsize \(\alpha _{k-1}^{F}\) is constructed from the information of the previous (\((k-1)\)th) iteration rather than the current (kth) iteration, which makes the generated iterates completely different from those of SDC.

2.2 New stepsizes

In the new framework, we apply various types of stepsizes. Here we consider four different choices of \(\alpha _k^F\) as follows:

$$\begin{aligned}&\alpha _k^{F(1)} = \alpha _k^{YV} \,\end{aligned}$$
(7)
$$\begin{aligned}&\alpha _k^{F(2)} =\tilde{\alpha }_k=\left( \frac{1}{\alpha _{k-1}^{*}}+\frac{1}{\alpha _k^{*}}\right) ^{-1}\, ,\end{aligned}$$
(8)
$$\begin{aligned}&\alpha _k^{F(3)} =\alpha _{k}^{Fmin}=\min \{\alpha _{k-1}^{*},\alpha _{k}^{*}\} \, ,\end{aligned}$$
(9)
$$\begin{aligned}&\alpha _k^{F(4)} =\alpha _{k}^{Fmax}=\max \{\alpha _{k-1}^{*},\alpha _{k}^{*}\} \, . \end{aligned}$$
(10)

The stepsizes (7) and (8) were proposed in [7, 9], respectively. As analyzed in [10], the stepsizes (7) and (8) approximate the inverse of the largest eigenvalue and the inverse of the sum of the largest and smallest eigenvalues of the Hessian matrix, respectively. If the approximation is sufficiently accurate, the gradient component corresponding to the largest eigenvalue is reduced significantly, and the problem is essentially reduced to a lower dimensional one. Ideally, such dimension reduction happens frequently. The stepsizes (9) and (10) are proposed in this paper, with the purpose of approximating the inverse eigenvalues of the Hessian matrix H restricted to the 2-dimensional subspace spanned by \(g_k\) and \(g_{k-1}\). Let \(v_k=\frac{g_k}{\Vert g_k\Vert }\) and \(v_{k-1}=\frac{g_{k-1}}{\Vert g_{k-1}\Vert }\). Then the restriction of H to the subspace \(\mathcal {S}={\text {Span}}\{g_k,g_{k-1}\}\) can be expressed as

$$\begin{aligned} H_{\mathcal {S}}=\left( \begin{array}{cc} v_k^THv_k &{} v_k^THv_{k-1} \\ v_{k-1}^THv_k &{} v_{k-1}^THv_{k-1} \end{array} \right) = \left( \begin{array}{cc} \frac{1}{\alpha _k^{*}} &{} v_k^THv_{k-1} \\ v_{k-1}^THv_k &{} \frac{1}{\alpha _{k-1}^{*}} \end{array} \right) . \end{aligned}$$

The eigenvalues of \(H_{\mathcal {S}}\) can be approximated by its diagonal elements \(\frac{1}{\alpha _k^{*}}\) and \(\frac{1}{\alpha _{k-1}^{*}}\), so the stepsizes (9) and (10) approximate their inverses. Compared to the stepsizes (7) and (8), the computational cost of (9) and (10) is lower. In summary, all four stepsizes are designed to approximate the inverses of eigenvalues of the Hessian matrix H; the new framework uses as few Cauchy steps as possible to generate them, and combines them to reduce the problem to successively lower dimensional subspaces.
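
The four candidate stepsizes require only the two most recent Cauchy stepsizes and gradient norms. A minimal sketch is given below (our illustration; the helper name f_stepsizes and the argument names are hypothetical, not from the paper):

```python
import math

def f_stepsizes(a_prev, a_curr, gnorm_prev, gnorm_curr):
    """Candidate fixed stepsizes (7)-(10), built from two consecutive Cauchy
    stepsizes a_prev = alpha_{k-1}^*, a_curr = alpha_k^* and the corresponding
    gradient norms ||g_{k-1}||, ||g_k||."""
    s = 1.0 / a_prev + 1.0 / a_curr
    d = 1.0 / a_prev - 1.0 / a_curr
    # (7): Dai-Yuan stepsize alpha_k^{YV}
    yv = 2.0 / (math.sqrt(d * d + 4.0 * gnorm_curr**2 / (a_prev * gnorm_prev)**2) + s)
    # (8): tilde{alpha}_k = (1/alpha_{k-1}^* + 1/alpha_k^*)^{-1}
    tilde = 1.0 / s
    # (9)-(10): min and max of the two Cauchy stepsizes
    return yv, tilde, min(a_prev, a_curr), max(a_prev, a_curr)

# Illustrative values; for stepsizes coming from actual Cauchy steps the
# ordering (11) tilde <= yv <= min <= max holds.
print(f_stepsizes(0.2, 0.5, 3.0, 1.0))
```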

It is easy to show the following relations among the four stepsizes:

$$\begin{aligned} \tilde{\alpha }_k\le \alpha _{k}^{YV}\le \alpha _k^{Fmin} \le \alpha _k^{Fmax} \, . \end{aligned}$$
(11)

Let \(\lambda _1\ge \cdots \ge \lambda _n>0\) be the eigenvalues of the Hessian matrix H. Then the stepsizes \(\alpha _{k}^{YV}\), \(\alpha _k^{Fmin}\) and \(\alpha _k^{Fmax}\) are bounded by the inverses of the extreme eigenvalues:

$$\begin{aligned} {\lambda _1}^{-1}\le \alpha _{k}^{YV}\le \alpha _k^{Fmin} \le \alpha _k^{Fmax}\le {\lambda _n}^{-1}. \end{aligned}$$
(12)

This follows from the fact that \({\lambda _1}^{-1} \le \alpha _k^* \le {\lambda _n}^{-1} \) holds for all k. In contrast, \(\tilde{\alpha }_k\) may be smaller than \({\lambda _1}^{-1}\).

Plugging (7)–(10) into the stepsize rule (6) gives four different algorithms. We refer to the new gradient methods as Alg. 1 to Alg. 4, corresponding to (7)–(10), respectively. The detailed framework of Alg. 1 is shown in Algorithm 1, and the frameworks of the other three algorithms are similar.

Algorithm 1 (pseudocode figure)

The proposed algorithms make full use of the computed Cauchy steps. In each loop, only two Cauchy steps are computed in the first two iterations, and the fixed steplength \(\alpha _{k-1}^F\) is then obtained from these two Cauchy steps with a simple calculation. In this way, the computational cost is reduced compared to the state of the art such as SDC.
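
Since the pseudocode of Algorithm 1 appears only as a figure, the following NumPy sketch reconstructs the framework (6) from the description above (the function name, parameter names and the stepsize keyword are our own; stepsize="YV" corresponds to Alg. 1 and "tilde", "min", "max" to Alg. 2 to Alg. 4):

```python
import numpy as np

def new_gradient_method(H, b, x0, m=10, stepsize="YV", tol=1e-8, max_iter=100000):
    """Gradient method with the cyclic framework (6): two Cauchy steps per loop,
    then m-2 steps with a fixed steplength built from those two Cauchy steps."""
    x = x0.copy()
    g = H @ x + b
    a_prev = a_curr = None        # last two Cauchy stepsizes
    gn_prev = gn_curr = None      # corresponding gradient norms
    alpha = None
    for k in range(max_iter):
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            return x, k
        if k % m < 2:             # Cauchy (exact linesearch) step
            alpha = gnorm**2 / (g @ (H @ g))
            a_prev, a_curr = a_curr, alpha
            gn_prev, gn_curr = gn_curr, gnorm
        elif k % m == 2:          # fixed steplength from the two previous Cauchy steps
            s = 1.0 / a_prev + 1.0 / a_curr
            d = 1.0 / a_prev - 1.0 / a_curr
            if stepsize == "YV":          # Alg. 1, stepsize (7)
                alpha = 2.0 / (np.sqrt(d * d + 4.0 * gn_curr**2
                                       / (a_prev * gn_prev)**2) + s)
            elif stepsize == "tilde":     # Alg. 2, stepsize (8)
                alpha = 1.0 / s
            elif stepsize == "min":       # Alg. 3, stepsize (9)
                alpha = min(a_prev, a_curr)
            else:                         # Alg. 4, stepsize (10)
                alpha = max(a_prev, a_curr)
        # otherwise: keep alpha_{k-1}
        x = x - alpha * g
        g = H @ x + b
    return x, max_iter
```

For instance, calling new_gradient_method(H, b, x0, m=10, stepsize="min") yields a sketch of Alg. 3 on a given quadratic.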

3 Theoretical analysis

In this section we analyze the convergence properties of the new gradient algorithms proposed in Sect. 2. The 2-dimensional case and the general n-dimensional case \((n>2)\) are treated separately.

3.1 Convergence analysis for 2-dimensional problems

Assume that the four proposed gradient algorithms are applied to problem (1) with \(n=2\). We consider the special 2-dimensional case to show that the proposed algorithms are able to avoid the zigzag phenomenon caused by Cauchy steps.

First, we show that Alg. 1 has the finite termination property.

Theorem 1

Apply the gradient method Alg. 1 to problem (1) with \(n=2\), starting from an arbitrary initial point \(x_0\). Suppose that \(\lambda _{1}=\lambda>\lambda _{2}>0\) are the two eigenvalues of the Hessian matrix H of the objective function, and that \(d_1\) and \(d_2\) are the corresponding orthonormal eigenvectors. The gradient vector \(g_{k}=Hx_k+b\) can be represented as

$$\begin{aligned} g_k = \mu _1^{k} d_1 + \mu _2^{k} d_2 \end{aligned}$$

with \(\mu _i^{k}= d_i^T g_k \, (i=1,2)\). If the initial point \(x_0\) satisfies \(\mu _1^{0}\ne 0\) and \(\mu _2^{0}\ne 0\), then the method terminates in finitely many iterations.

Proof

Without loss of generality we assume that \(d_1= (1,0 )^T\), \(d_2= (0,1)^T\) and \(\lambda _2 = 1\). In this case, \(g_k = (\mu _1^{k}, \mu _2^{k})^T\) and

$$\begin{aligned} H=\left( \begin{array}{cc} \lambda &{} \quad 0 \\ 0 &{}\quad 1 \end{array} \right) . \end{aligned}$$

Suppose that \(\mu _1^{k}\ne 0\); otherwise the problem degenerates to a one-dimensional subspace, and the algorithm terminates after the next Cauchy step. Define

$$\begin{aligned} q_{k}=\frac{(\mu _2^{k})^2}{(\mu _1^{k})^2}. \end{aligned}$$
(13)

Since \(g_{k+1}=(I-\alpha _kH)g_{k}\), we can establish that

$$\begin{aligned} \mu _1^{k+1}=(1-\lambda \alpha _k)\mu _1^{k},\mu _2^{k+1}=(1-\alpha _k)\mu _2^{k} \end{aligned}$$
(14)

and

$$\begin{aligned} \Vert g_{k+1}\Vert _2^2=\frac{1+q_{k+1}}{1+q_{k}}(1-\lambda \alpha _k)^2\Vert g_{k}\Vert _2^2. \end{aligned}$$
(15)

For Cauchy steps, we have

$$\begin{aligned} \alpha _{k}^{*}=\frac{\Vert g_{k}\Vert _2^2}{g_{k}^THg_{k}}=\frac{1+q_{k}}{\lambda +q_{k}} \end{aligned}$$

and

$$\begin{aligned} q_{k+1}=\frac{1}{q_{k}}. \end{aligned}$$
(16)

From the update strategy (6), we have \(\alpha _0=\alpha _0^{*}=\frac{1+q_{0}}{\lambda +q_{0}}\) and \(\alpha _1=\alpha _1^{*}=\frac{1+q_{1}}{\lambda +q_{1}}=\frac{1+q_{0}}{\lambda q_{0}+1}\). Introducing

$$\begin{aligned}&\tilde{\alpha }_1=\left( \frac{1}{\alpha _{0}^{*}}+\frac{1}{\alpha _1^{*}}\right) ^{-1}=\left( \frac{\lambda +q_{0}}{1+q_{0}}+\frac{\lambda q_{0}+1}{1+q_{0}}\right) ^{-1}=\frac{1}{\lambda +1},\\&\rho _1=\frac{1}{\alpha _{0}^{*}\alpha _1^{*}}-\frac{\Vert g_1\Vert _2^2}{(\alpha _{0}^{*}\Vert g_{0}\Vert _2)^2}=\frac{(\lambda +q_{0})(\lambda q_{0}+1)}{(1+q_{0})^2}-\frac{q_0(\lambda -1)^2}{(1+q_{0})^2}=\lambda , \end{aligned}$$

we can deduce that

$$\begin{aligned} \alpha _{2} =\alpha _{1}^{YV}=2(\sqrt{(\lambda +1)^2-4\lambda }+\lambda +1)^{-1}=\frac{1}{\lambda } \, . \end{aligned}$$

Then (14) implies that \(\mu _{1}^{3}=0\). Consequently \(\mu _{1}^{k}\) remains 0 for all \(k\ge 3\). The remaining gradient component \(\mu _{2}^{k}\) is then eliminated by the exact linesearch at iteration \(k=m\), the first Cauchy step of the next loop. Thus Alg. 1 terminates in at most \(m+1\) iterations. \(\square \)
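
The identity \(\alpha _{2}=\alpha _{1}^{YV}=\frac{1}{\lambda }\) and the vanishing of \(\mu _1^{3}\) can be checked numerically. The short sketch below (an illustration with an arbitrary \(\lambda \) and initial gradient; it is not part of the proof) replays the first three iterations of Alg. 1 on the 2-dimensional model problem:

```python
import math

lam = 7.0                          # lambda_1 = lam > lambda_2 = 1, so H = diag(lam, 1)
g0 = [3.0, -2.0]                   # g_0 = (mu_1^0, mu_2^0), both nonzero

def cauchy(g):                     # Cauchy stepsize ||g||^2 / (g^T H g)
    return (g[0]**2 + g[1]**2) / (lam * g[0]**2 + g[1]**2)

def step(g, a):                    # componentwise update (14): g_{k+1} = (I - a H) g_k
    return [(1 - lam * a) * g[0], (1 - a) * g[1]]

a0 = cauchy(g0); g1 = step(g0, a0)     # iterations k = 0, 1: Cauchy steps
a1 = cauchy(g1); g2 = step(g1, a1)

# iteration k = 2: YV stepsize built from the two Cauchy steps, cf. (4)
d = 1 / a0 - 1 / a1
s = 1 / a0 + 1 / a1
a2 = 2 / (math.sqrt(d * d + 4 * (g1[0]**2 + g1[1]**2)
                    / (a0 * math.hypot(g0[0], g0[1]))**2) + s)
g3 = step(g2, a2)

print(a2, 1 / lam)                 # a2 equals 1/lambda
print(g3[0])                       # mu_1^3 is 0 up to rounding
```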

Next the other three proposed gradient algorithms are analyzed.

Theorem 2

Apply Alg. 2, Alg. 3 and Alg. 4 to problem (1) with \(n=2\). Under the assumptions of Theorem 1, the three methods converge R-superlinearly.

Proof

In all three methods, \(\alpha _{mk}=\alpha _{mk}^{*}\), \(\alpha _{mk+1}=\alpha _{mk+1}^{*}\) and \(q_{mk+1}=q_{mk}^{-1}\).

For Alg. 2, due to (16), we have

$$\begin{aligned} \alpha _{mk+i}=\tilde{\alpha }_{mk+1}=\left( \frac{1}{\alpha _{mk}^{*}}+\frac{1}{\alpha _{mk+1}^{*}}\right) ^{-1}=\frac{1}{\lambda +1}, i=2,\ldots ,m-1. \end{aligned}$$

From (13) and (14), we can deduce that

$$\begin{aligned} q_{mk+i}=\left\{ \begin{array}{ll} \frac{1}{q_{mk+i-1}}, &{}\quad {\text {if}}\, i=1,2;\\ \lambda ^{2}q_{mk+i-1}, &{}\quad {\text {if}}\, 2< i\le m \end{array} \right. \end{aligned}$$
(17)

and thus conclude that

$$\begin{aligned} q_{mk}=\lambda ^{(2m-4)k}q_0. \end{aligned}$$
(18)

Substituting (17) into (15), we have

$$\begin{aligned} \Vert g_{mk+m}\Vert _2^2=\frac{1+\lambda ^{2m-4}q_{mk}}{1+q_{mk}}\frac{(\lambda -1)^4}{(\lambda +1)^{2m-4}(\lambda +q_{mk}^{-1})^2(\lambda +q_{mk})^2}\Vert g_{mk}\Vert _2^2. \end{aligned}$$

For simplicity, define \(q=q_{mk}\). Since \(\lambda >1\) and \(m>2\), (18) gives \(q\ge q_0\); moreover, for \(k\ge 1\), \(q_{mk+m}=\lambda ^{2m-4}q\le \frac{q^2}{q_0}\). Hence

$$\begin{aligned}&\frac{1+\lambda ^{2m-4}q}{1+q}\cdot \frac{1}{(\lambda +q^{-1})^2(\lambda +q)^2}\\&\quad \le \frac{q_0+q^2}{q_0(1+q)}\cdot \frac{1}{[\lambda ^2+\lambda (q^{-1}+q)+1]^2}\\&\quad \le \frac{q_0+q^2}{q_0(1+q)}\cdot \frac{1}{\lambda ^2(q_0+q^2)}\le \lambda ^{-2}q_{0}^{-1}q^{-1}. \end{aligned}$$

Thus

$$\begin{aligned} \Vert g_{mk+m}\Vert _2^2\le C_1q_0^{-2}\lambda ^{-(2m-4)k}\Vert g_{mk}\Vert _2^2\le C_{1}^{k}q_0^{-2k}\lambda ^{-(m-2)(k^2+k)}\Vert g_{0}\Vert _2^2, \end{aligned}$$

where \(C_1(\lambda ,m)=\frac{(\lambda -1)^4}{\lambda ^2(\lambda +1)^{2m-4}}\) is a constant. When k goes to infinity,

$$\begin{aligned} \root k \of {\Vert g_{mk+m}\Vert _2}\le \sqrt{C_1}\root k \of {\Vert g_{0}\Vert _2}q_{0}^{-1}\lambda ^{-\frac{m-2}{2}(k+1)}\rightarrow 0. \end{aligned}$$

The above analysis implies that Alg. 2 converges R-superlinearly.

Similarly, we can prove that Alg. 3 and Alg. 4 converge R-superlinearly.\(\square \)
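
The recursions (16)–(18) can also be checked numerically. The sketch below (an illustration with arbitrary \(\lambda \), m and initial gradient components; it is not part of the proof) runs the 2-dimensional recursion for Alg. 2 and compares \(q_{m(k+1)}\) with \(\lambda ^{(2m-4)(k+1)}q_0\):

```python
lam, m = 5.0, 6                    # lambda > 1 and cycle length m > 2 (arbitrary here)
mu = [1.0, 2.0]                    # (mu_1^0, mu_2^0), so q_0 = 4
q0 = (mu[1] / mu[0])**2

for k in range(3):                 # three outer loops of Alg. 2
    for i in range(m):
        if i < 2:                  # two Cauchy steps
            a = (mu[0]**2 + mu[1]**2) / (lam * mu[0]**2 + mu[1]**2)
            if i == 1:             # tilde{alpha} from the two Cauchy steps, equals 1/(lam+1)
                a_fixed = 1.0 / (1.0 / a_prev + 1.0 / a)
            a_prev = a
        else:                      # fixed steplength for the rest of the loop
            a = a_fixed
        mu = [(1 - lam * a) * mu[0], (1 - a) * mu[1]]
    q = (mu[1] / mu[0])**2
    print(q, lam**((2 * m - 4) * (k + 1)) * q0)   # the two values agree, cf. (18)
```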

3.2 Convergence analysis for n-dimensional problems

Besides the convergence results for 2-dimensional problems, we also analyze the convergence property of the proposed gradient algorithms for general n-dimensional problems.

Lemma 1

Suppose \(d_1,\ldots ,d_n\) are the orthonormal eigenvectors of H corresponding to \(\lambda _1\ge \cdots \ge \lambda _n>0\), and that the proposed gradient methods are applied to problem (1). Suppose that the initial point \(x_0\) satisfies \(g_0^{T}d_1\ne 0\) and \(g_0^{T}d_n\ne 0\). Let \(g_k=\sum _{i=1}^{n}\mu _{i}^{k}d_i\) and define \(h(k,l)=\sum _{i=l}^{n}(\mu _{i}^{k})^2\). Then the stepsizes of all the proposed gradient methods satisfy the following two properties:

  1. (i)

    there exists a constant \(M_1\ge \lambda _n\), such that \(\lambda _n\le \alpha _{k}^{-1}\le M_1\) holds for all \(k\ge 0\);

  2. (ii)

    there exist a positive integer \(k_0\) and a positive constant \(M_2\) such that, for any \(l\in \{2,\ldots ,n\}\) and \(\epsilon >0\), if \(h(k-j,l)<\epsilon \) and \((\mu _{l-1}^{k-j})^2\ge M_2\epsilon \) hold for all \(j\in \{0,\ldots ,\min \{k,k_0\}\}\), then \(\alpha _{k}^{-1}\ge \frac{2}{3}\lambda _{l-1}\).

Proof

First, we prove that all four proposed algorithms satisfy property (i). By the framework (6), it suffices to show that both the Cauchy stepsizes and the stepsizes \(\alpha _k^{F}\) used in each method satisfy property (i).

Since \(\alpha _{k}^{*}=\frac{g_k^{T}g_k}{g_k^{T}Hg_k}\), it is trivial to show that

$$\begin{aligned} \lambda _n\le (\alpha _{k}^{*})^{-1}\le \lambda _1. \end{aligned}$$
(19)

For Alg. 1 and Alg. 2, we have from (11) that

$$\begin{aligned} \lambda _n\le \min \{(\alpha _{k}^{*})^{-1},(\alpha _{k-1}^{*})^{-1}\} \le (\alpha _{k}^{YV})^{-1}\le (\widetilde{\alpha }_k)^{-1}=(\alpha _{k}^{*})^{-1}+(\alpha _{k-1}^{*})^{-1}\le 2\lambda _1. \end{aligned}$$

Let \(M_1=2\lambda _1\). We conclude that both algorithms satisfy property (i).

For Alg. 3 and Alg. 4, (19) holds, thus

$$\begin{aligned} \lambda _n\le (\alpha _{k}^{Fmax})^{-1}=\min \{(\alpha _{k}^{*})^{-1},(\alpha _{k-1}^{*})^{-1}\}\le (\alpha _{k}^{Fmin})^{-1}=\max \{(\alpha _{k}^{*})^{-1},(\alpha _{k-1}^{*})^{-1}\}\le \lambda _1. \end{aligned}$$

Let \(M_1=\lambda _1\), and property (i) holds for both methods.

Next, we prove that property (ii) holds for all the four proposed algorithms.

For Alg. 1, Alg. 2, and Alg. 3, from (11), we have that

$$\begin{aligned} \widetilde{\alpha }_k\le \alpha _{k}^{YV}\le \alpha _{k}^{Fmin}\le \alpha _{k}^{*}. \end{aligned}$$

For these three algorithms, given k, if \(\mod (k,m)\ge 2\), then \(\alpha _{k}=\alpha _{r+1}=\alpha _{r}^{F}\le \alpha _{r}^{*}\), where \(r={\text {argmax}}\{i\le k\,|\,\alpha _{i}=\alpha _{i}^{*}\}\) is the index of the most recent Cauchy step up to iteration k; if \(\mod (k,m)<2\), then \(\alpha _{k}=\alpha _{k}^{*}\) and \(r=k\). In either case \(\alpha _{k}^{-1}\ge (\alpha _{r}^{*})^{-1}\). Let \(k_0=m-2\) and \(M_2=2\). Then \(\max \{k-k_0,0\}\le r\le k\), and we can deduce that

$$\begin{aligned} (\alpha _{r}^{*})^{-1}= & {} \frac{g_r^THg_r}{g_r^Tg_r}=\frac{\sum _{i=1}^{n}(\mu _{i}^{r})^2\lambda _{i}}{\sum _{i=1}^{n}(\mu _{i}^{r})^2} \ge \frac{\sum _{i=1}^{l-1}(\mu _{i}^{r})^2\lambda _{i}}{\sum _{i=1}^{l-1}(\mu _{i}^{r})^2+h(r,l)}\nonumber \\\ge & {} \frac{\lambda _{l-1}\sum _{i=1}^{l-1}(\mu _{i}^{r})^2}{\sum _{i=1}^{l-1}(\mu _{i}^{r})^2+\epsilon } =\frac{\lambda _{l-1}}{1+\frac{\epsilon }{\sum _{i=1}^{l-1}(\mu _{i}^{r})^2}}\nonumber \\\ge & {} \frac{\lambda _{l-1}}{1+\frac{\epsilon }{(\mu _{l-1}^{r})^2}}\ge \frac{\lambda _{l-1}}{1+\frac{\epsilon }{2\epsilon }}=\frac{2}{3}\lambda _{l-1}. \end{aligned}$$
(20)

Thus \(\alpha _{k}^{-1}\ge (\alpha _{r}^{*})^{-1}\ge \frac{2}{3}\lambda _{l-1}\). The considered three algorithms satisfy property (ii).

For Alg. 4, we have

$$\begin{aligned} (\alpha _{k}^{Fmax})^{-1}\ge \min \{(\alpha _{k}^{*})^{-1},(\alpha _{k-1}^{*})^{-1}\}. \end{aligned}$$

Given k, \(\alpha _{k}^{-1}\ge (\alpha _{r_{0}}^{*})^{-1}\) holds, where

$$\begin{aligned} r_{0}=\left\{ \begin{array}{ll} r, &{}\quad {\text {if}}\, (\alpha _{r}^{*})^{-1}\le (\alpha _{r-1}^{*})^{-1};\\ r-1, &{}\quad {\text {otherwise}} \end{array} \right. \end{aligned}$$

and \(r={\text {argmax}}\{i\le k\,|\,\alpha _{i}=\alpha _{i}^{*}\}\). Let \(k_0=m-1\) and \(M_2=2\). Then \(\max \{k-k_0,0\}\le r_{0}\le k\). Similarly to (20), we can deduce that \(\alpha _{k}^{-1}\ge (\alpha _{r_{0}}^{*})^{-1}\ge \frac{2}{3}\lambda _{l-1}\). Thus we conclude that property (ii) also holds for this algorithm.\(\square \)

Theorem 3

[4, Theorem 4.1] Under the same assumptions as in Lemma 1, if the stepsizes satisfy properties (i) and (ii), then the gradient method either terminates in finitely many iterations or converges R-linearly.

Combining Theorem 3 with Lemma 1, we conclude that all four proposed gradient algorithms either terminate in finitely many iterations or converge R-linearly when applied to n-dimensional convex quadratic functions.

4 Numerical results

In this section, we report the performance of the proposed algorithms. The algorithms are compared with several gradient methods mentioned in Sect. 1. For all the generated test problems, the Hessian matrix is diagonal, \(H={\text {Diag}}(\lambda _1,\ldots ,\lambda _n)\). For each setting, we generate 10 instances and report the average iteration number as well as the average computational time. All tests are run in Matlab R2014a on an Intel Core i7-6700HQ.

For Alg. 1, two different versions are implemented. In version A, we use \(\alpha _k^{YV}\) computed by (4), while in version B we replace \(\alpha _k^{YV}\) by

$$\begin{aligned} \alpha _k^{New} = \tilde{\alpha }_k [ 1 + \rho _k (\alpha _k^{YV} )^2 ] \, , \end{aligned}$$

where \(\rho _k=\frac{1}{\alpha _{k-1}^{*}\alpha _k^{*}}-\frac{\Vert g_k\Vert _2^2}{(\alpha _{k-1}^{*}\Vert g_{k-1}\Vert _2)^2}\). Theoretically \(\alpha _k^{New} = \alpha _k^{YV}\), but in finite precision arithmetic the two versions give slightly different numerical performance. Here we report the better of the two results.
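
The identity \(\alpha _k^{New}=\alpha _k^{YV}\) holds because \(\alpha _k^{YV}\) is a root of \(\rho _k x^2-\big (\frac{1}{\alpha _{k-1}^{*}}+\frac{1}{\alpha _k^{*}}\big )x+1=0\), so \(\tilde{\alpha }_k[1+\rho _k(\alpha _k^{YV})^2]=\alpha _k^{YV}\). A quick numerical check (with arbitrary illustrative values, not data from the experiments) is:

```python
import math

# arbitrary illustrative values for alpha_{k-1}^*, alpha_k^*, ||g_{k-1}||, ||g_k||
a_prev, a_curr, gn_prev, gn_curr = 0.19, 0.35, 4.2, 1.3

S = 1 / a_prev + 1 / a_curr
rho = 1 / (a_prev * a_curr) - gn_curr**2 / (a_prev * gn_prev)**2
d = 1 / a_prev - 1 / a_curr

a_yv = 2 / (math.sqrt(d * d + 4 * gn_curr**2 / (a_prev * gn_prev)**2) + S)
a_new = (1 / S) * (1 + rho * a_yv**2)

print(a_yv, a_new)   # the two values agree up to rounding error
```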

We consider four kinds of test problems. Given the dimension n and the condition number \(\kappa \), their detailed settings are as follows (a small generation sketch for one of them is given after the list):

  • Test 1. [10] In the Hessian matrix, we set \(\lambda _1=\kappa \), \(\lambda _n=1\); \(\lambda _j\) are randomly generated in \([1,\kappa ]\), for \(j=2,\ldots ,n-1\); b is randomly generated in \([-5,5]^n\). The stopping parameter \(\epsilon =10^{-8}\Vert g_0\Vert \).

  • Test 2. [10, 12] Let \(\lambda _j=\kappa ^{\frac{n-j}{n-1}}\), for \(j=1,\ldots ,n\), and b be a zero vector. The initial point \(x_0\) is randomly generated in \([-5,5]^n\).

  • Test 3. [12] Let \(\lambda _j=1+(\kappa -1)s_j\) and b be a zero vector. \(s_j\) is randomly generated from [0.8, 1] for \(j=1,\ldots ,\frac{n}{2}\), and randomly generated from [0, 0.2] for \(j=\frac{n}{2}+1,\ldots ,n\). The stopping parameter \(\epsilon =10^{-6}\). The initial point \(x_0\) is randomly generated on a unit sphere.

  • Test 4. [16] Let \(\lambda _j=\frac{\kappa }{2}(\cos \frac{n-j}{n-1}\pi +1)\) and b be a zero vector. The stopping parameter \(\epsilon =10^{-6}\). The initial point \(x_0\) is randomly generated on a unit sphere.
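
The paper's experiments are implemented in Matlab; purely as an illustration, the diagonal test problems above are straightforward to reproduce, e.g. “Test 2” in NumPy (a sketch under the settings listed above, with hypothetical function and variable names):

```python
import numpy as np

def make_test2(n, kappa, seed=0):
    """'Test 2': lambda_j = kappa^((n-j)/(n-1)), b = 0, x0 uniform in [-5, 5]^n."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, n + 1)
    lam = kappa ** ((n - j) / (n - 1))   # diagonal entries of H, from kappa down to 1
    H = np.diag(lam)
    b = np.zeros(n)
    x0 = rng.uniform(-5.0, 5.0, size=n)
    return H, b, x0

H, b, x0 = make_test2(n=1000, kappa=1e4)
```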

First, the behaviour of the proposed algorithms is explored. Figure 1 depicts the average iteration numbers of the four algorithms with respect to different values of the parameter m on “Test 2”. Here \(n=10^3\), \(\kappa =10^4\) and the stopping parameter \(\epsilon =10^{-6}\). It shows that Alg. 1 and Alg. 3 with \(m=10\) outperform the other settings. We also plot the function-value curves of the four algorithms in Fig. 2, for “Test 1” with \(n=10^3\), \(\kappa =10^4\), \(m=10\) and \(x_0\) randomly generated in \([-5,5]^n\). It indicates that all four algorithms are non-monotone. The good performance of Alg. 1 and Alg. 3 is observed again.

Fig. 1 Iteration number with respect to different m for “Test 2”

Fig. 2 Iteration curves for “Test 1”

Table 1 Comparison with the existing methods for different test problems

Next, we test problems with various dimensions and condition numbers. We compare the proposed algorithms with the effective gradient methods analyzed in [12], such as SDC, \({\text {ABB}}_{\min }\) and LMSD. Let \(m=10\). The parameters of SDC, \({\text {ABB}}_{\min }\) and LMSD are the same as in [12]. In “Test 1”, \(n=10^4\), \(\kappa =10^4\) and \(x_0\) is randomly generated in \([-5,5]^n\); in “Test 2”, \(n=10^4\), \(\kappa =10^6\) and \(\epsilon =10^{-8}\Vert g_0\Vert \); in “Test 3”, \(n=10^3\) and \(\kappa =10^3\); in “Test 4”, \(n=10^3\) and \(\kappa =10^5\). Table 1 reports the iteration numbers and the computational time of the compared methods for the four kinds of test problems, with the best results marked in boldface. It turns out that Alg. 1 and Alg. 3 generally outperform the compared algorithms in terms of computational time. Due to the fixed steplengths in each loop, the proposed algorithms have a lower per-iteration computational cost, and consequently a lower total computational cost compared to the state of the art. In contrast, Alg. 2 and Alg. 4 are outperformed by the compared algorithms. The reason may be that the stepsize (8) in Alg. 2 is sometimes too short, while (10) in Alg. 4 is sometimes too long. From the four tests, we also observe that the computational cost of all the considered gradient methods increases with the dimension and the condition number.

Across various types of problems and settings, the above numerical tests indicate the superior performance of the proposed algorithms Alg. 1 and Alg. 3.

5 Conclusion

In this paper, we proposed four gradient methods for convex quadratic programming. First, a new framework to update the stepsizes was proposed: in each outer loop, only 2 exact linesearch steps are computed and the remaining \(m-2\) iterations use fixed steplengths. Four different stepsizes were applied within the new framework, resulting in four new gradient algorithms. Theoretically, for 2-dimensional convex quadratic minimization, we proved that the proposed algorithms either terminate in finitely many iterations or converge R-superlinearly. For general n-dimensional problems, we proved that all the new algorithms converge R-linearly. Numerical results indicate the effectiveness and robustness of the proposed Alg. 1 and Alg. 3.