A family of spectral gradient methods for optimization

Dai, Yu-Hong; Huang, Yakui; Liu, Xin-Wei

doi:10.1007/s10589-019-00107-8

A family of spectral gradient methods for optimization

Published: 06 May 2019

Volume 74, pages 43–65, (2019)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Computational Optimization and Applications Aims and scope Submit manuscript

A family of spectral gradient methods for optimization

Download PDF

998 Accesses
35 Citations
Explore all metrics

Abstract

We propose a family of spectral gradient methods, whose stepsize is determined by a convex combination of the long Barzilai–Borwein (BB) stepsize and the short BB stepsize. Each member of the family is shown to share certain quasi-Newton property in the sense of least squares. The family also includes some other gradient methods as its special cases. We prove that the family of methods is R-superlinearly convergent for two-dimensional strictly convex quadratics. Moreover, the family is R-linearly convergent in the any-dimensional case. Numerical results of the family with different settings are presented, which demonstrate that the proposed family is promising.

A family of optimal weighted conjugate-gradient-type methods for strictly convex quadratic minimization

Article 25 November 2021

Two spectral conjugate gradient methods for unconstrained optimization problems

Article 11 April 2022

A Modified Spectral Conjugate Gradient Method with Global Convergence

Article 06 April 2019

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Consider the unconstrained optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}~~f(x), \end{aligned}$$

(1)

where $f(x): \mathbb {R}^n\rightarrow \mathbb {R}$ is a continuously differentiable function. The gradient method solves problem (1) by updating iterates as

$$\begin{aligned} x_{k+1}=x_k-\alpha _kg_k, \end{aligned}$$

(2)

where $g_k=\nabla f(x_k)$ and $\alpha _k>0$ is the stepsize. Different gradient methods use different formulae for stepsizes.

The simplest gradient method is the steepest descent (SD) method due to Cauchy [6], which computes the stepsize by exact line search,

$$\begin{aligned} \alpha _k^{SD}=\arg \min _{\alpha \in \mathbb {R}}~f(x_k-\alpha g_k). \end{aligned}$$

As is well known, every two consecutive gradients generated by the SD method are perpendicular to each other. Moreover, if f(x) is a strictly convex quadratic function, i.e.,

$$\begin{aligned} f(x)=\frac{1}{2}x^TAx-b^Tx, \end{aligned}$$

(3)

where $A\in \mathbb {R}^{n\times n}$ is symmetric positive definite and $b\in \mathbb {R}^n$, it can be shown that the gradients will asymptotically reduce to a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the matrix A and hence zigzag occurs, see [1, 32] for more details. This property seriously deteriorates the performance of the SD method, especially when the condition number of A is large.

An important approach that changes our perspectives on the effectiveness of gradient methods is proposed by Barzilai and Borwein [2]. They viewed the updating rule (2) as

$$\begin{aligned} x_{k+1}=x_k-D_kg_k, \end{aligned}$$

(4)

where $D_k=\alpha _kI$. Similar to the quasi-Newton method [19], $D_k^{-1}$ is required to satisfy the secant equation

$$\begin{aligned} B_ks_{k-1}=y_{k-1} \end{aligned}$$

(5)

to approximate the Hessian as possible as it can. Here, $s_{k-1}=x_k-x_{k-1}$ and $y_{k-1}=g_k-g_{k-1}$. However, since $D_k$ is diagonal with identical diagonal elements, it is usually impossible to find an $\alpha _k$ such that $D_k^{-1}$ fulfills (5) if the dimension $n>1$. Thus, Barzilai and Borwein required $D_k^{-1}$ to meet the secant equation in the sense of least squares,

$$\begin{aligned} D_k=\arg \min _{D=\alpha I}~\Vert D^{-1}s_{k-1}-y_{k-1}\Vert , \end{aligned}$$

(6)

which yields

$$\begin{aligned} \alpha _k^{BB1}=\frac{s_{k-1}^Ts_{k-1}}{s_{k-1}^Ty_{k-1}}. \end{aligned}$$

(7)

Here and below, $\Vert \cdot \Vert $ means the Euclidean norm. On the other hand, one can also calculate the stepsize by requiring $D_k$ to satisfy

$$\begin{aligned} H_ky_{k-1}=s_{k-1}. \end{aligned}$$

(8)

That is,

$$\begin{aligned} D_k=\arg \min _{D=\alpha I}~\Vert s_{k-1}-Dy_{k-1}\Vert , \end{aligned}$$

(9)

which gives

$$\begin{aligned} \alpha _k^{BB2}=\frac{s_{k-1}^Ty_{k-1}}{y_{k-1}^Ty_{k-1}}. \end{aligned}$$

(10)

Apparently, when $s_{k-1}^Ty_{k-1}>0$, there holds $\alpha _{k}^{BB1}\ge \alpha _{k}^{BB2}$. In other words, $\alpha _{k}^{BB1}$ is a long stepsize while $\alpha _{k}^{BB2}$ is a short one, which implies that $\alpha _{k}^{BB1}$ is more aggressive than $\alpha _{k}^{BB2}$ in decreasing the objective value. Extensive numerical experiments show that the long stepsize is superior to the short one in many cases, see [4, 22, 34] for example. In what follows we will refer to the gradient method with the long stepsize $\alpha _{k}^{BB1}$ as the BB method without specification.

Barzilai and Borwein [2] proved their method with the short BB stepsize $\alpha _{k}^{BB2}$ is R-superlinearly convergent for two-dimensional strictly convex quadratics. An improved R-superlinear convergence result for the BB method was given by Dai [8]. Global and R-linear convergence of the BB method for general n-dimensional strictly convex quadratics were established by Raydan [33] and Dai and Liao [14], respectively. The BB method has also been extended to solve general nonlinear optimization problems. By incorporating the nonmontone line search proposed by Grippo et al. [26], Raydan [34] developed the global BB method for general unconstrained problems. Later, Birgin et al. [3] proposed the so-called spectral projected gradient method which extends Raydan’s method to smooth convex constrained problems. Dai and Fletcher [11] designed projected BB methods for large-scale box-constrained quadratic programming. Recently, by resorting to the smoothing techniques, Huang and Liu [27] generalized the projected BB method with modifications to solve non-Lipschitz optimization problems.

The relationship between the stepsizes in BB-like methods and the spectrum of the Hessian of the objective function has been explored in several studies. Frassoldati et al. [23] tried to exploit the long BB stepsize close to the reciprocal of the smallest eigenvalue of the Hessian, yielding the ABBmin1 and ABBmin2 methods. De Asmundis et al. [18] developed the so-called SDA method which employs a short stepsize approximates the reciprocal of the largest eigenvalue of the Hessian. Following the line of [18], Gonzaga and Schneider [25] suggested a monotone method for quadratics where the stepsizes are obtained in a way similar to the SD method. De Asmundis et al. [17] proposed the SDC method which exploits the spectral property of Yuan’s stepsize [16]. Kalousek [30] considered the SD method with random stepsizes lying between the reciprocal of the largest eigenvalue and the smallest eigenvalue of the Hessian and analysed the optimal distribution of random stepsizes that guarantees the maximum asymptotic convergence rate.

Applications of the BB method and its variants have largely been developed for problems arising in various different areas including image restoration [37], signal processing [31], eigenvalue problems [29], nonnegative matrix factorization [28], sparse reconstruction [38], machine learning [36], etc. We refer the reader to [4, 7, 13, 20, 22, 39] and references therein for more spectral gradient methods and extensions.

The success of the BB method and its variants motivates us to consider spectral gradient methods. Our goal is to present a family of spectral gradient methods for optimization. Notice that the Broyden class of quasi-Newton methods [5] approximate the inverse of the Hessian by

$$\begin{aligned} H_{k}^\tau =\tau H_k^{BFGS} + (1-\tau )H_k^{DFP}, \end{aligned}$$

(11)

where $\tau \in [0,1]$ is a scalar parameter and $H_k^{BFGS}$ and $H_k^{DFP}$ are the BFGS and DFP matrices, respectively, that satisfy the secant equation (8), which further implies that

$$\begin{aligned} \tau H_k^{BFGS}y_{k-1} + (1-\tau )H_k^{DFP}y_{k-1}=s_{k-1}, \end{aligned}$$

i.e.,

$$\begin{aligned} \tau (H_k^{BFGS}y_{k-1}-s_{k-1}) + (1-\tau )(H_k^{DFP}y_{k-1}-s_{k-1})=0. \end{aligned}$$

(12)

Since the inverse of $H_k^{BFGS}$, say $B_k^{BFGS}$, satisfies (5), we can modify (12) as

$$\begin{aligned} \tau (B_k^{BFGS}s_{k-1}-y_{k-1}) + (1-\tau )(s_{k-1}-H_k^{DFP}y_{k-1})=0. \end{aligned}$$

(13)

Motivated by the above observation, we employ the idea of the BB method to approximate the Hessian and its inverse by diagonal matrices. Particularly, we require the matrix $D=\alpha I$ to be the solution of

$$\begin{aligned} \min _{D=\alpha I}~\Vert \tau (D^{-1}s_{k-1}-y_{k-1})+(1-\tau )(s_{k-1}-D y_{k-1})\Vert . \end{aligned}$$

(14)

In the next section, we will show that the stepsize given by the convex combination of the long BB stepsize $\alpha _k^{BB1}$ and the short BB stepsize $\alpha _k^{BB2}$, i.e.,

$$\begin{aligned} \alpha _k=\gamma _k \alpha _{k}^{BB1}+(1-\gamma _k)\alpha _{k}^{BB2}, \end{aligned}$$

(15)

where $\gamma _k\in [0,1]$, is a solution to (14). Clearly, this is a one-parametric family of stepsizes, which include the two BB stepsizes as special instances. Moreover, any stepsize lies in the interval $[\alpha _{k}^{BB2},\alpha _{k}^{BB1}]$ is a special case of the family. For example, the positive stepsize given by the geometrical mean of $\alpha _{k}^{BB1}$ and $\alpha _{k}^{BB2}$ [9],

$$\begin{aligned} \alpha _k^{P}=\sqrt{\alpha _{k}^{BB1}\alpha _{k}^{BB2}}=\frac{\Vert s_{k-1}\Vert }{\Vert y_{k-1}\Vert }. \end{aligned}$$

(16)

We further prove that the family of spectral gradient methods (15) is R-superlinearly convergent for two-dimensional strictly convex quadratics. For the n-dimensional case, the family is proved to be R-linearly convergent. Numerical results of the family (15) with different settings of $\gamma _k$ are presented and compared with other gradient methods, including the BB method [2], the alternate BB method (ALBB) [11], the adaptive BB method (ABB) [40], the cyclic BB method with stepsize $\alpha _k^{BB1}$ (CBB1) [12], the cyclic BB method with stepsize $\alpha _k^{BB2}$ (CBB2), the cyclic method with stepsize $\alpha _k^{P}$ (CP), the Dai-Yuan method (DY) [16], the ABBmin1 and ABBmin2 methods [23], and the SDC method [17]. The comparisons demonstrate that the proposed family is promising.

The paper is organized as follows. In Sect. 2, we show that each stepsize in the family (15) solves some least squares problem (14) and hence possesses certain quasi-Newton property. In Sect. 3, we establish R-superlinear convergence of the family (15) for two-dimensional strictly convex quadratics and R-linear convergence for the n-dimensional case, respectively. In Sect. 4, we discuss different selection rules for the parameter $\gamma _k$. In Sect. 5, we conduct some numerical comparisons of our approach and other gradient methods. Finally, some conclusions are drawn in Sect. 6.

2 Quasi-Newton property of the family (15)

In this section, we show that each stepsize in the family (15) enjoys certain quasi-Newton property.

For the sake of simplicity, we discard the subscript of $s_{k-1}$ and $y_{k-1}$ in the following part of this section, i.e., $s=s_{k-1}$, $y=y_{k-1}$. Let

$$\begin{aligned} \phi _\tau (\alpha ):{=}\left\| \tau \left( \frac{1}{\alpha }s{-}y\right) {+}(1{-}\tau )(s{-}\alpha y)\right\| ^2. \end{aligned}$$

Then, the derivative of $\phi _\tau (\alpha )$ with respect to $\alpha $ is

$$\begin{aligned} \phi _\tau '(\alpha )&=2[\tau +(1-\tau )\alpha ]\left\{ \left( -\frac{\tau }{\alpha ^3}\right) s^Ts -\left[ (1-\tau )\frac{1}{\alpha } +\left( -\frac{\tau }{\alpha ^2}\right) \right] s^Ty+(1-\tau )y^Ty\right\} . \end{aligned}$$

Proposition 1

If $s^Ty>0$ and $\tau \in [0,1]$, the equation $\phi '_\tau (\alpha )=0$ has a unique root in $[\alpha _k^{BB2},\alpha _k^{BB1}]$.

Proof

We only need to consider the case $\tau \in (0,1)$. Notice that

$$\begin{aligned} \psi (\tau ,\alpha ):&=\frac{1}{2}\frac{\alpha ^3}{\tau +(1-\tau )\alpha }\phi _\tau '(\alpha ) \nonumber \\&=-\tau s^Ts-[(1-\tau )\alpha ^2-\tau \alpha ]s^Ty+(1-\tau )\alpha ^3y^Ty\nonumber \\&=(1-\tau )(\alpha ^3y^Ty-\alpha ^2s^Ty)+\tau (\alpha s^Ty-s^Ts)\nonumber \\&=(1-\tau )y^Ty(\alpha ^3-\alpha ^2\alpha _k^{BB2})+\tau s^Ty(\alpha -\alpha _k^{BB1}). \end{aligned}$$

(17)

If $s^Ty>0$, we have $y\ne 0$ and $y^Ty>0$. This implies that $\psi (\tau ,\alpha _k^{BB2})<0$ and $\psi (\tau ,\alpha _k^{BB1})>0$. Thus, $\psi (\tau ,\alpha )=0$ has a root in $(\alpha _k^{BB2},\alpha _k^{BB1})$. Since $\alpha >0$, we know that the equation $\phi '_\tau (\alpha )=0$ has a root in $[\alpha _k^{BB2},\alpha _k^{BB1}]$.

Now we show the uniqueness of such a root by contradiction. Suppose that $\alpha _1<\alpha _2$ and $\alpha _1,\alpha _2\in [\alpha _k^{BB2},\alpha _k^{BB1}]$ such that $\phi '_\tau (\alpha _1)=0$ and $\phi '_\tau (\alpha _2)=0$. It follows from (17) that

$$\begin{aligned}&(1-\tau )y^Ty(\alpha _1^3-\alpha _1^2\alpha _k^{BB2})+\tau s^Ty(\alpha _1-\alpha _k^{BB1}) \\&\quad =(1-\tau )y^Ty(\alpha _2^3-\alpha _2^2\alpha _k^{BB2})+\tau s^Ty(\alpha _2-\alpha _k^{BB1}). \end{aligned}$$

By rearranging terms, we obtain

$$\begin{aligned}&(1-\tau )y^Ty[(\alpha _1-\alpha _2)(\alpha _1^2+\alpha _1\alpha _2+\alpha _2^2) -(\alpha _1-\alpha _2)(\alpha _1+\alpha _2)\alpha _k^{BB2}]\\&\quad =\tau s^Ty(\alpha _2-\alpha _1). \end{aligned}$$

Since $\alpha _1\ne \alpha _2$, it follows that

$$\begin{aligned} (1-\tau )y^Ty[(\alpha _1^2+\alpha _1\alpha _2+\alpha _2^2) -(\alpha _1+\alpha _2)\alpha _k^{BB2}]=-\tau s^Ty, \end{aligned}$$

which gives $(\alpha _1^2+\alpha _1\alpha _2+\alpha _2^2) -(\alpha _1+\alpha _2)\alpha _k^{BB2}<0$. This is not possible since $\alpha _k^{BB2}\le \alpha _1<\alpha _2$. This completes the proof. $\square $

Proposition 2

If $s^Ty>0$ and $\tau \in [0,1]$, the root of $\phi '_\tau (\alpha )=0$ in $[\alpha _k^{BB2},\alpha _k^{BB1}]$ is monotone with respect to $\tau $.

Proof

It suffices to show the statement holds for $\tau \in (0,1)$. By the proof of Proposition 1, $\alpha $ is an implicit function of $\tau $ determined by the equation $\psi (\tau ,\alpha )=0$. The derivative of $\psi (\tau ,\alpha )$ with respect to $\tau $ is

$$\begin{aligned} \frac{\partial \psi (\tau ,\alpha )}{\partial \tau }=&-y^Ty(\alpha ^3-\alpha ^2\alpha _k^{BB2}) + (1-\tau )y^Ty(3\alpha ^2\cdot \alpha '-2\alpha _k^{BB2}\alpha \cdot \alpha ')\\&+\, s^Ty(\alpha -\alpha _k^{BB1})+ \tau s^Ty\alpha '. \end{aligned}$$

Let $\frac{\partial \psi (\tau ,\alpha )}{\partial \tau }=0$. By simple calculations, we obtain

$$\begin{aligned} \alpha '=\frac{y^Ty(\alpha ^3-\alpha ^2\alpha _k^{BB2})-s^Ty(\alpha -\alpha _k^{BB1})}{(1-\tau )y^Ty(3\alpha ^2-2\alpha _k^{BB2}\alpha )+ \tau s^Ty}. \end{aligned}$$

For $\alpha \in (\alpha _k^{BB2},\alpha _k^{BB1})$, $\alpha '>0$. This completes the proof. $\square $

Theorem 1

For each $\gamma _k\in [0,1]$, the stepsize $\alpha _k$ defined by (15) is a solution of (14).

Proof

We only need to show that, for $\gamma _k\in (0,1)$, $\phi '_{\tau }(\alpha _k)$ vanishes at some $\tilde{\tau }\in (0,1)$. From (17) and (15), we have

$$\begin{aligned} \psi (\tau ,\alpha _k)&=(1-\tau )\alpha _k^2y^Ty(\alpha _k-\alpha _k^{BB2})+\tau s^Ty(\alpha _k-\alpha _k^{BB1})\\&=(1-\tau )\gamma _k\alpha _k^2y^Ty(\alpha _k^{BB1}-\alpha _k^{BB2})+\tau (\gamma _k-1) s^Ty(\alpha _k^{BB1}-\alpha _k^{BB2})\\&=y^Ty[(1-\tau )\gamma _k\alpha _k^2+\tau (\gamma _k-1)\alpha _k^{BB2}](\alpha _k^{BB1}-\alpha _k^{BB2}). \end{aligned}$$

Clearly,

$$\begin{aligned} \tilde{\tau }=\frac{\gamma _k\alpha _k^2}{\gamma _k\alpha _k^2+(1-\gamma _k)\alpha _k^{BB2}}\in (0,1) \end{aligned}$$

is a root of $\psi (\tau ,\alpha _k)=0$. This completes the proof. $\square $

3 Convergence analysis

In this section, we analyze the convergence properties of the family (15) for the quadratic function (3). Since the gradient method (2) is invariant under translations and rotations when applying to problem (3), we assume that the matrix A is diagonal, i.e.,

$$\begin{aligned} A=\text {diag}\{\lambda _1,\lambda _2,\cdots ,\lambda _n\}, \end{aligned}$$

(18)

where $0<\lambda _1\le \lambda _2\le \cdots \le \lambda _n$.

3.1 Two-dimensional case

In this subsection, based on the techniques in [9], we establish the R-superlinear convergence of the family (15) for two-dimensional quadratic functions.

Without loss of generality, we assume that

$$\begin{aligned} A=\left( \begin{array}{cc} 1 &{} ~0 \\ 0 &{} ~\lambda \\ \end{array} \right) ,~~b=0, \end{aligned}$$

where $\lambda \ge 1$. Furthermore, assume that $x_1$ and $x_2$ are such that

$$\begin{aligned} g_1^{(i)}\ne 0,~~g_2^{(i)}\ne 0,~~i=1,2. \end{aligned}$$

(19)

Let

$$\begin{aligned} q_k=\frac{(g_k^{(1)})^2}{(g_k^{(2)})^2}. \end{aligned}$$

Then it follows that

$$\begin{aligned}&\alpha _{k}^{BB1} =\frac{g_{k-1}^Tg_{k-1}}{g_{k-1}^TAg_{k-1}} =\frac{1+q_{k-1}}{\lambda +q_{k-1}},\\&\alpha _{k}^{BB2} =\frac{g_{k-1}^TAg_{k-1}}{g_{k-1}^TA^2g_{k-1}} =\frac{\lambda +q_{k-1}}{\lambda ^2+q_{k-1}}. \end{aligned}$$

Thus, the stepsize (15) can be written as

$$\begin{aligned} \alpha _k&=\gamma _k\frac{1+q_{k-1}}{\lambda +q_{k-1}}+ (1-\gamma _k)\frac{\lambda +q_{k-1}}{\lambda ^2+q_{k-1}}\nonumber \\&=\frac{\gamma _k(1+q_{k-1})(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})^2}{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})}. \end{aligned}$$

(20)

By the update rule (2) and $g_k=Ax_k$, we have

$$\begin{aligned} g_{k+1}=(I-\alpha _kA)g_k. \end{aligned}$$

Thus,

$$\begin{aligned} (g_{k+1}^{(1)})^2&=(1-\alpha _k)^2(g_k^{(1)})^2\nonumber \\&=\frac{(\lambda -1)^2\left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)\lambda (\lambda +q_{k-1})\right] ^2}{(\lambda +q_{k-1})^2(\lambda ^2+q_{k-1})^2} (g_k^{(1)})^2, \end{aligned}$$

(21)

$$\begin{aligned} (g_{k+1}^{(2)})^2&=(1-\lambda \alpha _k)^2(g_k^{(2)})^2\nonumber \\&=\frac{q_{k-1}^2(1-\lambda )^2 \left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})\right] ^2}{(\lambda +q_{k-1})^2(\lambda ^2+q_{k-1})^2} (g_k^{(2)})^2. \end{aligned}$$

(22)

From (21), (22) and the definition of $q_k$, we get

$$\begin{aligned} q_{k+1}=\frac{(g_{k+1}^{(1)})^2}{(g_{k+1}^{(2)})^2} =\left( \frac{\gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)\lambda (\lambda +q_{k-1})}{\gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})}\right) ^2\frac{q_k}{q_{k-1}^2}. \end{aligned}$$

(23)

Let

$$\begin{aligned} h_k(w)=\frac{\gamma _k(\lambda ^2+w)+ (1-\gamma _k)\lambda (\lambda +w)}{\gamma _k(\lambda ^2+w)+ (1-\gamma _k)(\lambda +w)}. \end{aligned}$$

Then we have

$$\begin{aligned}&h_k(0)=\frac{\gamma _k\lambda ^2+ (1-\gamma _k)\lambda ^2}{\gamma _k\lambda ^2+ (1-\gamma _k)\lambda }=\frac{\lambda }{\gamma _k\lambda +1-\gamma _k},\\&\lim _{w\rightarrow \infty }h_k(w)=\frac{\gamma _k+ (1-\gamma _k)\lambda }{\gamma _k+ 1-\gamma _k}=\gamma _k+(1-\gamma _k)\lambda . \end{aligned}$$

Since

$$\begin{aligned} h'_k(w)= \frac{\gamma _k(1-\gamma _k)\lambda (\lambda -1)^2}{(\gamma _k(\lambda ^2+w)+ (1-\gamma _k)(\lambda +w))^2}, \end{aligned}$$

(24)

we obtain $h'(w)>0$ for $\gamma _k\in (0,1)$. Thus,

$$\begin{aligned} h_k(w)\in \left( \frac{\lambda }{\gamma _k\lambda +1-\gamma _k},\gamma _k+(1-\gamma _k)\lambda \right) . \end{aligned}$$

(25)

Denoting $M_k=\log q_k$. By (23), we have

$$\begin{aligned} M_{k+1}=M_k-2M_{k-1}+2\log h_k(q_{k-1}). \end{aligned}$$

(26)

Let $\theta $ such that $\theta ^2-\theta +2=0$. Then, $\theta =\frac{1\pm \sqrt{7}i}{2}$. Denote by

$$\begin{aligned} \xi _k=M_k+(\theta -1)M_{k-1}. \end{aligned}$$

(27)

We have the following result.

Lemma 1

If $\gamma _k\in (0,1)$ and

$$\begin{aligned} |\xi _2|>8\log \lambda , \end{aligned}$$

(28)

there exists $c_1>0$ such that

$$\begin{aligned} |\xi _k|\ge \left( \sqrt{2}-1\right) 2^{\frac{k}{2}}c_1,~~k\ge 2. \end{aligned}$$

(29)

Proof

It follows from (26), the definition of $\theta $, and (27) that

$$\begin{aligned} \xi _{k+1}=\theta M_k-2M_{k-1}+2\log h_k(q_{k-1})=\theta \xi _k+2\log h_k(q_{k-1}). \end{aligned}$$

(30)

By (25), we know that

$$\begin{aligned} 0<\log h_k(q_{k-1})<\log (\gamma _k+(1-\gamma _k)\lambda )<\log \lambda . \end{aligned}$$

Since $|\theta |=\sqrt{2}$, we get by (30) that

$$\begin{aligned} |\xi _{k+1}|\ge \sqrt{2}\,|\xi _k|-c_1, \end{aligned}$$

where $c_1=2\log \lambda $. From (28), we have

$$\begin{aligned} |\xi _{k+1}|&\ge 2^{\frac{k-1}{2}}|\xi _2|-\frac{2^{\frac{k}{2}}-1}{\sqrt{2}-1}c_1 \ge \left( 2^{\frac{k+3}{2}}-\frac{2^{\frac{k}{2}}-1}{\sqrt{2}-1}\right) c_1\\&=\left[ \left( \sqrt{2}-1\right) \left( 2^{\frac{k}{2}}+1\right) +2\right] c_1 >(\sqrt{2}-1)2^{\frac{k}{2}}c_1. \end{aligned}$$

This completes the proof. $\square $

Since $|\theta -1|=\sqrt{2}$, we obtain by (27) that

$$\begin{aligned} |\xi _k|&\le |M_k|+|\theta -1||M_{k-1}|=|M_k|+\sqrt{2}|M_{k-1}|\\&\le (\sqrt{2}+1)\max \{|M_k|,|M_{k-1}|\}, \end{aligned}$$

which, together with (29), gives

$$\begin{aligned} \max \{|M_k|,|M_{k-1}|\}\ge \frac{1}{\sqrt{2}+1}\left( \sqrt{2}-1\right) 2^{\frac{k}{2}}c_1 =(\sqrt{2}-1)^22^{\frac{k}{2}}c_1. \end{aligned}$$

(31)

Lemma 2

Under the conditions of Lemma 1, for $k\ge 2$, the following inequalities hold:

$$\begin{aligned}&\max _{-1\le i\le 3}M_{k+i}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1, \end{aligned}$$

(32)

$$\begin{aligned}&\min _{-1\le i\le 3}M_{k+i}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1. \end{aligned}$$

(33)

Proof

The inequality (32) holds true if

$$\begin{aligned} M_{k-1}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1 \end{aligned}$$

or

$$\begin{aligned} M_{k}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1. \end{aligned}$$

Suppose that the above two inequalities are false. By (31), we know that either

$$\begin{aligned} M_{k-1}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1 \end{aligned}$$

or

$$\begin{aligned} M_{k}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1. \end{aligned}$$

From (26), we have

$$\begin{aligned} M_{k+2}&=M_{k+1}-2M_k+2\log h(q_k)\nonumber \\&=-M_k-2M_{k-1}+2\log h(q_k)+2\log h_k(q_{k-1}). \end{aligned}$$

(34)

(i) $M_{k-1}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1$. If $M_{k}<0$, it follows from (34) that

$$\begin{aligned} M_{k+2}\ge -2M_{k-1}+2\log h_k(q_{k-1})\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1. \end{aligned}$$

Otherwise, if $M_{k}\ge 0$, by (26), we get

$$\begin{aligned} M_{k+1}\ge -2M_{k-1}+2\log h_k(q_{k-1})\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1. \end{aligned}$$

(ii) $M_{k}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1$. Similar to (i), we can show that

$$\begin{aligned} M_{k+3}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1 \end{aligned}$$

or

$$\begin{aligned} M_{k+2}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1. \end{aligned}$$

Thus, the inequality (32) is valid. The inequality (33) can be established in a similar way. $\square $

Theorem 2

If $\gamma _k\in (0,1)$, (19) and (28) hold, the sequence $\{\Vert g_k\Vert \}$ converges to zero R-superlinearly.

Proof

From (22), we have

$$\begin{aligned} |g_{k+1}^{(2)}|&=\frac{q_{k-1}(\lambda -1) \left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})\right] }{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(2)}|\nonumber \\&\le \frac{q_{k-1}(\lambda -1)(\lambda ^2+q_{k-1})}{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(2)}|\nonumber \\&\le (\lambda -1)q_{k-1}|g_k^{(2)}|. \end{aligned}$$

(35)

Since $\alpha _k\in (\lambda ^{-1},1)$, we have

$$\begin{aligned} |g_{k+1}^{(i)}|\le (\lambda -1)|g_k^{(i)}|,~~i=1,2, \end{aligned}$$

(36)

which gives

$$\begin{aligned} |g_{k+5}^{(i)}|\le (\lambda -1)^{(5-j)}|g_{k+j}^{(i)}|,~~i=1,2,~j=1,\ldots ,5. \end{aligned}$$

(37)

It follows from (35), (36) and (37) that

$$\begin{aligned} |g_{k+5}^{(2)}|&\le (\lambda -1)^{(5-j+1)}q_{k+j-2}|g_{k+j-1}^{(2)}|\nonumber \\&\le (\lambda -1)^{5}q_{k+j-2}|g_{k}^{(2)}| ,~~i=1,2,~j=1,\ldots ,5, \end{aligned}$$

(38)

which indicates that

$$\begin{aligned} |g_{k+5}^{(2)}|\le (\lambda -1)^5\left( \min _{-1\le i\le 3}q_{k+i}\right) |g_k^{(2)}|. \end{aligned}$$

(39)

As $M_k=\log q_k$, we know by Lemma 2 and (39) that

$$\begin{aligned} |g_{k+5}^{(2)}|\le (\lambda -1)^5\exp \left( -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1\right) |g_k^{(2)}|. \end{aligned}$$

Similarly to (35), we have

$$\begin{aligned} |g_{k+1}^{(1)}|&=\frac{(\lambda -1) \left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)\lambda (\lambda +q_{k-1})\right] }{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(1)}|\\&\le \frac{\lambda (\lambda -1)(\lambda +q_{k-1})}{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(1)}|\\&<\lambda (\lambda -1)\frac{1}{q_{k-1}}|g_k^{(1)}|, \end{aligned}$$

which together with (37) yields that

$$\begin{aligned} |g_{k+5}^{(1)}|&\le \lambda (\lambda -1)^5\frac{1}{\max \limits _{-1\le i\le 3}q_{k+i}}|g_k^{(1)}|\nonumber \\&\le \lambda (\lambda -1)^5\exp \left( -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1\right) |g_k^{(1)}|. \end{aligned}$$

(40)

By (35) and (40), for any k, we have

$$\begin{aligned} \Vert g_{k+5}\Vert \le \lambda (\lambda -1)^5\exp \left( -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1\right) \Vert g_k\Vert . \end{aligned}$$

That is, $\{\Vert g_k\Vert \}$ converges to zero R-superlinearly. $\square $

Theorem 2, together with the analysis for the BB method in [2, 8], shows that, for any $\gamma _k\in [0,1]$, the family (15) is R-superlinearly convergent.

3.2 n-dimensional case

In this subsection, we show R-linear convergence of the family (15) for n-dimensional quadratics.

Dai [7] has proved that if A has the form (18) with $1=\lambda _1\le \lambda _2\le \cdots \le \lambda _n$ and the stepsizes of gradient method (2) have the following Property (A), then either $g_k=0$ for some finite k or the sequence $\{\Vert g_k\Vert \}$ converges to zero R-linearly.

Property (A)

[7] Suppose that there exist an integer m and positive constants $M_1\ge \lambda _1$ and $M_2$ such that

(i)
$\lambda _1\le \alpha _k^{-1}\le M_1$;
(ii)
for any integer $l\in [1,n-1]$ and $\epsilon >0$, if $G(k-j,l)\le \epsilon $ and $(g_{k-j}^{(l+1)})^2\ge M_2\epsilon $ hold for $j\in [0,\min \{k,m\}-1]$, then $\alpha _k^{-1}\ge \frac{2}{3}\lambda _{l+1}$.

Here,

$$\begin{aligned} G(k,l)=\sum _{i=1}^l(g_k^{(i)})^2. \end{aligned}$$

Now we show R-linear convergence of the proposed family by proving that the stepsize (15) satisfies Property (A).

Theorem 3

Suppose that the sequence $\{\Vert g_k\Vert \}$ is generated by the family (15) applied to n-dimensional quadratics with the matrix A has the form (18) and $1=\lambda _1\le \lambda _2\le \cdots \le \lambda _n$. Then either $g_k=0$ for some finite k or the sequence $\{\Vert g_k\Vert \}$ converges to zero R-linearly.

Proof

Let $M_1=\lambda _n$ and $M_2=2$. We have by (15) and the fact $\gamma _k\in [0,1]$ that

$$\begin{aligned} \alpha _{k}^{BB2}\le \alpha _k\le \alpha _{k}^{BB1}. \end{aligned}$$

Thus, (i) of Property (A) holds. If $G(k-j,l)\le \epsilon $ and $(g_{k-j}^{(l+1)})^2\ge M_2\epsilon $ hold for $j\in [0,\min \{k,m\}-1]$, we have

$$\begin{aligned} \alpha _{k}^{-1}&\ge \frac{1}{\alpha _{k}^{BB1}}=\frac{1}{\alpha _{k-1}^{SD}} =\frac{\sum _{i=1}^n\lambda _i(g^{(i)}_{k-1})^2}{\sum _{i=1}^n(g^{(i)}_{k-1})^2}\nonumber \\&\ge \frac{\lambda _{l+1}\sum _{i=l+1}^n(g^{(i)}_{k-1})^2}{\epsilon \Vert g_{k-1}\Vert ^2+\sum _{i=l+1}^n(g^{(i)}_{k-1})^2}\nonumber \\&\ge \frac{M_2}{M_2+1}\lambda _{l+1} =\frac{2}{3}\lambda _{l+1}. \end{aligned}$$

That is, (ii) of Property (A) holds. This completes the proof. $\square $

4 Selecting $\gamma _{k}$

In this section, we present three different selection rules for the parameter $\gamma _{k}$ of the family (15).

The simplest scheme for choosing $\gamma _{k}$ is to fix it for all iterations. For example, we can set $\gamma _{k}=0.1, 0.2$, etc. However, since the information carried by the two BB stepsizes changes as the iteration process going on such a fixed scheme may deteriorate the performance because it fixes the ratios of the long BB stepsize $\alpha _{k}^{BB1}$ and the short BB stepsize $\alpha _{k}^{BB2}$ contributed to the stepsize $\alpha _{k}$. Thus, it is better to vary $\gamma _{k}$ at each iteration.

One direct way for modifying $\gamma _{k}$ is, as the randomly relaxed Cauchy method [35], randomly choose it in the interval (0, 1). But this scheme determines the value of $\gamma _{k}$ without using any information at the current and former iterations.

The next strategy borrows the idea of cyclic gradient methods [7, 10, 12, 35], where a stepsize is reused for m iterations. Such a cyclic scheme is superior to its noncyclic counterpart in both theory and practice. Dai and Fletcher [10] showed that if the cyclic length m is greater than n / 2, the cyclic SD method is likely to be R-superlinearly convergent. Similar numerical convergence evidences were also observed for the CBB1 method in [12]. Motivated by those advantages of the cyclic scheme, for the family (15), we choose $\gamma _{k}$ such that the current stepsize approximates the former one as much as possible. That is,

$$\begin{aligned} \gamma _{k}=\arg \min _{\gamma \in [0,1]}~\left| \gamma \alpha _{k}^{BB1}+(1-\gamma )\alpha _{k}^{BB2}-\alpha _{k-1}\right| , \end{aligned}$$

which yields

$$\begin{aligned} \gamma _{k}^{C}=\min \left\{ 1,\max \left\{ 0,\frac{\alpha _{k-1}-\alpha _{k}^{BB2}}{\alpha _{k}^{BB1}-\alpha _{k}^{BB2}}\right\} \right\} . \end{aligned}$$

(41)

Clearly, $\gamma _{k}^{C}=0$ when $\alpha _{k-1}\le \alpha _{k}^{BB2}$; $\gamma _{k}^{C}=1$ when $\alpha _{k-1}\ge \alpha _{k}^{BB1}$. This gives the following stepsize:

$$\begin{aligned} \tilde{\alpha }_{k}=\left\{ \begin{array}{ll} \alpha _{k}^{BB2}, &{}\quad \hbox {if }\alpha _{k-1}\le \alpha _{k}^{BB2}; \\ \alpha _{k}^{BB1}, &{}\quad \hbox {if }\alpha _{k-1}\ge \alpha _{k}^{BB1};\\ \alpha _{k-1}, &{} \quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$

(42)

Recall that, for quadratics, $\alpha _{k}^{BB1}=\alpha _{k-1}^{SD}$ and $\alpha _{k}^{BB2}=\alpha _{k-1}^{MG}$, where

$$\begin{aligned} \alpha _{k}^{MG}=\arg \min _{\alpha \in \mathbb {R}} \Vert g(x_{k}-\alpha g_{k})\Vert =\frac{g_{k}^TAg_{k}}{g_{k}^TA^2g_{k}}, \end{aligned}$$

which is a short stepsize and satisfies $\alpha _{k}^{MG}\le \alpha _{k}^{SD}$. We refer the reader to [15] for additional details on $\alpha _{k}^{MG}$. It follows from (42) that $\tilde{\alpha }_{k}$ is adaptively selected and truncated by making use of the information of the former iteration. In particular, the stepsize is increased if the former one is too short (i.e., $\alpha _{k-1}\le \alpha _{k-1}^{MG}$) while it is decreased if the former one is too long (i.e., $\alpha _{k-1}\ge \alpha _{k-1}^{SD}$). Moreover, the former stepsize will be reused if it lies in $[\alpha _{k}^{BB2},\alpha _{k}^{BB1}]$. Thus, (42) is an adaptive truncated cyclic scheme and we will call it ATC for short.

As cyclic methods, we need to update the stepsize every m iterations to avoid using a stepsize for too many iterations. Many different stepsizes can be employed. Here, we suggest three candidates. The first is the long BB stepsize $\alpha _{k}^{BB1}$, i.e.,

$$\begin{aligned} \alpha _{k}^{ATC1}=\left\{ \begin{array}{ll} \alpha _{k}^{BB1}, &{} \quad \hbox {if }mod(k,m)=0;\\ \tilde{\alpha }_{k}, &{} \quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$

(43)

The second is the short BB stepsize $\alpha _{k}^{BB2}$, i.e.,

$$\begin{aligned} \alpha _{k}^{ATC2}=\left\{ \begin{array}{ll} \alpha _{k}^{BB2}, &{} \quad \hbox {if }mod(k,m)=0;\\ \tilde{\alpha }_{k}, &{}\quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$

(44)

The last is $\alpha _{k}^P$ given by (16), which is a special case of the family (15). That is,

$$\begin{aligned} \alpha _{k}^{ATC3}=\left\{ \begin{array}{ll} \alpha _{k}^P, &{}\quad \hbox {if }mod(k,m)=0;\\ \tilde{\alpha }_{k}, &{}\quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$

(45)

In what follows we shall refer (43), (44) and (45) to as ATC1, ATC2 and ATC3, respectively.

We tested the family (15) with fixed $\gamma _{k}$ on quadratic problems to see how the values of $\gamma _{k}$ affect the performance. In particular, $\gamma _{k}$ is set to 0.1, 0.3, 0.5, 0.7 and 0.9 for all k, respectively. The examples in [15, 24, 40] were employed, where $A=QVQ^T$ with

$$\begin{aligned} Q=(I-2w_3w_3^T)(I-2w_2w_2^T)(I-2w_1w_1^T), \end{aligned}$$

and $w_1$, $w_2$, and $w_3$ are unitary random vectors, $V=diag(v_1,\ldots ,v_n)$ is a diagonal matrix where $v_1=1$, $v_n=\kappa $, and $v_j$, $j=2,\ldots ,n-1$, are randomly generated between 1 and $\kappa $. We stopped the iteration if the number of iteration exceeds 20,000 or

$$\begin{aligned} \Vert g_k\Vert \le \epsilon \Vert g_1\Vert \end{aligned}$$

(46)

is satisfied for some given $\epsilon $.

Four values of the condition number $\kappa $: $10^3, 10^4, 10^5, 10^6$ as well as three values of $\epsilon $: $10^{-6}, 10^{-9}, 10^{-12}$ were used. For each value of $\kappa $ and $\epsilon $, 10 instances with $v_j$ evenly distributed in $[1,\kappa ]$ were generated. For each instance, the entries of b were randomly generated in $[-10,10]$ and the vector $e=(1,\ldots ,1)^T$ was used as the starting point.

The BB method was also run for comparison, i.e. $\gamma _{k}=1$. We compared the performance of the algorithms by the required number of iterations, as described in [21]. In other words, for each method, we plot the ratio of problems for which the method is within a factor $\rho $ of the least iterations. For the 100-dimensional case, we can see from Fig. 1 that the performance of the family improves as $\gamma _{k}$ becomes larger. However, for the 1000-dimensional case, Fig. 2 shows that the family (15) with $\gamma _{k}=0.7$ or 0.9 can outperform the BB method for some $\rho $ around 1.5. That is, for some problems, the long BB stepsize $\alpha _{k}^{BB1}$ may not be the best choice in the family.

Then, we applied ATC1, ATC2 and ATC3 to the above problem with $n=1000$ and different $v_j$ distributions. Particularly, two sets were generated: (1) $v_j$ are evenly distributed in $[1,\kappa ]$ as the former example; (2) 20% of $v_j$ are evenly distributed in [1, 100] and others are in $[\frac{\kappa }{2},\kappa ]$. We used three values of the condition number $\kappa =10^4, 10^5, 10^6$ and also three values of $\epsilon $ as the above problem. Other settings were same as the former example.

Table 1 Number of iterations of ATC1, ATC2 and ATC3 with different m on problems with different spectrum distributions

Full size table

We tested the three methods with different m. The average number of iterations are presented in Table 1. We can see that, for each $\kappa $ and m, ATC1 often outperforms ATC2 and ATC3. The performances of the three methods do not improve as m increases. For the first set of problems, ATC1 with $m=30$ performs better than other values. For the second set of problems, ATC1 with $m=8$ dominates others. Thus, in the next section we only run ATC1 using these settings.

We close this section by mentioning that there are many other different rules for computing the parameter $\gamma _{k}$. For example, as the alternate gradient method [7, 11], we can choose $\gamma _{k}$ to alternate short stepsizes and long stepsizes. In addition, we can also use sophisticated random schemes for $\gamma _{k}$, see [30] and references therein.

5 Numerical results

In this section, we present numerical results of the family (15) with different settings of the parameter $\gamma _k$. We compare the performance of the ATC1 method with the following methods: the family (15) with $\gamma _k$ randomly chose in (0, 1) (RAND), BB [2], ALBB [11], ABB [40], CBB1 [12], CBB2, CP, Dai-Yuan (DY) [16], ABBmin1 and ABBmin2 [23], SDC [17], and the family (15) with basic adaptive truncated cyclic scheme (42) (ATC). Since the SDC method performs better than its monotone counterpart for most problems, we only run SDC.

All methods were implemented in MATLAB (v.9.0-R2016a). All the runs were carried out on a PC with an Intel Core i7, 2.9 GHz processor and 8 GB of RAM running Windows 10 system. Moreover, we stopped the iteration if the number of iteration exceeds 20,000 or (46) is satisfied for some given $\epsilon $.

Firstly, we considered randomly generated problems in the former section with different $v_j$ distributions as shown in Table 2. The first two sets are same as the former section. For the third set, half of $v_j$ are in [1, 100] and the other half in $[\frac{\kappa }{2},\kappa ]$; for the fourth set, 80% of $v_j$ are evenly distributed in [1, 100] and others are in $[\frac{\kappa }{2},\kappa ]$. The fifth set has 20% of $v_j$ are in [1, 100], 20% of $v_j$ are in $[100,\frac{\kappa }{2}]$ and the others in $[\frac{\kappa }{2},\kappa ]$. The last two sets only has either 10 small $v_j$ or 10 large $v_j$. Other settings were also same as the former section.

For the three cyclic methods: CBB1, CBB2 and CP, the best m among $\{3,4,\ldots ,10\}$ was chosen. In particular, $m=3$ for CBB1 and $m=4$ for CBB2 and CP. As in [23], the parameters used by the ABBmin1 and ABBmin2 methods are set to $\tau =0.8$, $m=9$ and $\tau =0.9$, respectively. While $\tau =0.1$ was used for the ABB method which is better than 0.15 and 0.2 in our test. The pair (h, m) of the SDC method was set to (8, 6) which is better than other choices. As for our ATC1 method, we set $m=30$ for the first and fifth sets of problems and $m=8$ for other sets.

Table 2 Distributions of $v_j$

Full size table

It can be observed from Table 3 that, the rand scheme performs worse than the fixed scheme with $\gamma _k=1$, i.e., the BB method. The ATC method is competitive with the BB method and dominates the ALBB and CP methods for most test problems. The CBB2 method clearly outperforms the other two cyclic methods: CBB1 and CP. Moreover, the CBB2 method performs much better than the ATC and ABB methods except the first and fifth sets of problems. The CBB2 method is even faster than the DY method. Although the ABBmin2 method is the fastest one for solving the first and sixth sets of problems, it is worse than the ABBmin1, SDC and ATC1 methods for the other five sets of problems. Our ATC1 method is faster than the CBB2 method except the last set of problems. In addition, the ATC1 method is much better than the ABBmin1 and SDC methods on the first set of problems and comparable to them on the other sets of problems. Furthermore, for each tolerance, our ATC1 method is the fastest one in the sense of total number of iterations.

Table 3 Number of iterations of compared methods on problems in Table 2

Full size table

Then, the compared methods were applied to the non-rand quadratic minimization problem in [17] which is more difficult than its rand counterpart. In particular, A is diagonal whose elements are given by

$$\begin{aligned} A_{jj}=10^{\frac{ncond}{n-1}(n-j)}, ~~~j=1,\ldots ,n, \end{aligned}$$

(47)

where $ncond=\log _{10} \kappa $, and the vector b is null. We tested 10,000-dimensional problems with three different condition numbers: $\kappa =10^4, 10^5, 10^6$. The stopping condition (46) was employed with $\epsilon =10^{-6}, 10^{-9}, 10^{-12}$ for all methods. For each value of $\kappa $ or $\epsilon $, 10 different starting points with entries in $[-10,10]$ were randomly generated.

Due to the performances of these compared methods on the above problems, only fast methods were tested, i.e., ABB, CBB2, DY, ABBmin1, ABBmin2, SDC, ATC and ATC1. The numbers of iterations averaged over those starting points of each method are listed in Table 4. Here, the results of the SDC method were obtained with the best choice of parameter setting in [17], i.e., $h=30$ and $m=2$.

From Table 4 we can see that the ATC method is slightly better than the CBB2 method and comparable to the DY method for most problems. The ABB method performs better than the ABBmin1 method. Although the ABB method is slower than the ABBmin2 method when the tolerance is low, it wins if a tight tolerance is required. Our ATC1 method is competitive with other methods especially for a tight tolerance. Moreover, our ATC1 method takes least total number of iterations to meet the required tolerance.

Table 4 Number of iterations of compared methods on non-rand quadratic problems

Full size table

6 Conclusions

We have proposed a family of spectral gradient methods which calculates the stepsize by the convex combination of the long BB stepsize and the short BB stepsize, i.e., $\alpha _k=\gamma _k\alpha _k^{BB1}+(1-\gamma _k)\alpha _k^{BB2}$. Similar to the two BB stepsizes, each stepsize in the family possesses certain quasi-Newton property. In addition, R-superlinear and R-linear convergence of the family were established for two-dimensional and n-dimensional strictly convex quadratics, respectively. Furthermore, with different choices of the parameter $\gamma _k$, we obtain different stepsizes and gradient methods. The family provides us an alternative for the stepsize of gradient methods which can be easily extended to general problems by incorporating some line searches.

Since the parameter $\gamma _k$ affects the value of $\alpha _k$ and hence the efficiency of the method, it is interesting to investigate how to choose a proper $\gamma _k$ to achieve satisfactory performance. We have proposed and tested three different selection rules for $\gamma _k$, among which the adaptive truncated cyclic scheme with the long BB stepsize $\alpha _k^{BB1}$, i.e., the ATC1 method performs best. In addition, our ATC1 method is comparable to the state-of-the-art gradient methods including the Dai-Yuan, ABBmin1, ABBmin2 and SDC methods. One interesting question is how to design an efficient scheme to choose m adaptively for the proposed method. This will be our future work.

References

Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
Article MathSciNet MATH Google Scholar
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Article MathSciNet MATH Google Scholar
Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
Article MathSciNet MATH Google Scholar
Birgin, E.G., Martínez, J.M., Raydan, M., et al.: Spectral projected gradient methods: review and perspectives. J. Stat. Softw. 60(3), 539–559 (2014)
Article MATH Google Scholar
Broyden, C.G.: A class of methods for solving nonlinear simultaneous equations. Math. Comput. 19(92), 577–593 (1965)
Article MathSciNet MATH Google Scholar
Cauchy, A.: Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
Google Scholar
Dai, Y.H.: Alternate step gradient method. Optimization 52(4–5), 395–415 (2003)
Article MathSciNet MATH Google Scholar
Dai, Y.H.: A new analysis on the Barzilai–Borwein gradient method. J. Oper. Res. Soc. China 2(1), 187–198 (2013)
Article MATH Google Scholar
Dai, Y.H., Al-Baali, M., Yang, X.: A positive Barzilai–Borwein-like stepsize and an extension for symmetric linear systems. In: Numerical Analysis and Optimization, pp. 59–75. Springer (2015)
Dai, Y.H., Fletcher, R.: On the asymptotic behaviour of some new gradient methods. Math. Program. 103(3), 541–559 (2005)
Article MathSciNet MATH Google Scholar
Dai, Y.H., Fletcher, R.: Projected Barzilai–Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)
Article MathSciNet MATH Google Scholar
Dai, Y.H., Hager, W.W., Schittkowski, K., Zhang, H.: The cyclic Barzilai–Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)
Article MathSciNet MATH Google Scholar
Dai, Y.H., Kou, C.: A Barzilai–Borwein conjugate gradient method. Sci. China Math. 59, 1511–1524 (2016)
Article MathSciNet MATH Google Scholar
Dai, Y.H., Liao, L.Z.: $R$-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
Article MathSciNet MATH Google Scholar
Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
Article MathSciNet MATH Google Scholar
Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
MathSciNet MATH Google Scholar
De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comput. Optim. Appl. 59(3), 541–563 (2014)
Article MathSciNet MATH Google Scholar
De Asmundis, R., Di Serafino, D., Riccio, F., Toraldo, G.: On spectral properties of steepest descent methods. IMA J. Numer. Anal. 33(4), 1416–1435 (2013)
Article MathSciNet MATH Google Scholar
Dennis Jr., J.E., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)
Article MathSciNet MATH Google Scholar
Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
MathSciNet MATH Google Scholar
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
Article MathSciNet MATH Google Scholar
Fletcher, R.: On the Barzilai–Borwein method. In: Optimization and Control with Applications, pp. 235–256 (2005)
Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299 (2008)
MathSciNet MATH Google Scholar
Friedlander, A., Martínez, J.M., Molina, B., Raydan, M.: Gradient method with retards and generalizations. SIAM J. Numer. Anal. 36(1), 275–289 (1998)
Article MathSciNet MATH Google Scholar
Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comput. Optim. Appl. 63(2), 523–542 (2016)
Article MathSciNet MATH Google Scholar
Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton’s method. SIAM J. Numer. Anal. 23(4), 707–716 (1986)
Article MathSciNet MATH Google Scholar
Huang, Y., Liu, H.: Smoothing projected Barzilai–Borwein method for constrained non-Lipschitz optimization. Comput. Optim. Appl. 65(3), 671–698 (2016)
Article MathSciNet MATH Google Scholar
Huang, Y., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai–Borwein method for nonnegative matrix factorization. Data Min. Knowl. Discov. 29(6), 1665–1684 (2015)
Article MathSciNet MATH Google Scholar
Jiang, B., Dai, Y.H.: Feasible Barzilai–Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Methods Softw. 28(4), 756–784 (2013)
Article MathSciNet MATH Google Scholar
Kalousek, Z.: Steepest descent method with random step lengths. Found. Comput. Math. 17(2), 359–422 (2017)
Article MathSciNet MATH Google Scholar
Liu, Y.F., Dai, Y.H., Luo, Z.Q.: Coordinated beamforming for MISO interference channel: complexity analysis and efficient algorithms. IEEE Trans. Signal Process. 59(3), 1142–1157 (2011)
Article MathSciNet MATH Google Scholar
Nocedal, J., Sartenaer, A., Zhu, C.: On the behavior of the gradient norm in the steepest descent method. Comput. Optim. Appl. 22(1), 5–35 (2002)
Article MathSciNet MATH Google Scholar
Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Article MathSciNet MATH Google Scholar
Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Article MathSciNet MATH Google Scholar
Raydan, M., Svaiter, B.F.: Relaxed steepest descent and Cauchy–Barzilai–Borwein method. Comput. Optim. Appl. 21(2), 155–167 (2002)
Article MathSciNet MATH Google Scholar
Tan, C., Ma, S., Dai, Y.H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 685–693 (2016)
Wang, Y., Ma, S.: Projected Barzilai–Borwein method for large-scale nonnegative image restoration. Inverse Probl. Sci. Eng. 15(6), 559–583 (2007)
Article MathSciNet MATH Google Scholar
Wright, S.J., Nowak, R.D., Figueiredo, M.A.: Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)
Article MathSciNet MATH Google Scholar
Yuan, Y.X.: Step-sizes for the gradient method. AMS IP Stud. Adv. Math. 42(2), 785–796 (2008)
MathSciNet MATH Google Scholar
Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comput. Optim. Appl. 35(1), 69–86 (2006)
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

The authors are very grateful to the associate editor and the two anonymous referees whose suggestions and comments greatly improved the quality of this paper.

Author information

Authors and Affiliations

LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
Yu-Hong Dai
Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
Yu-Hong Dai
Institute of Mathematics, Hebei University of Technology, Tianjin, 300401, China
Yakui Huang & Xin-Wei Liu

Authors

Yu-Hong Dai
View author publications
You can also search for this author in PubMed Google Scholar
Yakui Huang
View author publications
You can also search for this author in PubMed Google Scholar
Xin-Wei Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yakui Huang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Natural Science Foundation of China (Grant Nos. 11631013, 11701137, 11671116) and the National 973 Program of China (Grant No. 2015CB856002).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dai, YH., Huang, Y. & Liu, XW. A family of spectral gradient methods for optimization. Comput Optim Appl 74, 43–65 (2019). https://doi.org/10.1007/s10589-019-00107-8

Download citation

Received: 18 May 2018
Published: 06 May 2019
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s10589-019-00107-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A family of spectral gradient methods for optimization

Abstract

Similar content being viewed by others

A family of optimal weighted conjugate-gradient-type methods for strictly convex quadratic minimization

Two spectral conjugate gradient methods for unconstrained optimization problems

A Modified Spectral Conjugate Gradient Method with Global Convergence

1 Introduction

2 Quasi-Newton property of the family (15)

Proposition 1

Proof

Proposition 2

Proof

Theorem 1

Proof

3 Convergence analysis

3.1 Two-dimensional case

Lemma 1

Proof

Lemma 2

Proof

Theorem 2

Proof

3.2 n-dimensional case

Property (A)

Theorem 3

Proof

4 Selecting \(\gamma _{k}\)

5 Numerical results

6 Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation