1 Introduction

Consider the unconstrained optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}~~f(x), \end{aligned}$$
(1)

where \(f(x): \mathbb {R}^n\rightarrow \mathbb {R}\) is a continuously differentiable function. The gradient method solves problem (1) by updating iterates as

$$\begin{aligned} x_{k+1}=x_k-\alpha _kg_k, \end{aligned}$$
(2)

where \(g_k=\nabla f(x_k)\) and \(\alpha _k>0\) is the stepsize. Different gradient methods use different formulae for stepsizes.

The simplest gradient method is the steepest descent (SD) method due to Cauchy [6], which computes the stepsize by exact line search,

$$\begin{aligned} \alpha _k^{SD}=\arg \min _{\alpha \in \mathbb {R}}~f(x_k-\alpha g_k). \end{aligned}$$

As is well known, every two consecutive gradients generated by the SD method are perpendicular to each other. Moreover, if f(x) is a strictly convex quadratic function, i.e.,

$$\begin{aligned} f(x)=\frac{1}{2}x^TAx-b^Tx, \end{aligned}$$
(3)

where \(A\in \mathbb {R}^{n\times n}\) is symmetric positive definite and \(b\in \mathbb {R}^n\), it can be shown that the gradients asymptotically reduce to the two-dimensional subspace spanned by the eigenvectors corresponding to the largest and smallest eigenvalues of the matrix A, and hence the iterates zigzag; see [1, 32] for more details. This property seriously deteriorates the performance of the SD method, especially when the condition number of A is large.
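
To make the above behaviour concrete, the following minimal Python sketch runs the SD method on an illustrative two-dimensional instance of (3); the Hessian, starting point and tolerance are chosen purely for illustration and are not among the test problems of Sect. 5. For quadratics, the exact line search stepsize has the closed form \(\alpha _k^{SD}=g_k^Tg_k/(g_k^TAg_k)\).

```python
import numpy as np

# Illustrative SD run on the quadratic (3); not the test setting of Sect. 5.
A = np.diag([1.0, 100.0])            # condition number 100
b = np.zeros(2)
x = np.array([100.0, 1.0])           # arbitrary starting point

for k in range(1, 20001):
    g = A @ x - b                    # gradient of (3)
    if np.linalg.norm(g) <= 1e-8:
        break
    alpha = (g @ g) / (g @ (A @ g))  # exact line search stepsize alpha_k^{SD}
    x = x - alpha * g                # consecutive gradients are orthogonal

print(k)  # the iteration count grows with the condition number of A
```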

An important approach that changed our perspective on the effectiveness of gradient methods was proposed by Barzilai and Borwein [2]. They viewed the updating rule (2) as

$$\begin{aligned} x_{k+1}=x_k-D_kg_k, \end{aligned}$$
(4)

where \(D_k=\alpha _kI\). Similar to quasi-Newton methods [19], \(D_k^{-1}\) is required to play the role of the Hessian approximation \(B_k\) in the secant equation

$$\begin{aligned} B_ks_{k-1}=y_{k-1} \end{aligned}$$
(5)

so as to approximate the Hessian as well as possible. Here, \(s_{k-1}=x_k-x_{k-1}\) and \(y_{k-1}=g_k-g_{k-1}\). However, since \(D_k\) is diagonal with identical diagonal elements, it is usually impossible to find an \(\alpha _k\) such that \(D_k^{-1}\) fulfills (5) when the dimension \(n>1\). Thus, Barzilai and Borwein required \(D_k^{-1}\) to satisfy the secant equation in the least squares sense,

$$\begin{aligned} D_k=\arg \min _{D=\alpha I}~\Vert D^{-1}s_{k-1}-y_{k-1}\Vert , \end{aligned}$$
(6)

which yields

$$\begin{aligned} \alpha _k^{BB1}=\frac{s_{k-1}^Ts_{k-1}}{s_{k-1}^Ty_{k-1}}. \end{aligned}$$
(7)

Here and below, \(\Vert \cdot \Vert \) denotes the Euclidean norm. On the other hand, one can also calculate the stepsize by requiring \(D_k\) to approximate the inverse of the Hessian, that is, to play the role of \(H_k\) in the secant equation

$$\begin{aligned} H_ky_{k-1}=s_{k-1}. \end{aligned}$$
(8)

That is,

$$\begin{aligned} D_k=\arg \min _{D=\alpha I}~\Vert s_{k-1}-Dy_{k-1}\Vert , \end{aligned}$$
(9)

which gives

$$\begin{aligned} \alpha _k^{BB2}=\frac{s_{k-1}^Ty_{k-1}}{y_{k-1}^Ty_{k-1}}. \end{aligned}$$
(10)

When \(s_{k-1}^Ty_{k-1}>0\), the Cauchy-Schwarz inequality gives \(\alpha _{k}^{BB1}\ge \alpha _{k}^{BB2}\). In other words, \(\alpha _{k}^{BB1}\) is a long stepsize while \(\alpha _{k}^{BB2}\) is a short one, so \(\alpha _{k}^{BB1}\) is more aggressive than \(\alpha _{k}^{BB2}\) in decreasing the objective value. Extensive numerical experiments show that the long stepsize is superior to the short one in many cases; see [4, 22, 34], for example. In what follows, unless otherwise specified, we refer to the gradient method with the long stepsize \(\alpha _{k}^{BB1}\) as the BB method.
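
For later reference, both BB stepsizes are computed from the last displacement and gradient difference only; below, s and y stand for \(s_{k-1}\) and \(y_{k-1}\), respectively. A minimal Python sketch (the helper name is ours, for illustration):

```python
import numpy as np

def bb_stepsizes(s, y):
    """Return (alpha_BB1, alpha_BB2) from (7) and (10).

    Assumes s^T y > 0, which holds, e.g., for strictly convex quadratics.
    """
    sty = s @ y
    alpha_bb1 = (s @ s) / sty        # long BB stepsize (7)
    alpha_bb2 = sty / (y @ y)        # short BB stepsize (10)
    return alpha_bb1, alpha_bb2
```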

Barzilai and Borwein [2] proved that their method with the short BB stepsize \(\alpha _{k}^{BB2}\) is R-superlinearly convergent for two-dimensional strictly convex quadratics. An improved R-superlinear convergence result for the BB method was given by Dai [8]. Global and R-linear convergence of the BB method for general n-dimensional strictly convex quadratics were established by Raydan [33] and Dai and Liao [14], respectively. The BB method has also been extended to solve general nonlinear optimization problems. By incorporating the nonmonotone line search proposed by Grippo et al. [26], Raydan [34] developed the global BB method for general unconstrained problems. Later, Birgin et al. [3] proposed the so-called spectral projected gradient method, which extends Raydan's method to smooth convex constrained problems. Dai and Fletcher [11] designed projected BB methods for large-scale box-constrained quadratic programming. Recently, by resorting to smoothing techniques, Huang and Liu [27] generalized the projected BB method with modifications to solve non-Lipschitz optimization problems.

The relationship between the stepsizes of BB-like methods and the spectrum of the Hessian of the objective function has been explored in several studies. Frassoldati et al. [23] tried to exploit the long BB stepsize close to the reciprocal of the smallest eigenvalue of the Hessian, yielding the ABBmin1 and ABBmin2 methods. De Asmundis et al. [18] developed the so-called SDA method, which employs a short stepsize that approximates the reciprocal of the largest eigenvalue of the Hessian. Following the line of [18], Gonzaga and Schneider [25] suggested a monotone method for quadratics whose stepsizes are obtained in a way similar to the SD method. De Asmundis et al. [17] proposed the SDC method, which exploits the spectral property of Yuan's stepsize [16]. Kalousek [30] considered the SD method with random stepsizes lying between the reciprocals of the largest and smallest eigenvalues of the Hessian and analysed the optimal distribution of random stepsizes that guarantees the maximum asymptotic convergence rate.

The BB method and its variants have been widely applied to problems arising in various areas, including image restoration [37], signal processing [31], eigenvalue problems [29], nonnegative matrix factorization [28], sparse reconstruction [38], machine learning [36], etc. We refer the reader to [4, 7, 13, 20, 22, 39] and references therein for more spectral gradient methods and extensions.

The success of the BB method and its variants motivates us to present a family of spectral gradient methods for optimization. Notice that the Broyden class of quasi-Newton methods [5] approximates the inverse of the Hessian by

$$\begin{aligned} H_{k}^\tau =\tau H_k^{BFGS} + (1-\tau )H_k^{DFP}, \end{aligned}$$
(11)

where \(\tau \in [0,1]\) is a scalar parameter and \(H_k^{BFGS}\) and \(H_k^{DFP}\) are the BFGS and DFP matrices, respectively, that satisfy the secant equation (8), which further implies that

$$\begin{aligned} \tau H_k^{BFGS}y_{k-1} + (1-\tau )H_k^{DFP}y_{k-1}=s_{k-1}, \end{aligned}$$

i.e.,

$$\begin{aligned} \tau (H_k^{BFGS}y_{k-1}-s_{k-1}) + (1-\tau )(H_k^{DFP}y_{k-1}-s_{k-1})=0. \end{aligned}$$
(12)

Since the inverse of \(H_k^{BFGS}\), say \(B_k^{BFGS}\), satisfies (5), we can modify (12) as

$$\begin{aligned} \tau (B_k^{BFGS}s_{k-1}-y_{k-1}) + (1-\tau )(s_{k-1}-H_k^{DFP}y_{k-1})=0. \end{aligned}$$
(13)

Motivated by the above observation, we employ the idea of the BB method and approximate the Hessian and its inverse by diagonal matrices. In particular, we require the matrix \(D=\alpha I\) to be a solution of

$$\begin{aligned} \min _{D=\alpha I}~\Vert \tau (D^{-1}s_{k-1}-y_{k-1})+(1-\tau )(s_{k-1}-D y_{k-1})\Vert . \end{aligned}$$
(14)

In the next section, we will show that the stepsize given by the convex combination of the long BB stepsize \(\alpha _k^{BB1}\) and the short BB stepsize \(\alpha _k^{BB2}\), i.e.,

$$\begin{aligned} \alpha _k=\gamma _k \alpha _{k}^{BB1}+(1-\gamma _k)\alpha _{k}^{BB2}, \end{aligned}$$
(15)

where \(\gamma _k\in [0,1]\), is a solution to (14). Clearly, this is a one-parameter family of stepsizes, which includes the two BB stepsizes as special instances. Moreover, any stepsize lying in the interval \([\alpha _{k}^{BB2},\alpha _{k}^{BB1}]\) is a member of the family. One example is the positive stepsize given by the geometric mean of \(\alpha _{k}^{BB1}\) and \(\alpha _{k}^{BB2}\) [9],

$$\begin{aligned} \alpha _k^{P}=\sqrt{\alpha _{k}^{BB1}\alpha _{k}^{BB2}}=\frac{\Vert s_{k-1}\Vert }{\Vert y_{k-1}\Vert }. \end{aligned}$$
(16)
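
In code, the whole family reduces to one line on top of the two BB stepsizes. The sketch below (function name ours, for illustration) also indicates how the geometric-mean stepsize (16) arises from a particular choice of \(\gamma _k\).

```python
import numpy as np

def family_stepsize(s, y, gamma):
    """Stepsize (15): convex combination of the two BB stepsizes (gamma in [0, 1])."""
    sty = s @ y                      # assumed positive
    bb1 = (s @ s) / sty              # alpha_BB1
    bb2 = sty / (y @ y)              # alpha_BB2
    return gamma * bb1 + (1.0 - gamma) * bb2

# The geometric-mean stepsize (16) corresponds to
# gamma = (alpha_P - alpha_BB2) / (alpha_BB1 - alpha_BB2) whenever alpha_BB1 > alpha_BB2.
```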

We further prove that the family of spectral gradient methods (15) is R-superlinearly convergent for two-dimensional strictly convex quadratics. For the n-dimensional case, the family is proved to be R-linearly convergent. Numerical results of the family (15) with different settings of \(\gamma _k\) are presented and compared with other gradient methods, including the BB method [2], the alternate BB method (ALBB) [11], the adaptive BB method (ABB) [40], the cyclic BB method with stepsize \(\alpha _k^{BB1}\) (CBB1) [12], the cyclic BB method with stepsize \(\alpha _k^{BB2}\) (CBB2), the cyclic method with stepsize \(\alpha _k^{P}\) (CP), the Dai-Yuan method (DY) [16], the ABBmin1 and ABBmin2 methods [23], and the SDC method [17]. The comparisons demonstrate that the proposed family is promising.

The paper is organized as follows. In Sect. 2, we show that each stepsize in the family (15) solves some least squares problem (14) and hence possesses certain quasi-Newton property. In Sect. 3, we establish R-superlinear convergence of the family (15) for two-dimensional strictly convex quadratics and R-linear convergence for the n-dimensional case, respectively. In Sect. 4, we discuss different selection rules for the parameter \(\gamma _k\). In Sect. 5, we conduct some numerical comparisons of our approach and other gradient methods. Finally, some conclusions are drawn in Sect. 6.

2 Quasi-Newton property of the family (15)

In this section, we show that each stepsize in the family (15) enjoys certain quasi-Newton property.

For the sake of simplicity, we discard the subscript of \(s_{k-1}\) and \(y_{k-1}\) in the following part of this section, i.e., \(s=s_{k-1}\), \(y=y_{k-1}\). Let

$$\begin{aligned} \phi _\tau (\alpha ):=\left\| \tau \left( \frac{1}{\alpha }s-y\right) +(1-\tau )(s-\alpha y)\right\| ^2. \end{aligned}$$

Then, the derivative of \(\phi _\tau (\alpha )\) with respect to \(\alpha \) is

$$\begin{aligned} \phi _\tau '(\alpha )&=2[\tau +(1-\tau )\alpha ]\left\{ \left( -\frac{\tau }{\alpha ^3}\right) s^Ts -\left[ (1-\tau )\frac{1}{\alpha } +\left( -\frac{\tau }{\alpha ^2}\right) \right] s^Ty+(1-\tau )y^Ty\right\} . \end{aligned}$$

Proposition 1

If \(s^Ty>0\) and \(\tau \in [0,1]\), the equation \(\phi '_\tau (\alpha )=0\) has a unique root in \([\alpha _k^{BB2},\alpha _k^{BB1}]\).

Proof

We only need to consider the case \(\tau \in (0,1)\). Notice that

$$\begin{aligned} \psi (\tau ,\alpha ):&=\frac{1}{2}\frac{\alpha ^3}{\tau +(1-\tau )\alpha }\phi _\tau '(\alpha ) \nonumber \\&=-\tau s^Ts-[(1-\tau )\alpha ^2-\tau \alpha ]s^Ty+(1-\tau )\alpha ^3y^Ty\nonumber \\&=(1-\tau )(\alpha ^3y^Ty-\alpha ^2s^Ty)+\tau (\alpha s^Ty-s^Ts)\nonumber \\&=(1-\tau )y^Ty(\alpha ^3-\alpha ^2\alpha _k^{BB2})+\tau s^Ty(\alpha -\alpha _k^{BB1}). \end{aligned}$$
(17)

If \(s^Ty>0\), we have \(y\ne 0\) and \(y^Ty>0\). This implies that \(\psi (\tau ,\alpha _k^{BB2})<0\) and \(\psi (\tau ,\alpha _k^{BB1})>0\). Thus, \(\psi (\tau ,\alpha )=0\) has a root in \((\alpha _k^{BB2},\alpha _k^{BB1})\). Since the factor \(\frac{\alpha ^3}{\tau +(1-\tau )\alpha }\) is positive for \(\alpha >0\), the equation \(\phi '_\tau (\alpha )=0\) has a root in \([\alpha _k^{BB2},\alpha _k^{BB1}]\).

Now we show the uniqueness of such a root by contradiction. Suppose that \(\alpha _1<\alpha _2\) and \(\alpha _1,\alpha _2\in [\alpha _k^{BB2},\alpha _k^{BB1}]\) such that \(\phi '_\tau (\alpha _1)=0\) and \(\phi '_\tau (\alpha _2)=0\). It follows from (17) that

$$\begin{aligned}&(1-\tau )y^Ty(\alpha _1^3-\alpha _1^2\alpha _k^{BB2})+\tau s^Ty(\alpha _1-\alpha _k^{BB1}) \\&\quad =(1-\tau )y^Ty(\alpha _2^3-\alpha _2^2\alpha _k^{BB2})+\tau s^Ty(\alpha _2-\alpha _k^{BB1}). \end{aligned}$$

By rearranging terms, we obtain

$$\begin{aligned}&(1-\tau )y^Ty[(\alpha _1-\alpha _2)(\alpha _1^2+\alpha _1\alpha _2+\alpha _2^2) -(\alpha _1-\alpha _2)(\alpha _1+\alpha _2)\alpha _k^{BB2}]\\&\quad =\tau s^Ty(\alpha _2-\alpha _1). \end{aligned}$$

Since \(\alpha _1\ne \alpha _2\), it follows that

$$\begin{aligned} (1-\tau )y^Ty[(\alpha _1^2+\alpha _1\alpha _2+\alpha _2^2) -(\alpha _1+\alpha _2)\alpha _k^{BB2}]=-\tau s^Ty, \end{aligned}$$

which gives \((\alpha _1^2+\alpha _1\alpha _2+\alpha _2^2) -(\alpha _1+\alpha _2)\alpha _k^{BB2}<0\). This is not possible since \(\alpha _k^{BB2}\le \alpha _1<\alpha _2\). This completes the proof. \(\square \)

Proposition 2

If \(s^Ty>0\) and \(\tau \in [0,1]\), the root of \(\phi '_\tau (\alpha )=0\) in \([\alpha _k^{BB2},\alpha _k^{BB1}]\) is monotonically increasing with respect to \(\tau \).

Proof

It suffices to show that the statement holds for \(\tau \in (0,1)\). By the proof of Proposition 1, \(\alpha \) is an implicit function of \(\tau \) determined by the equation \(\psi (\tau ,\alpha )=0\). Differentiating \(\psi (\tau ,\alpha (\tau ))=0\) with respect to \(\tau \) gives

$$\begin{aligned}&-y^Ty(\alpha ^3-\alpha ^2\alpha _k^{BB2}) + (1-\tau )y^Ty(3\alpha ^2-2\alpha _k^{BB2}\alpha )\alpha '\\&\quad +\, s^Ty(\alpha -\alpha _k^{BB1})+ \tau s^Ty\alpha '=0. \end{aligned}$$

Solving for \(\alpha '\), we obtain

$$\begin{aligned} \alpha '=\frac{y^Ty(\alpha ^3-\alpha ^2\alpha _k^{BB2})-s^Ty(\alpha -\alpha _k^{BB1})}{(1-\tau )y^Ty(3\alpha ^2-2\alpha _k^{BB2}\alpha )+ \tau s^Ty}. \end{aligned}$$

For \(\alpha \in (\alpha _k^{BB2},\alpha _k^{BB1})\), both the numerator and the denominator above are positive, so \(\alpha '>0\). This completes the proof. \(\square \)

Theorem 1

For each \(\gamma _k\in [0,1]\), the stepsize \(\alpha _k\) defined by (15) is a solution of (14).

Proof

We only need to show that, for \(\gamma _k\in (0,1)\), \(\phi '_{\tau }(\alpha _k)\) vanishes at some \(\tilde{\tau }\in (0,1)\). From (17) and (15), we have

$$\begin{aligned} \psi (\tau ,\alpha _k)&=(1-\tau )\alpha _k^2y^Ty(\alpha _k-\alpha _k^{BB2})+\tau s^Ty(\alpha _k-\alpha _k^{BB1})\\&=(1-\tau )\gamma _k\alpha _k^2y^Ty(\alpha _k^{BB1}-\alpha _k^{BB2})+\tau (\gamma _k-1) s^Ty(\alpha _k^{BB1}-\alpha _k^{BB2})\\&=y^Ty[(1-\tau )\gamma _k\alpha _k^2+\tau (\gamma _k-1)\alpha _k^{BB2}](\alpha _k^{BB1}-\alpha _k^{BB2}). \end{aligned}$$

Clearly,

$$\begin{aligned} \tilde{\tau }=\frac{\gamma _k\alpha _k^2}{\gamma _k\alpha _k^2+(1-\gamma _k)\alpha _k^{BB2}}\in (0,1) \end{aligned}$$

is a root of \(\psi (\tau ,\alpha _k)=0\). This completes the proof. \(\square \)
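
The quasi-Newton property of Theorem 1 can be checked numerically: for a given \(\gamma _k\), the stepsize (15) is recovered as the minimizer of \(\phi _{\tilde{\tau }}\) over \([\alpha _k^{BB2},\alpha _k^{BB1}]\). The sketch below (with arbitrary random data, for illustration only) compares a grid minimizer with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(5)
y = s + 0.1 * rng.standard_normal(5)          # small perturbation keeps s^T y > 0 here
assert s @ y > 0

bb1 = (s @ s) / (s @ y)                       # alpha_BB1
bb2 = (s @ y) / (y @ y)                       # alpha_BB2
gamma = 0.3
alpha_k = gamma * bb1 + (1.0 - gamma) * bb2   # stepsize (15)
tau = gamma * alpha_k**2 / (gamma * alpha_k**2 + (1.0 - gamma) * bb2)  # tilde{tau}

def phi(a):
    """Squared residual in (14) for tau = tilde{tau}."""
    return np.linalg.norm(tau * (s / a - y) + (1.0 - tau) * (s - a * y)) ** 2

grid = np.linspace(bb2, bb1, 20001)
best = grid[int(np.argmin([phi(a) for a in grid]))]
print(best - alpha_k)                         # close to zero (up to the grid spacing)
```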

3 Convergence analysis

In this section, we analyze the convergence properties of the family (15) for the quadratic function (3). Since the gradient method (2) is invariant under translations and rotations when applied to problem (3), we assume that the matrix A is diagonal, i.e.,

$$\begin{aligned} A=\text {diag}\{\lambda _1,\lambda _2,\cdots ,\lambda _n\}, \end{aligned}$$
(18)

where \(0<\lambda _1\le \lambda _2\le \cdots \le \lambda _n\).

3.1 Two-dimensional case

In this subsection, based on the techniques in [9], we establish the R-superlinear convergence of the family (15) for two-dimensional quadratic functions.

Without loss of generality, we assume that

$$\begin{aligned} A=\left( \begin{array}{cc} 1 &{} ~0 \\ 0 &{} ~\lambda \\ \end{array} \right) ,~~b=0, \end{aligned}$$

where \(\lambda \ge 1\). Furthermore, assume that \(x_1\) and \(x_2\) are such that

$$\begin{aligned} g_1^{(i)}\ne 0,~~g_2^{(i)}\ne 0,~~i=1,2. \end{aligned}$$
(19)

Let

$$\begin{aligned} q_k=\frac{(g_k^{(1)})^2}{(g_k^{(2)})^2}. \end{aligned}$$

Then it follows that

$$\begin{aligned}&\alpha _{k}^{BB1} =\frac{g_{k-1}^Tg_{k-1}}{g_{k-1}^TAg_{k-1}} =\frac{1+q_{k-1}}{\lambda +q_{k-1}},\\&\alpha _{k}^{BB2} =\frac{g_{k-1}^TAg_{k-1}}{g_{k-1}^TA^2g_{k-1}} =\frac{\lambda +q_{k-1}}{\lambda ^2+q_{k-1}}. \end{aligned}$$

Thus, the stepsize (15) can be written as

$$\begin{aligned} \alpha _k&=\gamma _k\frac{1+q_{k-1}}{\lambda +q_{k-1}}+ (1-\gamma _k)\frac{\lambda +q_{k-1}}{\lambda ^2+q_{k-1}}\nonumber \\&=\frac{\gamma _k(1+q_{k-1})(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})^2}{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})}. \end{aligned}$$
(20)

By the update rule (2) and \(g_k=Ax_k\), we have

$$\begin{aligned} g_{k+1}=(I-\alpha _kA)g_k. \end{aligned}$$

Thus,

$$\begin{aligned} (g_{k+1}^{(1)})^2&=(1-\alpha _k)^2(g_k^{(1)})^2\nonumber \\&=\frac{(\lambda -1)^2\left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)\lambda (\lambda +q_{k-1})\right] ^2}{(\lambda +q_{k-1})^2(\lambda ^2+q_{k-1})^2} (g_k^{(1)})^2, \end{aligned}$$
(21)
$$\begin{aligned} (g_{k+1}^{(2)})^2&=(1-\lambda \alpha _k)^2(g_k^{(2)})^2\nonumber \\&=\frac{q_{k-1}^2(1-\lambda )^2 \left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})\right] ^2}{(\lambda +q_{k-1})^2(\lambda ^2+q_{k-1})^2} (g_k^{(2)})^2. \end{aligned}$$
(22)

From (21), (22) and the definition of \(q_k\), we get

$$\begin{aligned} q_{k+1}=\frac{(g_{k+1}^{(1)})^2}{(g_{k+1}^{(2)})^2} =\left( \frac{\gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)\lambda (\lambda +q_{k-1})}{\gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})}\right) ^2\frac{q_k}{q_{k-1}^2}. \end{aligned}$$
(23)

Let

$$\begin{aligned} h_k(w)=\frac{\gamma _k(\lambda ^2+w)+ (1-\gamma _k)\lambda (\lambda +w)}{\gamma _k(\lambda ^2+w)+ (1-\gamma _k)(\lambda +w)}. \end{aligned}$$

Then we have

$$\begin{aligned}&h_k(0)=\frac{\gamma _k\lambda ^2+ (1-\gamma _k)\lambda ^2}{\gamma _k\lambda ^2+ (1-\gamma _k)\lambda }=\frac{\lambda }{\gamma _k\lambda +1-\gamma _k},\\&\lim _{w\rightarrow \infty }h_k(w)=\frac{\gamma _k+ (1-\gamma _k)\lambda }{\gamma _k+ 1-\gamma _k}=\gamma _k+(1-\gamma _k)\lambda . \end{aligned}$$

Since

$$\begin{aligned} h'_k(w)= \frac{\gamma _k(1-\gamma _k)\lambda (\lambda -1)^2}{(\gamma _k(\lambda ^2+w)+ (1-\gamma _k)(\lambda +w))^2}, \end{aligned}$$
(24)

we obtain \(h'_k(w)>0\) for \(\gamma _k\in (0,1)\) and \(\lambda >1\). Thus, for \(w>0\),

$$\begin{aligned} h_k(w)\in \left( \frac{\lambda }{\gamma _k\lambda +1-\gamma _k},\gamma _k+(1-\gamma _k)\lambda \right) . \end{aligned}$$
(25)

Denote \(M_k=\log q_k\). By (23), we have

$$\begin{aligned} M_{k+1}=M_k-2M_{k-1}+2\log h_k(q_{k-1}). \end{aligned}$$
(26)

Let \(\theta \) be a root of \(\theta ^2-\theta +2=0\), that is, \(\theta =\frac{1\pm \sqrt{7}i}{2}\). Define

$$\begin{aligned} \xi _k=M_k+(\theta -1)M_{k-1}. \end{aligned}$$
(27)

We have the following result.

Lemma 1

If \(\gamma _k\in (0,1)\) and

$$\begin{aligned} |\xi _2|>8\log \lambda , \end{aligned}$$
(28)

there exists \(c_1>0\) such that

$$\begin{aligned} |\xi _k|\ge \left( \sqrt{2}-1\right) 2^{\frac{k}{2}}c_1,~~k\ge 2. \end{aligned}$$
(29)

Proof

It follows from (26), the definition of \(\theta \), and (27) that

$$\begin{aligned} \xi _{k+1}=\theta M_k-2M_{k-1}+2\log h_k(q_{k-1})=\theta \xi _k+2\log h_k(q_{k-1}). \end{aligned}$$
(30)

By (25), we know that

$$\begin{aligned} 0<\log h_k(q_{k-1})<\log (\gamma _k+(1-\gamma _k)\lambda )<\log \lambda . \end{aligned}$$

Since \(|\theta |=\sqrt{2}\), we get by (30) that

$$\begin{aligned} |\xi _{k+1}|\ge \sqrt{2}\,|\xi _k|-c_1, \end{aligned}$$

where \(c_1=2\log \lambda \). From (28), we have

$$\begin{aligned} |\xi _{k+1}|&\ge 2^{\frac{k-1}{2}}|\xi _2|-\frac{2^{\frac{k}{2}}-1}{\sqrt{2}-1}c_1 \ge \left( 2^{\frac{k+3}{2}}-\frac{2^{\frac{k}{2}}-1}{\sqrt{2}-1}\right) c_1\\&=\left[ \left( \sqrt{2}-1\right) \left( 2^{\frac{k}{2}}+1\right) +2\right] c_1 >(\sqrt{2}-1)2^{\frac{k}{2}}c_1. \end{aligned}$$

This completes the proof. \(\square \)

Since \(|\theta -1|=\sqrt{2}\), we obtain by (27) that

$$\begin{aligned} |\xi _k|&\le |M_k|+|\theta -1||M_{k-1}|=|M_k|+\sqrt{2}|M_{k-1}|\\&\le (\sqrt{2}+1)\max \{|M_k|,|M_{k-1}|\}, \end{aligned}$$

which, together with (29), gives

$$\begin{aligned} \max \{|M_k|,|M_{k-1}|\}\ge \frac{1}{\sqrt{2}+1}\left( \sqrt{2}-1\right) 2^{\frac{k}{2}}c_1 =(\sqrt{2}-1)^22^{\frac{k}{2}}c_1. \end{aligned}$$
(31)

Lemma 2

Under the conditions of Lemma 1, for \(k\ge 2\), the following inequalities hold:

$$\begin{aligned}&\max _{-1\le i\le 3}M_{k+i}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1, \end{aligned}$$
(32)
$$\begin{aligned}&\min _{-1\le i\le 3}M_{k+i}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1. \end{aligned}$$
(33)

Proof

The inequality (32) holds true if

$$\begin{aligned} M_{k-1}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1 \end{aligned}$$

or

$$\begin{aligned} M_{k}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1. \end{aligned}$$

Suppose that the above two inequalities are false. By (31), we know that either

$$\begin{aligned} M_{k-1}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1 \end{aligned}$$

or

$$\begin{aligned} M_{k}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1. \end{aligned}$$

From (26), we have

$$\begin{aligned} M_{k+2}&=M_{k+1}-2M_k+2\log h_{k+1}(q_k)\nonumber \\&=-M_k-2M_{k-1}+2\log h_{k+1}(q_k)+2\log h_k(q_{k-1}). \end{aligned}$$
(34)

(i) \(M_{k-1}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1\). If \(M_{k}<0\), it follows from (34) that

$$\begin{aligned} M_{k+2}\ge -2M_{k-1}+2\log h_k(q_{k-1})\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1. \end{aligned}$$

Otherwise, if \(M_{k}\ge 0\), by (26), we get

$$\begin{aligned} M_{k+1}\ge -2M_{k-1}+2\log h_k(q_{k-1})\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1. \end{aligned}$$

(ii) \(M_{k}\le -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1\). Similar to (i), we can show that

$$\begin{aligned} M_{k+3}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1 \end{aligned}$$

or

$$\begin{aligned} M_{k+2}\ge (\sqrt{2}-1)^22^{\frac{k}{2}}c_1-2c_1. \end{aligned}$$

Thus, the inequality (32) is valid. The inequality (33) can be established in a similar way. \(\square \)

Theorem 2

If \(\gamma _k\in (0,1)\) and both (19) and (28) hold, then the sequence \(\{\Vert g_k\Vert \}\) converges to zero R-superlinearly.

Proof

From (22), we have

$$\begin{aligned} |g_{k+1}^{(2)}|&=\frac{q_{k-1}(\lambda -1) \left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)(\lambda +q_{k-1})\right] }{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(2)}|\nonumber \\&\le \frac{q_{k-1}(\lambda -1)(\lambda ^2+q_{k-1})}{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(2)}|\nonumber \\&\le (\lambda -1)q_{k-1}|g_k^{(2)}|. \end{aligned}$$
(35)

Since \(\alpha _k\in (\lambda ^{-1},1)\), we have

$$\begin{aligned} |g_{k+1}^{(i)}|\le (\lambda -1)|g_k^{(i)}|,~~i=1,2, \end{aligned}$$
(36)

which gives

$$\begin{aligned} |g_{k+5}^{(i)}|\le (\lambda -1)^{(5-j)}|g_{k+j}^{(i)}|,~~i=1,2,~j=1,\ldots ,5. \end{aligned}$$
(37)

It follows from (35), (36) and (37) that

$$\begin{aligned} |g_{k+5}^{(2)}|&\le (\lambda -1)^{(5-j+1)}q_{k+j-2}|g_{k+j-1}^{(2)}|\nonumber \\&\le (\lambda -1)^{5}q_{k+j-2}|g_{k}^{(2)}|,~~j=1,\ldots ,5, \end{aligned}$$
(38)

which indicates that

$$\begin{aligned} |g_{k+5}^{(2)}|\le (\lambda -1)^5\left( \min _{-1\le i\le 3}q_{k+i}\right) |g_k^{(2)}|. \end{aligned}$$
(39)

As \(M_k=\log q_k\), we know by Lemma 2 and (39) that

$$\begin{aligned} |g_{k+5}^{(2)}|\le (\lambda -1)^5\exp \left( -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1\right) |g_k^{(2)}|. \end{aligned}$$

Similarly to (35), we have

$$\begin{aligned} |g_{k+1}^{(1)}|&=\frac{(\lambda -1) \left[ \gamma _k(\lambda ^2+q_{k-1})+ (1-\gamma _k)\lambda (\lambda +q_{k-1})\right] }{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(1)}|\\&\le \frac{\lambda (\lambda -1)(\lambda +q_{k-1})}{(\lambda +q_{k-1})(\lambda ^2+q_{k-1})} |g_k^{(1)}|\\&<\lambda (\lambda -1)\frac{1}{q_{k-1}}|g_k^{(1)}|, \end{aligned}$$

which together with (37) yields that

$$\begin{aligned} |g_{k+5}^{(1)}|&\le \lambda (\lambda -1)^5\frac{1}{\max \limits _{-1\le i\le 3}q_{k+i}}|g_k^{(1)}|\nonumber \\&\le \lambda (\lambda -1)^5\exp \left( -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1\right) |g_k^{(1)}|. \end{aligned}$$
(40)

Combining the above bounds on \(|g_{k+5}^{(1)}|\) and \(|g_{k+5}^{(2)}|\), for any \(k\ge 2\), we have

$$\begin{aligned} \Vert g_{k+5}\Vert \le \lambda (\lambda -1)^5\exp \left( -(\sqrt{2}-1)^22^{\frac{k}{2}}c_1+2c_1\right) \Vert g_k\Vert . \end{aligned}$$

That is, \(\{\Vert g_k\Vert \}\) converges to zero R-superlinearly. \(\square \)

Theorem 2, together with the analysis for the BB method in [2, 8], shows that, for any \(\gamma _k\in [0,1]\), the family (15) is R-superlinearly convergent for two-dimensional strictly convex quadratics.
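
The two-dimensional analysis can be illustrated by iterating the gradient recursion \(g_{k+1}=(I-\alpha _kA)g_k\) directly. The following sketch uses an arbitrary \(\lambda \), a fixed \(\gamma _k\equiv 0.5\) and an SD step for the very first iteration, all chosen for illustration; it typically shows \(\Vert g_k\Vert \) decaying faster than any fixed linear rate.

```python
import numpy as np

lam = 50.0
A = np.diag([1.0, lam])
gamma = 0.5                          # fixed gamma_k, for illustration
g = np.array([1.0, 1.0])             # g_1 with both components nonzero, cf. (19)
g_old = None
norms = []

for k in range(60):
    if g_old is None:
        alpha = (g @ g) / (g @ (A @ g))                            # SD step to start
    else:
        bb1 = (g_old @ g_old) / (g_old @ (A @ g_old))              # alpha_k^{BB1}
        bb2 = (g_old @ (A @ g_old)) / (g_old @ (A @ (A @ g_old)))  # alpha_k^{BB2}
        alpha = gamma * bb1 + (1.0 - gamma) * bb2                  # stepsize (15)
    g_old = g
    g = (np.eye(2) - alpha * A) @ g                                # g_{k+1} = (I - alpha_k A) g_k
    norms.append(np.linalg.norm(g))
    if norms[-1] < 1e-15:
        break

print(norms)
```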

3.2 n-dimensional case

In this subsection, we show R-linear convergence of the family (15) for n-dimensional quadratics.

Dai [7] proved that if A has the form (18) with \(1=\lambda _1\le \lambda _2\le \cdots \le \lambda _n\) and the stepsizes of the gradient method (2) satisfy the following Property (A), then either \(g_k=0\) for some finite k or the sequence \(\{\Vert g_k\Vert \}\) converges to zero R-linearly.

Property (A)

[7] Suppose that there exist an integer m and positive constants \(M_1\ge \lambda _1\) and \(M_2\) such that

  (i) \(\lambda _1\le \alpha _k^{-1}\le M_1\);

  (ii) for any integer \(l\in [1,n-1]\) and \(\epsilon >0\), if \(G(k-j,l)\le \epsilon \) and \((g_{k-j}^{(l+1)})^2\ge M_2\epsilon \) hold for \(j\in [0,\min \{k,m\}-1]\), then \(\alpha _k^{-1}\ge \frac{2}{3}\lambda _{l+1}\).

Here,

$$\begin{aligned} G(k,l)=\sum _{i=1}^l(g_k^{(i)})^2. \end{aligned}$$

Now we show R-linear convergence of the proposed family by proving that the stepsize (15) satisfies Property (A).

Theorem 3

Suppose that the sequence \(\{\Vert g_k\Vert \}\) is generated by the family (15) applied to an n-dimensional quadratic whose matrix A has the form (18) with \(1=\lambda _1\le \lambda _2\le \cdots \le \lambda _n\). Then either \(g_k=0\) for some finite k or the sequence \(\{\Vert g_k\Vert \}\) converges to zero R-linearly.

Proof

Let \(M_1=\lambda _n\) and \(M_2=2\). By (15) and the fact that \(\gamma _k\in [0,1]\), we have

$$\begin{aligned} \alpha _{k}^{BB2}\le \alpha _k\le \alpha _{k}^{BB1}. \end{aligned}$$

Thus, (i) of Property (A) holds. If \(G(k-j,l)\le \epsilon \) and \((g_{k-j}^{(l+1)})^2\ge M_2\epsilon \) hold for \(j\in [0,\min \{k,m\}-1]\), we have

$$\begin{aligned} \alpha _{k}^{-1}&\ge \frac{1}{\alpha _{k}^{BB1}}=\frac{1}{\alpha _{k-1}^{SD}} =\frac{\sum _{i=1}^n\lambda _i(g^{(i)}_{k-1})^2}{\sum _{i=1}^n(g^{(i)}_{k-1})^2}\nonumber \\&\ge \frac{\lambda _{l+1}\sum _{i=l+1}^n(g^{(i)}_{k-1})^2}{\epsilon +\sum _{i=l+1}^n(g^{(i)}_{k-1})^2}\nonumber \\&\ge \frac{M_2}{M_2+1}\lambda _{l+1} =\frac{2}{3}\lambda _{l+1}. \end{aligned}$$

Here, the second inequality uses \(G(k-1,l)\le \epsilon \), and the last one follows from \(\sum _{i=l+1}^n(g^{(i)}_{k-1})^2\ge (g_{k-1}^{(l+1)})^2\ge M_2\epsilon \) together with the monotonicity of \(t\mapsto \frac{t}{\epsilon +t}\). That is, (ii) of Property (A) holds. This completes the proof. \(\square \)

4 Selecting \(\gamma _{k}\)

In this section, we present three different selection rules for the parameter \(\gamma _{k}\) of the family (15).

The simplest scheme for choosing \(\gamma _{k}\) is to fix it for all iterations, for example, \(\gamma _{k}=0.1, 0.2\), etc. However, since the information carried by the two BB stepsizes changes as the iterations proceed, such a fixed scheme may deteriorate the performance because it fixes the proportions in which the long BB stepsize \(\alpha _{k}^{BB1}\) and the short BB stepsize \(\alpha _{k}^{BB2}\) contribute to the stepsize \(\alpha _{k}\). Thus, it is better to vary \(\gamma _{k}\) at each iteration.

One direct way to vary \(\gamma _{k}\) is to choose it randomly in the interval (0, 1), as in the randomly relaxed Cauchy method [35]. However, this scheme determines the value of \(\gamma _{k}\) without using any information from the current and former iterations.

The next strategy borrows the idea of cyclic gradient methods [7, 10, 12, 35], where a stepsize is reused for m consecutive iterations. Such a cyclic scheme is superior to its noncyclic counterpart in both theory and practice. Dai and Fletcher [10] showed that if the cyclic length m is greater than n/2, the cyclic SD method is likely to be R-superlinearly convergent. Similar numerical evidence of superlinear convergence was also observed for the CBB1 method in [12]. Motivated by these advantages of the cyclic scheme, for the family (15), we choose \(\gamma _{k}\) such that the current stepsize approximates the former one as much as possible. That is,

$$\begin{aligned} \gamma _{k}=\arg \min _{\gamma \in [0,1]}~\left| \gamma \alpha _{k}^{BB1}+(1-\gamma )\alpha _{k}^{BB2}-\alpha _{k-1}\right| , \end{aligned}$$

which yields

$$\begin{aligned} \gamma _{k}^{C}=\min \left\{ 1,\max \left\{ 0,\frac{\alpha _{k-1}-\alpha _{k}^{BB2}}{\alpha _{k}^{BB1}-\alpha _{k}^{BB2}}\right\} \right\} . \end{aligned}$$
(41)

Clearly, \(\gamma _{k}^{C}=0\) when \(\alpha _{k-1}\le \alpha _{k}^{BB2}\); \(\gamma _{k}^{C}=1\) when \(\alpha _{k-1}\ge \alpha _{k}^{BB1}\). This gives the following stepsize:

$$\begin{aligned} \tilde{\alpha }_{k}=\left\{ \begin{array}{ll} \alpha _{k}^{BB2}, &{}\quad \hbox {if }\alpha _{k-1}\le \alpha _{k}^{BB2}; \\ \alpha _{k}^{BB1}, &{}\quad \hbox {if }\alpha _{k-1}\ge \alpha _{k}^{BB1};\\ \alpha _{k-1}, &{} \quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$
(42)

Recall that, for quadratics, \(\alpha _{k}^{BB1}=\alpha _{k-1}^{SD}\) and \(\alpha _{k}^{BB2}=\alpha _{k-1}^{MG}\), where

$$\begin{aligned} \alpha _{k}^{MG}=\arg \min _{\alpha \in \mathbb {R}} \Vert g(x_{k}-\alpha g_{k})\Vert =\frac{g_{k}^TAg_{k}}{g_{k}^TA^2g_{k}}, \end{aligned}$$

which is a short stepsize and satisfies \(\alpha _{k}^{MG}\le \alpha _{k}^{SD}\). We refer the reader to [15] for additional details on \(\alpha _{k}^{MG}\). It follows from (42) that \(\tilde{\alpha }_{k}\) is adaptively selected and truncated by making use of the information of the former iteration. In particular, the stepsize is increased if the former one is too short (i.e., \(\alpha _{k-1}\le \alpha _{k-1}^{MG}\)) while it is decreased if the former one is too long (i.e., \(\alpha _{k-1}\ge \alpha _{k-1}^{SD}\)). Moreover, the former stepsize will be reused if it lies in \([\alpha _{k}^{BB2},\alpha _{k}^{BB1}]\). Thus, (42) is an adaptive truncated cyclic scheme and we will call it ATC for short.
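
The rule given by (41) and (42) simply clips the former stepsize to the interval \([\alpha _{k}^{BB2},\alpha _{k}^{BB1}]\); a minimal sketch (function name ours, for illustration) reads:

```python
def atc_stepsize(alpha_prev, bb1, bb2):
    """Adaptive truncated cyclic stepsize (42); equivalent to (15) with gamma_k from (41)."""
    if alpha_prev <= bb2:
        return bb2          # former stepsize too short: increase it
    if alpha_prev >= bb1:
        return bb1          # former stepsize too long: decrease it
    return alpha_prev       # otherwise reuse the former stepsize
```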

As in cyclic methods, we refresh the stepsize every m iterations to avoid reusing a stepsize for too many iterations. Many different stepsizes can be employed for the refresh; here, we suggest three candidates. The first is the long BB stepsize \(\alpha _{k}^{BB1}\), i.e.,

$$\begin{aligned} \alpha _{k}^{ATC1}=\left\{ \begin{array}{ll} \alpha _{k}^{BB1}, &{} \quad \hbox {if }mod(k,m)=0;\\ \tilde{\alpha }_{k}, &{} \quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$
(43)

The second is the short BB stepsize \(\alpha _{k}^{BB2}\), i.e.,

$$\begin{aligned} \alpha _{k}^{ATC2}=\left\{ \begin{array}{ll} \alpha _{k}^{BB2}, &{} \quad \hbox {if }mod(k,m)=0;\\ \tilde{\alpha }_{k}, &{}\quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$
(44)

The last is \(\alpha _{k}^P\) given by (16), which is a special case of the family (15). That is,

$$\begin{aligned} \alpha _{k}^{ATC3}=\left\{ \begin{array}{ll} \alpha _{k}^P, &{}\quad \hbox {if }mod(k,m)=0;\\ \tilde{\alpha }_{k}, &{}\quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$
(45)

In what follows we shall refer to (43), (44) and (45) as ATC1, ATC2 and ATC3, respectively.
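
Putting the pieces together, the following sketch implements the ATC1 iteration (43) for the quadratic (3); the initial stepsize, the toy problem and the parameter values are illustrative assumptions and not the MATLAB setup used in Sect. 5.

```python
import numpy as np

def atc1(A, b, x, m=8, tol=1e-6, max_iter=20000):
    """Sketch of the ATC1 method (43): refresh with alpha_BB1 every m iterations,
    otherwise use the adaptive truncated cyclic stepsize (42)."""
    g = A @ x - b
    g_norm0 = np.linalg.norm(g)
    alpha = 1.0 / np.max(np.diag(A))               # illustrative initial stepsize
    for k in range(1, max_iter + 1):
        if np.linalg.norm(g) <= tol * g_norm0:     # stopping rule (46)
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        bb1 = (s @ s) / (s @ y)
        bb2 = (s @ y) / (y @ y)
        if k % m == 0:
            alpha = bb1                            # refresh with the long BB stepsize
        else:
            alpha = min(max(alpha, bb2), bb1)      # truncation (42)
        x, g = x_new, g_new
    return x, k

A = np.diag(np.linspace(1.0, 1e4, 1000))           # toy diagonal problem, kappa = 1e4
b = np.ones(1000)
x_star, iters = atc1(A, b, np.zeros(1000))
print(iters, np.linalg.norm(A @ x_star - b))
```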

We tested the family (15) with fixed \(\gamma _{k}\) on quadratic problems to see how the value of \(\gamma _{k}\) affects the performance. In particular, \(\gamma _{k}\) was fixed at 0.1, 0.3, 0.5, 0.7 and 0.9 for all k, respectively. The examples in [15, 24, 40] were employed, where \(A=QVQ^T\) with

$$\begin{aligned} Q=(I-2w_3w_3^T)(I-2w_2w_2^T)(I-2w_1w_1^T), \end{aligned}$$

and \(w_1\), \(w_2\), and \(w_3\) are unit random vectors, \(V=\text {diag}(v_1,\ldots ,v_n)\) is a diagonal matrix with \(v_1=1\), \(v_n=\kappa \), and \(v_j\), \(j=2,\ldots ,n-1\), randomly generated between 1 and \(\kappa \). We stopped the iteration when the number of iterations exceeded 20,000 or

$$\begin{aligned} \Vert g_k\Vert \le \epsilon \Vert g_1\Vert \end{aligned}$$
(46)

is satisfied for some given \(\epsilon \).

Four values of the condition number \(\kappa \): \(10^3, 10^4, 10^5, 10^6\) as well as three values of \(\epsilon \): \(10^{-6}, 10^{-9}, 10^{-12}\) were used. For each value of \(\kappa \) and \(\epsilon \), 10 instances with \(v_j\) evenly distributed in \([1,\kappa ]\) were generated. For each instance, the entries of b were randomly generated in \([-10,10]\) and the vector \(e=(1,\ldots ,1)^T\) was used as the starting point.
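
For reproducibility, a sketch of this problem generator is given below (in Python; the uniform random sampling of \(v_j\) is our reading of "evenly distributed", and the generator only approximates the MATLAB setup).

```python
import numpy as np

def random_test_problem(n, kappa, rng):
    """A = Q V Q^T with Q a product of three Householder reflections and
    spectrum v_1 = 1, v_n = kappa, v_j random in [1, kappa]."""
    Q = np.eye(n)
    for _ in range(3):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)                    # unit random vector
        Q = Q @ (np.eye(n) - 2.0 * np.outer(w, w))
    v = np.empty(n)
    v[0], v[-1] = 1.0, kappa
    v[1:-1] = rng.uniform(1.0, kappa, n - 2)
    A = Q @ np.diag(v) @ Q.T
    b = rng.uniform(-10.0, 10.0, n)               # right-hand side in [-10, 10]
    return A, b

A, b = random_test_problem(100, 1e4, np.random.default_rng(1))
x0 = np.ones(100)                                 # starting point e = (1, ..., 1)^T
```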

The BB method, i.e., the case \(\gamma _{k}=1\), was also run for comparison. We compared the performance of the algorithms by the required number of iterations, using performance profiles as described in [21]. In other words, for each method, we plot the fraction of problems for which the method is within a factor \(\rho \) of the fewest iterations. For the 100-dimensional case, we can see from Fig. 1 that the performance of the family improves as \(\gamma _{k}\) becomes larger. However, for the 1000-dimensional case, Fig. 2 shows that the family (15) with \(\gamma _{k}=0.7\) or 0.9 can outperform the BB method for some \(\rho \) around 1.5. That is, for some problems, the long BB stepsize \(\alpha _{k}^{BB1}\) may not be the best choice in the family.

Fig. 1 Performance profile of the family (15) with fixed \(\gamma _{k}\), based on the number of iterations, for 100-dimensional problems

Fig. 2 Performance profile of the family (15) with fixed \(\gamma _{k}\), based on the number of iterations, for 1000-dimensional problems

Then, we applied ATC1, ATC2 and ATC3 to the above problem with \(n=1000\) and different \(v_j\) distributions. In particular, two sets were generated: (1) \(v_j\) evenly distributed in \([1,\kappa ]\), as in the former example; (2) 20% of the \(v_j\) evenly distributed in [1, 100] and the others in \([\frac{\kappa }{2},\kappa ]\). We used three values of the condition number, \(\kappa =10^4, 10^5, 10^6\), and the same three values of \(\epsilon \) as above. Other settings were the same as in the former example.

Table 1 Number of iterations of ATC1, ATC2 and ATC3 with different m on problems with different spectrum distributions

We tested the three methods with different values of m. The average numbers of iterations are presented in Table 1. We can see that, for each \(\kappa \) and m, ATC1 often outperforms ATC2 and ATC3. The performance of the three methods does not improve as m increases. For the first set of problems, ATC1 performs best with \(m=30\); for the second set, ATC1 with \(m=8\) dominates the other choices. Thus, in the next section we only run ATC1 with these settings.

We close this section by mentioning that there are many other rules for choosing the parameter \(\gamma _{k}\). For example, as in the alternate gradient method [7, 11], we can choose \(\gamma _{k}\) to alternate between short and long stepsizes. In addition, more sophisticated random schemes for \(\gamma _{k}\) can also be used; see [30] and references therein.

5 Numerical results

In this section, we present numerical results of the family (15) with different settings of the parameter \(\gamma _k\). We compare the performance of the ATC1 method with the following methods: the family (15) with \(\gamma _k\) randomly chosen in (0, 1) (RAND), BB [2], ALBB [11], ABB [40], CBB1 [12], CBB2, CP, Dai-Yuan (DY) [16], ABBmin1 and ABBmin2 [23], SDC [17], and the family (15) with the basic adaptive truncated cyclic scheme (42) (ATC). Since the SDC method performs better than its monotone counterpart for most problems, we only run SDC.

All methods were implemented in MATLAB (v.9.0-R2016a). All runs were carried out on a PC with an Intel Core i7, 2.9 GHz processor and 8 GB of RAM running Windows 10. Moreover, we stopped the iteration when the number of iterations exceeded 20,000 or (46) was satisfied for some given \(\epsilon \).

Firstly, we considered the randomly generated problems of the former section with different \(v_j\) distributions, as shown in Table 2. The first two sets are the same as in the former section. For the third set, half of the \(v_j\) are in [1, 100] and the other half in \([\frac{\kappa }{2},\kappa ]\); for the fourth set, 80% of the \(v_j\) are evenly distributed in [1, 100] and the others in \([\frac{\kappa }{2},\kappa ]\). The fifth set has 20% of the \(v_j\) in [1, 100], 20% in \([100,\frac{\kappa }{2}]\), and the others in \([\frac{\kappa }{2},\kappa ]\). The last two sets have either only 10 small \(v_j\) or only 10 large \(v_j\). Other settings were the same as in the former section.

For the three cyclic methods CBB1, CBB2 and CP, the best m among \(\{3,4,\ldots ,10\}\) was chosen; in particular, \(m=3\) for CBB1 and \(m=4\) for CBB2 and CP. As in [23], the parameters used by the ABBmin1 and ABBmin2 methods were set to \(\tau =0.8\), \(m=9\) and \(\tau =0.9\), respectively, while \(\tau =0.1\) was used for the ABB method, which performed better than 0.15 and 0.2 in our tests. The pair (h, m) of the SDC method was set to (8, 6), which is better than other choices. As for our ATC1 method, we set \(m=30\) for the first and fifth sets of problems and \(m=8\) for the other sets.

Table 2 Distributions of \(v_j\)

It can be observed from Table 3 that the RAND scheme performs worse than the fixed scheme with \(\gamma _k=1\), i.e., the BB method. The ATC method is competitive with the BB method and dominates the ALBB and CP methods on most test problems. The CBB2 method clearly outperforms the other two cyclic methods, CBB1 and CP. Moreover, the CBB2 method performs much better than the ATC and ABB methods except on the first and fifth sets of problems. The CBB2 method is even faster than the DY method. Although the ABBmin2 method is the fastest one on the first and sixth sets of problems, it is worse than the ABBmin1, SDC and ATC1 methods on the other five sets. Our ATC1 method is faster than the CBB2 method except on the last set of problems. In addition, the ATC1 method is much better than the ABBmin1 and SDC methods on the first set of problems and comparable to them on the other sets. Furthermore, for each tolerance, our ATC1 method is the fastest one in terms of total number of iterations.

Table 3 Number of iterations of compared methods on problems in Table 2

Then, the compared methods were applied to the non-rand quadratic minimization problem in [17], which is more difficult than its rand counterpart. In particular, A is diagonal with elements given by

$$\begin{aligned} A_{jj}=10^{\frac{ncond}{n-1}(n-j)}, ~~~j=1,\ldots ,n, \end{aligned}$$
(47)

where \(ncond=\log _{10} \kappa \) and the vector b is the zero vector. We tested 10,000-dimensional problems with three different condition numbers: \(\kappa =10^4, 10^5, 10^6\). The stopping condition (46) was employed with \(\epsilon =10^{-6}, 10^{-9}, 10^{-12}\) for all methods. For each value of \(\kappa \) and \(\epsilon \), 10 different starting points with entries randomly generated in \([-10,10]\) were used.
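
A sketch of this generator is given below; the diagonal is returned as a vector so that the gradient can be formed without storing a dense n-by-n matrix.

```python
import numpy as np

def nonrand_problem(n, kappa):
    """Diagonal of the test matrix (47); b is the zero vector."""
    ncond = np.log10(kappa)
    j = np.arange(1, n + 1)
    d = 10.0 ** (ncond / (n - 1) * (n - j))       # A_jj = 10^{ncond (n-j)/(n-1)}
    return d, np.zeros(n)

d, b = nonrand_problem(10000, 1e6)
print(d[0], d[-1])                                # kappa and 1: condition number kappa
```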

Based on the performance of the compared methods on the above problems, only the faster methods were tested, i.e., ABB, CBB2, DY, ABBmin1, ABBmin2, SDC, ATC and ATC1. The number of iterations of each method, averaged over those starting points, is listed in Table 4. Here, the results of the SDC method were obtained with the best parameter setting in [17], i.e., \(h=30\) and \(m=2\).

From Table 4 we can see that the ATC method is slightly better than the CBB2 method and comparable to the DY method for most problems. The ABB method performs better than the ABBmin1 method. Although the ABB method is slower than the ABBmin2 method for loose tolerances, it wins when a tight tolerance is required. Our ATC1 method is competitive with the other methods, especially for tight tolerances. Moreover, our ATC1 method requires the fewest total iterations to meet the required tolerance.

Table 4 Number of iterations of compared methods on non-rand quadratic problems

6 Conclusions

We have proposed a family of spectral gradient methods whose stepsize is a convex combination of the long BB stepsize and the short BB stepsize, i.e., \(\alpha _k=\gamma _k\alpha _k^{BB1}+(1-\gamma _k)\alpha _k^{BB2}\). Similar to the two BB stepsizes, each stepsize in the family possesses a certain quasi-Newton property. In addition, R-superlinear and R-linear convergence of the family were established for two-dimensional and n-dimensional strictly convex quadratics, respectively. Furthermore, different choices of the parameter \(\gamma _k\) yield different stepsizes and gradient methods. The family thus provides an alternative stepsize for gradient methods and can easily be extended to general problems by incorporating suitable line searches.

Since the parameter \(\gamma _k\) affects the value of \(\alpha _k\) and hence the efficiency of the method, it is interesting to investigate how to choose a proper \(\gamma _k\) to achieve satisfactory performance. We have proposed and tested three different selection rules for \(\gamma _k\), among which the adaptive truncated cyclic scheme with the long BB stepsize \(\alpha _k^{BB1}\), i.e., the ATC1 method performs best. In addition, our ATC1 method is comparable to the state-of-the-art gradient methods including the Dai-Yuan, ABBmin1, ABBmin2 and SDC methods. One interesting question is how to design an efficient scheme to choose m adaptively for the proposed method. This will be our future work.