1 Introduction

The gradient method is a well-known scheme for minimizing a smooth function \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\); it updates the iterates by

$$\begin{aligned} x_{k+1}=x_{k}-\alpha _{k}g_{k}, \end{aligned}$$
(1)

where \(g_{k} = \nabla f\left( x_{k}\right)\) and the stepsize \(\alpha _{k}>0\) depends on the method under consideration. The classic steepest descent (SD) [1] and minimal gradient (MG) [2] methods determine \(\alpha _{k}\) by minimizing \(f\left( x_{k}-\alpha g_{k}\right)\) and \(\left\| g\left( x_{k}-\alpha g_{k}\right) \right\| _2\), respectively. Theoretically, the SD and MG methods asymptotically zigzag between two directions, which often yields poor performance in many problems [3,4,5,6,7].

In 1988, Barzilai and Borwein [8] proposed the following two novel stepsizes from the viewpoint of quasi-Newton methods,

$$\begin{aligned} \alpha _{k}^{B B 1}=\arg \min _{\alpha \in \mathbb {R}}\left\| \alpha ^{-1} s_{k-1}-y_{k-1}\right\| _{2}=\frac{s_{k-1}^{T} s_{k-1}}{s_{k-1}^{T} y_{k-1}} \end{aligned}$$
(2)

and

$$\begin{aligned} \alpha _{k}^{B B 2}=\arg \min _{\alpha \in \mathbb {R}}\left\| s_{k-1}-\alpha y_{k-1}\right\| _{2}=\frac{s_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}, \end{aligned}$$
(3)

where \(s_{k-1}=x_{k}-x_{k-1}\) and \(y_{k-1}=g_{k}-g_{k-1}\). By the Cauchy–Schwarz inequality, \(\alpha _{k}^{BB1} \ge \alpha _{k}^{BB2}\) whenever \(s_{k-1}^{T}y_{k-1} > 0\). When f(x) is a quadratic function

$$\begin{aligned} f\left( x\right) =\frac{1}{2}x^{T}Ax-b^{T}x, \end{aligned}$$
(4)

where \(A\in \mathbb {R}^{n\times n}\) is a real symmetric positive definite matrix and \(b\in \mathbb {R}^{n}\), \(\alpha _{k}^{B B 1}\) and \(\alpha _{k}^{B B 2}\) can be regarded as retarded SD and MG stepsizes, respectively, i.e.

$$\begin{aligned} \alpha _{k}^{B B 1}=\frac{g_{k-1}^{T} g_{k-1}}{g_{k-1}^{T} A g_{k-1}}=\alpha _{k-1}^{SD} \quad \text{ and } \quad \alpha _{k}^{B B 2}=\frac{g_{k-1}^{T} A g_{k-1}}{g_{k-1}^{T} A^{2} g_{k-1}}=\alpha _{k-1}^{MG}. \end{aligned}$$

Barzilai and Borwein [8] proved that the BB method converges R-superlinearly for two dimensional strictly convex quadratic functions. It has been shown that the BB method is globally convergent [9] and R-linearly convergent [10] for quadratics of any dimension. Although the BB method is nonmonotone, extensive numerical results indicate that it performs much better than the SD method [11,12,13]. See [14,15,16,17,18,19,20,21] for more BB-like methods.
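For illustration, the following minimal sketch (written in Python/NumPy rather than the MATLAB used for the experiments in Sect. 4) applies the gradient method (1) with the BB stepsizes (2)–(3) to the quadratic (4); the random test problem, the tolerance, and the SD stepsize used at the first iteration are assumptions made only for this example.

```python
import numpy as np

def bb_gradient(A, b, x0, use_bb1=True, tol=1e-8, max_iter=5000):
    """Gradient method (1) with the BB1 (2) or BB2 (3) stepsize on
    f(x) = 0.5 x'Ax - b'x.  A sketch: the first stepsize is the SD one
    since no (s, y) pair is available yet."""
    x = x0.copy()
    g = A @ x - b
    alpha = (g @ g) / (g @ (A @ g))           # exact line search for the first step
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g           # s_{k-1}, y_{k-1}
        alpha = (s @ s) / (s @ y) if use_bb1 else (s @ y) / (y @ y)
        x, g = x_new, g_new
    return x, k

# Illustrative SPD test problem.
rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 50.0 * np.eye(50)
b = rng.standard_normal(50)
x_bb1, it1 = bb_gradient(A, b, np.zeros(50))
print(it1, np.linalg.norm(A @ x_bb1 - b))
```

For symmetric positive definite A one has \(s_{k-1}^{T}y_{k-1} = s_{k-1}^{T}As_{k-1} > 0\), so both stepsizes in the sketch are well defined.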

In [22], Yuan derived a new stepsize which, together with the SD method, produces the minimizer of a two dimensional strictly convex quadratic function in three iterations. In what follows, if a method gives the exact minimizer of a two dimensional convex quadratic function within finitely many iterations, we say that it has the property of two dimensional quadratic termination. Based on the following variant of the Yuan stepsize,

$$\begin{aligned} \alpha _{k}^{\mathrm {DY}}=\frac{2}{\sqrt{\left( \frac{1}{\alpha _{k-1}^{\mathrm {SD}}}- \frac{1}{\alpha _{k}^{\mathrm {SD}}}\right) ^{2}+\frac{4\left\| g_{k}\right\| _2^{2}}{\left( \alpha _{k-1}^{\mathrm {SD}}\right) ^{2}\left\| g_{k-1}\right\| _2^{2}}}+ \frac{1}{\alpha _{k-1}^{\mathrm {SD}}}+\frac{1}{\alpha _{k}^{\mathrm {SD}}}}, \end{aligned}$$
(5)

Dai and Yuan [23] suggested the so-called Dai–Yuan (DY) gradient method with

$$\begin{aligned} \alpha _{k} =\left\{ \begin{array}{ll} \alpha _{k}^{S D}, &{} \text{ if } \bmod (k, 4)<2; \\ \alpha _{k}^{D Y}, &{} \text{ otherwise. } \end{array}\right. \end{aligned}$$
(6)

Clearly, \(\alpha _{k}^{DY}\le \min \{\alpha _{k}^{SD},\alpha _{k-1}^{SD}\}\), which implies that the DY method is monotone. Interestingly, the DY method can even outperform the nonmonotone BB method. Recently, Huang et al. [5] derived a new stepsize, say \(\alpha _{k}^{H}\), such that the gradient method with stepsize

$$\begin{aligned} \alpha _{k} = \frac{g_{k}^{T} \psi (A) g_{k}}{g_{k}^{T} \psi (A) A g_{k}} \end{aligned}$$

together with \(\alpha _{k}^{H}\) achieves the two dimensional quadratic termination, where \(\psi\) is a real analytic function on \([\lambda _{1},\lambda _{n}]\) that can be expressed by the Laurent series

$$\begin{aligned} \psi (z)=\sum _{k=-\infty }^{\infty } d_{k} z^{k},~d_{k} \in \mathbb {R}, \end{aligned}$$
(7)

such that \(0<\sum _{k=-\infty }^{\infty } d_{k} z^{k}<+\infty\) for all \(z \in \left[ \lambda _{1}, \lambda _{n}\right]\). Here, \(\lambda _{1}\) and \(\lambda _{n}\) are the smallest and largest eigenvalues of A, respectively. Furthermore, \(\alpha _{k}^{H}\) reduces to \(\alpha _{k}^{DY}\) when \(\psi (A)=I\). The property of two dimensional quadratic termination has shown great potential in improving the performance of gradient methods; see [24, 5, 25] for example.
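Returning to the concrete formulas (5)–(6), a minimal sketch of the DY method on the quadratic (4) might look as follows (Python rather than the MATLAB used later in the paper; the stopping test and iteration cap are illustrative assumptions).

```python
import numpy as np

def dy_gradient(A, b, x0, tol=1e-8, max_iter=10000):
    """Dai-Yuan gradient method (6): SD steps when mod(k, 4) < 2, otherwise
    the Yuan-type stepsize (5).  A sketch for the quadratic (4)."""
    x = x0.copy()
    g = A @ x - b
    sd_prev = gnorm_prev = None
    for k in range(max_iter):
        gnorm = np.linalg.norm(g)
        if gnorm <= tol:
            break
        sd = (g @ g) / (g @ (A @ g))            # alpha_k^SD
        if k % 4 < 2 or sd_prev is None:
            alpha = sd
        else:
            # stepsize (5), built from alpha_{k-1}^SD, alpha_k^SD, ||g_k||, ||g_{k-1}||
            root = np.sqrt((1/sd_prev - 1/sd)**2
                           + 4*gnorm**2 / (sd_prev**2 * gnorm_prev**2))
            alpha = 2.0 / (root + 1/sd_prev + 1/sd)
        sd_prev, gnorm_prev = sd, gnorm
        x = x - alpha * g
        g = A @ x - b
    return x, k
```

As noted above, \(\alpha _{k}^{DY}\le \min \{\alpha _{k}^{SD},\alpha _{k-1}^{SD}\}\), so the sketched iteration remains monotone.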

To our knowledge, there is still a lack of theoretical analysis of the two dimensional quadratic termination for the Dai–Yang method [26], whose stepsize is given by

$$\begin{aligned} \alpha _{k}^{P}=\frac{\left\| g_{k}\right\| _2}{\left\| A g_{k}\right\| _2}. \end{aligned}$$
(8)

A remarkable property of the Dai–Yang method is that \(\alpha _{k}^{P}\) converges to the optimal stepsize \(\frac{2}{\lambda _{1}+\lambda _{n}}\) (optimal in the sense that it minimizes \(\Vert I-\alpha A\Vert _{2}\); see [26, 27]). Moreover, the Dai–Yang method is able to recover the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\).
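The optimal-stepsize property is easy to observe numerically; the short sketch below runs the stepsize (8) on an illustrative diagonal quadratic and compares the last computed stepsize with \(2/(\lambda _{1}+\lambda _{n})\). The spectrum, starting point, and iteration count are assumptions made only for this example.

```python
import numpy as np

def dai_yang_stepsizes(A, x0, iters=300):
    """Gradient method with the Dai-Yang stepsize (8) on f = 0.5 x'Ax (b = 0);
    returns the sequence of stepsizes.  A sketch for illustration."""
    g = A @ x0
    alphas = []
    for _ in range(iters):
        Ag = A @ g
        alpha = np.linalg.norm(g) / np.linalg.norm(Ag)
        alphas.append(alpha)
        g = g - alpha * Ag                 # g_{k+1} = (I - alpha_k A) g_k
    return np.array(alphas)

lam = np.linspace(1.0, 100.0, 20)          # lambda_1 = 1, lambda_n = 100
alphas = dai_yang_stepsizes(np.diag(lam), np.ones(20))
print(alphas[-1], 2.0 / (lam[0] + lam[-1]))   # alpha_k approaches 2/(lambda_1+lambda_n)
```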

In this paper, for a uniform analysis, we consider equipping the family

$$\begin{aligned} \alpha _{k}=\frac{\left\| \psi (A)g_{k}\right\| _2}{\left\| \psi (A)A g_{k}\right\| _2} \end{aligned}$$
(9)

with the two dimensional quadratic termination property, which is achieved by combining it with the stepsize

$$\begin{aligned} \tilde{\alpha }_{k}=\arg \max _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}= \arg \max _{\alpha _{k} \in \mathbb {R}}~\frac{\left\| \psi (A)\left( I-\alpha _{k}A\right) g_{k}\right\| _2}{\left\| \psi (A)A \left( I-\alpha _{k}A\right) g_{k}\right\| _2}. \end{aligned}$$
(10)

The above strategy of maximizing the stepsize in the next iteration has been employed in [28] for the SD method. However, the analysis in [28] cannot be applied directly to the family (9). Clearly, \(\alpha _{k}^{P}\) corresponds to the case \(\psi (A) = I\) in (9). We prove that each method in the family (9) asymptotically alternates in a two dimensional subspace spanned by the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\). In addition, for any given \(\psi\), the stepsize (9) tends to the above optimal stepsize as \(k\rightarrow \infty\), and the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\) can be obtained. Then, we show that \(\lim _{k \rightarrow \infty }\tilde{\alpha }_{k} =1 /\lambda _{n}\) for n dimensional strictly convex quadratics. By adaptively taking \(\alpha _{k}^{BB1}\) and reusing \(\tilde{\alpha }_{k-1}\) for some iterations, we propose a new method for quadratic minimization problems. It is proved that the proposed method is R-linearly convergent with the rate \(1-1/\kappa\), where \(\kappa =\lambda _{n} / \lambda _{1}\) is the condition number of A. Our numerical comparisons with the BB1 [8], DY [23], SL (Alg.1 in [25]), ABBmin2 [28], SDC [29], and MGC [7] methods for solving random and non-random unconstrained quadratic optimization problems demonstrate that the proposed method is very efficient. Further, numerical experiments on quadratic problems whose Hessians are chosen from the SuiteSparse Matrix Collection [30] suggest that the proposed method is very competitive with the above methods.

The paper is organized as follows. In Sect. 2, we derive the new stepsize \(\tilde{\alpha }_{k}\) and analyze its properties. The asymptotic behavior of the family (9) is also analyzed. Our new algorithm for quadratic minimization problems as well as its R-linear convergence are presented in Sect. 3. Section 4 presents some numerical comparisons of the proposed method and other successful gradient methods on solving quadratic problems. Finally, in Sect. 5 we give some concluding remarks.

2 A new stepsize and its properties

In this section, we derive the formula of \(\tilde{\alpha }_{k}\) and analyze its properties.

To obtain \(\tilde{\alpha }_{k}\), we consider

$$\begin{aligned} F(\alpha _{k}):=\alpha _{k+1}^{2}= \frac{\left\| \psi (A)\left( I-\alpha _{k}A\right) g_{k}\right\| _2^{2}}{\left\| \psi (A)A \left( I-\alpha _{k}A\right) g_{k}\right\| _2^{2}}. \end{aligned}$$
(11)

The maximum value of \(F(\alpha _{k})\) is achieved when \(F'(\alpha _{k}) = 0\), which holds for any \({\alpha }_{k}\) satisfying

$$\begin{aligned} \phi _{1} \alpha _{k}^{2}-\phi _{2} \alpha _{k}+\phi _{3}=0, \end{aligned}$$
(12)

where \(\phi _{1}=c_{1} c_{4}-c_{2} c_{3}\), \(\phi _{2}=c_{0} c_{4}-c_{2}^{2}\) and \(\phi _{3}=c_{0} c_{3}-c_{1}c_{2}\) with

$$\begin{aligned} c_{j} = g_{k}^{T} A^{j} \psi ^{2}(A) g_{k}, \qquad j = 0,1,2,3,4. \end{aligned}$$
(13)

The following lemma guarantees that (12) has two distinct real roots.

Lemma 1

Assume \(g_{k} \ne 0\) and \(g_{k}\) is not parallel to \(Ag_{k}\). Then, \(\phi _{1}, \phi _{2}, \phi _{3} > 0\) and \(\phi _{2}^2 - 4\phi _{1}\phi _{3} > 0\).

Proof

It follows from the Cauchy–Schwarz inequality and (13) that

$$\begin{aligned} \begin{aligned} \left\| \psi (A)A^{\frac{j}{2}}g_{k}\right\| _2^2 \left\| \psi (A)A^{\frac{j+2}{2}}g_{k}\right\| _2^2&>\left( \left( \psi (A)A^{\frac{j}{2}}g_{k}\right) ^T\left( \psi (A)A^{\frac{j+2}{2}}g_{k}\right) \right) ^2\\&=\left( g_{k}^{T}\psi ^{2}(A)A^{j+1}g_{k}\right) ^{2} \end{aligned} \end{aligned}$$

for \(j\ge 0\), where the inequality is strict because \(\psi (A)A^{\frac{j}{2}}\) is nonsingular and \(g_{k}\) is not parallel to \(Ag_{k}\). That is, \(c_{j}c_{j+2} > c_{j+1}^{2}\), i.e., \(c_{j} / c_{j+1} > c_{j+1}/ c_{j+2}\), which implies \(\phi _{1}, \phi _{2}, \phi _{3} > 0.\)

By direct calculation, we obtain \(c_{1} \phi _{2} = c_{0} \phi _{1} + c_{2} \phi _{3}\), which implies that

$$\begin{aligned} c_{1}^{2}\phi _{2}^{2} =\left( c_{0} \phi _{1} + c_{2} \phi _{3}\right) ^{2} \ge 4c_{0}c_{2}\phi _{1} \phi _{3}. \end{aligned}$$

Combining this with \(c_{0} c_{2}>c_{1}^{2}\), we have \(\phi _{2}^2 - 4\phi _{1}\phi _{3} > 0\). This completes the proof.\(\square\)

By the quadratic formula, the two roots of (12) are

$$\begin{aligned} \tilde{\alpha }_{k} = \frac{\phi _{2} - \sqrt{\phi _{2}^{2}-4 \phi _{1} \phi _{3}}}{2 \phi _{1}}=\frac{2}{\frac{\phi _{2}}{\phi _{3}} + \sqrt{\left( \frac{\phi _{2}}{\phi _{3}}\right) ^{2}-4 \frac{\phi _{1}}{\phi _{3}}}} \end{aligned}$$
(14)

and

$$\begin{aligned} \hat{\alpha }_{k} = \frac{\phi _{2} + \sqrt{\phi _{2}^{2}-4 \phi _{1} \phi _{3}}}{2 \phi _{1}}= \frac{2}{\frac{\phi _{2}}{\phi _{3}} - \sqrt{\left( \frac{\phi _{2}}{\phi _{3}}\right) ^{2}-4 \frac{\phi _{1}}{\phi _{3}}}}. \end{aligned}$$
(15)

It is easy to see that \(\tilde{\alpha }_{k} = \arg \max _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}\) and \(\hat{\alpha }_{k} =\arg \min _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}\).
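For later use, the quantities in (12)–(15) are straightforward to evaluate; the following sketch computes \(c_j\), the \(\phi\)'s, and the two roots, assuming \(\psi (A)=I\) unless a matrix function \(\psi\) is supplied. It is an illustration of the formulas, not the implementation used in Sect. 4.

```python
import numpy as np

def roots_of_quadratic_12(A, g, psi=None):
    """Return (alpha_tilde, alpha_hat) from (14)-(15).

    c_j = g' A^j psi^2(A) g is evaluated as v' A^j v with v = psi(A) g,
    which is valid because psi(A) is symmetric and commutes with A.
    psi(A) = I is assumed when psi is None."""
    v = g if psi is None else psi(A) @ g
    c = [v @ np.linalg.matrix_power(A, j) @ v for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] ** 2
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = np.sqrt(phi2 ** 2 - 4.0 * phi1 * phi3)   # positive by Lemma 1
    return (phi2 - disc) / (2.0 * phi1), (phi2 + disc) / (2.0 * phi1)

# Quick check on a random SPD matrix: alpha_tilde <= alpha_hat.
rng = np.random.default_rng(1)
Q = rng.standard_normal((6, 6))
A = Q @ Q.T + 6.0 * np.eye(6)
print(roots_of_quadratic_12(A, rng.standard_normal(6)))
```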

The next theorem presents the two dimensional quadratic termination of the gradient method using \(\alpha _{k}\) in (9) and \(\tilde{\alpha }_{k}\).

Theorem 1

(Two dimensional quadratic termination) Consider the gradient method (1) for minimizing the two dimensional quadratic function (4). If the stepsize \(\alpha _{k}\) is given by (9) for all \(k \ne k_{0}\) and \(\alpha _{k_{0}}=\tilde{\alpha }_{k_{0}}\) at the \(k_{0}\)-th iteration where \(k_{0} \ge 1\), it holds that \(g_{k_{0}+i}=0\) for some \(1 \le i \le 3\).

Proof

Without loss of generality, we assume that \(A={\text {diag}}\{1, \lambda \}\) with \(\lambda >1\). Let \(g_{k}^{(1)}\) and \(g_{k}^{(2)}\) be the first and second components of \(g_{k}\), respectively. Notice that

$$\begin{aligned} c_{j}=g_{k}^{T} \psi ^{2}(A) A^{j} g_{k}=\left( g_k^{(1)}\right) ^{2} \psi ^{2}(1)+\lambda ^{j}\left( g_k^{(2)}\right) ^{2} \psi ^{2}(\lambda ). \end{aligned}$$

After direct calculation and simplification, we get

$$\begin{aligned} \begin{aligned} \phi _{1}&=\left( g_k^{(1)}\right) ^{2}\left( g_k^{(2)}\right) ^{2} \psi ^{2}(1) \psi ^{2}(\lambda )(\lambda -1)^{2}(\lambda +1) \lambda , \\ \phi _{2}&=\left( g_k^{(1)}\right) ^{2}\left( g_k^{(2)}\right) ^{2} \psi ^{2}(1) \psi ^{2}(\lambda )(\lambda -1)^{2}(\lambda +1)^{2}, \\ \phi _{3}&=\left( g_k^{(1)}\right) ^{2}\left( g_k^{(2)}\right) ^{2} \psi ^{2}(1) \psi ^{2}(\lambda )(\lambda -1)^{2}(\lambda +1). \end{aligned} \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} \frac{\phi _{1}}{\phi _{3}} = \lambda \quad \text {and} \quad \frac{\phi _{2}}{\phi _{3}} = \lambda +1. \end{aligned}$$

Thus, from (14) we know that \(\tilde{\alpha }_{k} = 1/\lambda\) for all \(k\ge 1\). The conclusion then follows from \(\alpha _{k_{0}}=\tilde{\alpha }_{k_{0}}\) and \(g_{k+1}=\left( I-\alpha _{k}A\right) g_{k}\): the step with \(\alpha _{k_{0}}=1/\lambda\) annihilates the second component of the gradient, after which the stepsize (9) equals 1 and annihilates the first component. This completes the proof.\(\square\)
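The mechanism in the proof can be checked numerically. In the sketch below (with \(\psi (A)=I\) and an illustrative \(2\times 2\) example), the step with \(\tilde{\alpha }_{k} = 1/\lambda\) removes the gradient component along the larger eigenvalue, and the following step with the stepsize (9) removes the remaining one.

```python
import numpy as np

lam = 7.0
A = np.diag([1.0, lam])
g = np.array([3.0, -2.0])            # any gradient with both components nonzero

# c_j = g' A^j g (psi = I), the phi's, and the root (14)
c = [g @ np.linalg.matrix_power(A, j) @ g for j in range(5)]
phi1, phi2, phi3 = c[1]*c[4] - c[2]*c[3], c[0]*c[4] - c[2]**2, c[0]*c[3] - c[1]*c[2]
alpha_tilde = (phi2 - np.sqrt(phi2**2 - 4*phi1*phi3)) / (2*phi1)
print(alpha_tilde, 1.0 / lam)        # alpha_tilde = 1/lambda, as shown in the proof

g1 = (np.eye(2) - alpha_tilde * A) @ g                # kills the second component
alpha = np.linalg.norm(g1) / np.linalg.norm(A @ g1)   # stepsize (9) with psi = I: equals 1
g2 = (np.eye(2) - alpha * A) @ g1
print(g1, g2)                        # g2 = 0: two dimensional quadratic termination
```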

In what follows, we shall prove that \(\tilde{\alpha }_{k}\) converges to \(1/\lambda _{n}\) under each method in the family (9). To this end, we first analyze the asymptotic behavior of the family (9).

For convenience, we assume without loss of generality that the matrix A is diagonal with distinct eigenvalues, i.e.

$$\begin{aligned} A=\mathrm {diag}\{\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\}, \quad 0<\lambda _{1}<\lambda _{2}<\ldots <\lambda _{n}. \end{aligned}$$
(16)

Let \(\left\{ \xi _{1}, \xi _{2}, \ldots , \xi _{n}\right\}\) be the set of orthonormal eigenvectors associated with the eigenvalues \(\left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\right\}\). Denote by \(\mu _{k}^{(i)}, i=1, \ldots , n\), the components of \(g_{k}\) along \(\xi _{i}\), i.e.

$$\begin{aligned} g_{k}=\sum _{i=1}^{n} \mu _{k}^{(i)} \xi _{i}. \end{aligned}$$
(17)

It follows from (1) and (17) that

$$\begin{aligned} g_{k+1}=\left( I-\alpha _{k}A\right) g_{k}=\prod _{j=0}^{k}\left( I-\alpha _{j} A\right) g_{0}=\sum _{i=1}^{n} \mu _{k+1}^{(i)} \xi _{i}, \end{aligned}$$
(18)

where

$$\begin{aligned} \mu _{k+1}^{(i)}=\left( 1-\alpha _{k} \lambda _{i}\right) \mu _{k}^{(i)}=\mu _{0}^{(i)} \prod _{j=0}^{k}\left( 1-\alpha _{j} \lambda _{i}\right) . \end{aligned}$$
(19)

The following lemma is useful in our analysis.

Lemma 2

[26] Let p be a vector in \(\mathbb {R}^{n}\) such that (i) \(p^{(i)} \ge 0\) for all i, with \(p^{(1)}>0\) and \(p^{(n)}>0\); (ii) \(\sum _{i=1}^{n} p^{(i)} = 1\). Further assume that \(0<\lambda _{1}<\cdots <\lambda _{n}\). Consider a transformation T such that

$$\begin{aligned} \left( Tp\right) ^{(i)}=\frac{\left( \lambda _{i}-\gamma (p)\right) ^{2} p^{(i)}}{\sum _{i}\left( \lambda _{i}-\gamma (p)\right) ^{2} p^{(i)}}, \quad \text{ where }\quad \gamma (p)=\sqrt{\sum _{i} \lambda _{i}^{2} p^{(i)}}. \end{aligned}$$

Then

$$\begin{aligned} \lim _{k \rightarrow \infty } \left( T^{k}p\right) ^{(i)}= {\left\{ \begin{array}{ll} h_{1}, &{} \text{ if } i=1, \\ 0, &{} \text{ if } i=2, \ldots , n-1, \\ h_{2}, &{} \text{ if } i=n,\end{array}\right. } \end{aligned}$$

where

$$\begin{aligned} h_{1}=\frac{\lambda _{1}+3 \lambda _{n}}{4\left( \lambda _{1}+\lambda _{n}\right) } \quad \text{ and } \quad h_{2}=\frac{3 \lambda _{1}+ \lambda _{n}}{4\left( \lambda _{1}+\lambda _{n}\right) }. \end{aligned}$$
(20)
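The limit in Lemma 2 is easy to observe numerically; the sketch below iterates T on a random probability vector for an illustrative spectrum (the eigenvalues, starting vector, and number of iterations are assumptions made only for this check).

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.sort(rng.uniform(1.0, 100.0, 10))
lam[0], lam[-1] = 1.0, 100.0                     # lambda_1 and lambda_n

p = rng.uniform(0.1, 1.0, 10)
p /= p.sum()                                     # p >= 0, sum(p) = 1, p^(1), p^(n) > 0

def T(p):
    gamma = np.sqrt(np.sum(lam**2 * p))          # gamma(p) in Lemma 2
    q = (lam - gamma)**2 * p
    return q / q.sum()

for _ in range(500):                             # iterate T^k p
    p = T(p)

h1 = (lam[0] + 3*lam[-1]) / (4*(lam[0] + lam[-1]))
h2 = (3*lam[0] + lam[-1]) / (4*(lam[0] + lam[-1]))
print(p[0], h1)      # first component tends to h1
print(p[-1], h2)     # last component tends to h2; the others tend to 0
```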

Based on Lemma 2, we are able to show that each method in the family (9) asymptotically zigzags in a two dimensional subspace, which generalizes the result in [26], where the case \(\psi (A)=I\) (i.e., the Dai–Yang method) is considered.

Theorem 2

Assume that the starting point \(x_{0}\) is such that

$$\begin{aligned} g_{0}^{T} \xi _{1} \ne 0 \quad \text{ and } \quad g_{0}^{T} \xi _{n} \ne 0. \end{aligned}$$

Let \(\{x_{k} \}\) be the sequence generated by applying a method in the family (9). Then

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\psi (\lambda _{i})\mu _{2k}^{(i)}}{\sqrt{\sum _{j=1}^{n}\left( \psi (\lambda _{j})\mu _{2k}^{(j)}\right) ^{2}}}=\left\{ \begin{array}{ll} \mathrm {sign}\left( \psi (\lambda _{1})\mu _{2k}^{(1)}\right) \sqrt{h_{1}}, &{} \text{ if } i=1, \\ 0, &{} \text{ if } i=2, \ldots , n-1, \\ \mathrm {sign}\left( \psi (\lambda _{n})\mu _{2k}^{(n)}\right) \sqrt{h_{2}}, &{} \text{ if } i=n, \end{array}\right. \end{aligned}$$
(21)

and

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\psi (\lambda _{i})\mu _{2k+1}^{(i)}}{\sqrt{\sum _{j=1}^{n}\left( \psi (\lambda _{j})\mu _{2k+1}^{(j)}\right) ^{2}}}= \left\{ \begin{array}{ll} \mathrm {sign}\left( \psi (\lambda _{1})\mu _{2k+1}^{(1)}\right) \sqrt{h_{1}}, &{} \text{ if } i=1, \\ 0, &{} \text{ if } i=2, \ldots , n-1, \\ -\mathrm {sign}\left( \psi (\lambda _{n})\mu _{2k+1}^{(n)}\right) \sqrt{h_{2}}, &{} \text{ if } i=n, \end{array}\right. \end{aligned}$$
(22)

where \(h_{1}\) and \(h_{2}\) are defined in (20). Further, the vectors

$$\begin{aligned} \frac{\psi (A)g_{k}}{\Vert \psi (A)g_{k}\Vert _{2}}+\frac{\psi (A)g_{k+1}}{\left\| \psi (A)g_{k+1}\right\| _{2}} ~~~\text {and}~~~ \frac{\psi (A)g_{k}}{\Vert \psi (A)g_{k}\Vert _{2}}-\frac{\psi (A)g_{k+1}}{\left\| \psi (A)g_{k+1}\right\| _{2}} \end{aligned}$$

tend to be the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\) of A, respectively.

Proof

By (17), we get

$$\begin{aligned} \psi (A)g_{k}=\sum _{i=1}^{n}\eta _{k}^{(i)} \xi _{i}, \end{aligned}$$
(23)

where \(\eta _{k}^{(i)} = \psi (\lambda _{i}) \mu _{k}^{(i)}\), which together with (19) yields

$$\begin{aligned} \eta _{k+1}^{(i)} = \left( 1-\alpha _{k}\lambda _{i}\right) \eta _{k}^{(i)}. \end{aligned}$$
(24)

Define the vector \(p_{k}=\left( p_{k}^{(i)}\right)\) with

$$\begin{aligned} p_{k}^{(i)}=\frac{\left( \eta _{k}^{(i)}\right) ^{2}}{\left\| \eta _{k}\right\| _{2}^{2}} \end{aligned}$$
(25)

and

$$\begin{aligned} \gamma _{k}= \alpha _{k}^{-1}=\frac{\left\| \psi (A)A g_{k}\right\| _{2}}{\left\| \psi (A)g_{k}\right\| _{2}} =\sqrt{\sum _{i} \lambda _{i}^{2} p_{k}^{(i)}}. \end{aligned}$$
(26)

Then we have from (24), (25) and (26) that

$$\begin{aligned} p_{k+1}^{(i)}=\frac{\left( \lambda _{i}-\gamma _{k}\right) ^{2} p_{k}^{(i)}}{\sum _{i}\left( \lambda _{i}-\gamma _{k}\right) ^{2} p_{k}^{(i)}}. \end{aligned}$$

Clearly, according to the definition of \(p_{k}\), we get that \(p_{k}^{(i)}\ge 0\) for all i and

$$\begin{aligned} \sum _{i} p_{k}^{(i)}=1, \quad \text{ for } \text{ all } k. \end{aligned}$$

Let \(p=p_{1} \in \mathbb {R}^{n}\). Based on the above analysis and Lemma 2, we know that \(\lim _{k \rightarrow \infty } p_{k}=\left( h_{1}, 0, \ldots , 0, h_{2}\right) ^{T}\), where \(h_{1}\) and \(h_{2}\) are given in (20). It follows from (24) and \(\lambda _{n}^{-1}<\alpha _{k}<\lambda _{1}^{-1}\) that

$$\begin{aligned} \mathrm {sign}\left( \psi (\lambda _{1})\mu _{k+1}^{(1)}\right) =\mathrm {sign}\left( \psi (\lambda _{1})\mu _{k}^{(1)}\right) ~ \end{aligned}$$
(27)

and

$$\begin{aligned} \mathrm {sign}\left( \psi (\lambda _{n})\mu _{k+1}^{(n)}\right) =-\mathrm {sign}\left( \psi (\lambda _{n})\mu _{k}^{(n)}\right) . \end{aligned}$$
(28)

Thus, by Lemma 2, (27) and (28), we know that (21) and (22) hold.

Furthermore, combining (23), (27) and (28), we find that

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\psi (A)g_{k}}{\Vert \psi (A)g_{k}\Vert _{2}}+\frac{\psi (A)g_{k+1}}{\left\| \psi (A)g_{k+1}\right\| _{2}}=2 \mathrm {sign}\left( \psi (\lambda _{1})\mu _{2k}^{(1)}\right) \sqrt{h_{1}} \xi _{1} \end{aligned}$$

and

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\psi (A)g_{k}}{\Vert \psi (A)g_{k}\Vert _{2}}-\frac{\psi (A)g_{k+1}}{\left\| \psi (A)g_{k+1}\right\| _{2}}=\pm 2 \sqrt{h_{2}} \xi _{n}. \end{aligned}$$

This completes our proof.\(\square\)

From Theorem 2, we have the following asymptotic result for the stepsize (9).

Corollary 1

Under the conditions of Theorem 2, for \(\alpha _{k}\) in (9) it holds that

$$\begin{aligned} \lim _{k \rightarrow \infty } \alpha _{k}=\frac{2}{\lambda _1+\lambda _n}. \end{aligned}$$

The next theorem shows that \(\tilde{\alpha }_{k}\) converges to \(1/\lambda _{n}\) under each method in the family (9).

Theorem 3

Under the conditions of Theorem 2, let \(\{g_{k}\}\) be the sequence generated by applying a method in the family (9) to minimize the n-dimensional quadratic function (4). Then \(\lim _{k \rightarrow \infty }\tilde{\alpha }_{k}=1/\lambda _{n}\).

Proof

From (18) and (13), we obtain

$$\begin{aligned} c_{j} = g_{k}^{T} A^{j}\psi ^{2}(A) g_{k} = \sum _{i=1}^{n} \lambda _{i}^{j}\left( \psi (\lambda _{i})\mu _{k}^{(i)}\right) ^{2}. \end{aligned}$$
(29)

When k is odd, by the definition of \(\phi _{1}\), (22) and (29), we get

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\phi _{1}}{c_{0}^{2}}&= \lim _{k \rightarrow \infty } \left( \frac{c_{1}}{c_{0}} \frac{c_{4}}{c_{0}}-\frac{c_{2}}{c_{0}} \frac{c_{3}}{c_{0}}\right) \\ &= \lim _{k \rightarrow \infty }\left[ \sum _{i=1}^{n} \lambda _{i} \frac{\left( \psi (\lambda _{i})\mu _{k}^{(i)}\right) ^{2}}{\sum _{s=1}^{n} \left( \psi (\lambda _{s})\mu _{k}^{(s)}\right) ^{2}} \cdot \sum _{i=1}^{n} \lambda _{i}^{4} \frac{\left( \psi (\lambda _{i})\mu _{k}^{(i)}\right) ^{2}}{\sum _{s=1}^{n} \left( \psi (\lambda _{s})\mu _{k}^{(s)}\right) ^{2}} \right] \\ &\quad -\lim _{k \rightarrow \infty }\left[ \sum _{i=1}^{n} \lambda _{i}^2 \frac{\left( \psi (\lambda _{i})\mu _{k}^{(i)}\right) ^{2}}{\sum _{s=1}^{n} \left( \psi (\lambda _{s})\mu _{k}^{(s)}\right) ^{2}} \cdot \sum _{i=1}^{n} \lambda _{i}^{3} \frac{\left( \psi (\lambda _{i})\mu _{k}^{(i)}\right) ^{2}}{\sum _{s=1}^{n}\left( \psi (\lambda _{s})\mu _{k}^{(s)}\right) ^{2}} \right] \\ &=\left( h_{1}\lambda _{1}+ h_{2}\lambda _{n}\right) \left( h_{1}\lambda _{1}^{4}+ h_{2}\lambda _{n}^{4}\right) -\left( h_{1}\lambda _{1}^{2}+ h_{2}\lambda _{n}^{2}\right) \left( h_{1}\lambda _{1}^{3}+ h_{2}\lambda _{n}^{3}\right) \\ &=h_{1}h_{2}\lambda _{1}\lambda _{n}\left( \lambda _{n}-\lambda _{1}\right) ^{2}\left( \lambda _{1}+\lambda _{n}\right) . \end{aligned}$$
(30)

Similarly, we have

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\phi _{2}}{c_{0}^{2}} = h_{1}h_{2}\left( \lambda _{n}-\lambda _{1}\right) ^{2}\left( \lambda _{1}+\lambda _{n}\right) ^{2} \end{aligned}$$
(31)

and

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\phi _{3}}{c_{0}^{2}} = h_{1}h_{2}\left( \lambda _{n}-\lambda _{1}\right) ^{2}\left( \lambda _{1}+\lambda _{n}\right) . \end{aligned}$$
(32)

Combining (30), (31) and (32), we obtain

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\phi _{1}}{\phi _{3}} = \lambda _{1}\lambda _{n} \quad \text { and } \quad \lim _{k \rightarrow \infty } \frac{\phi _{2}}{\phi _{3}} = \lambda _{1}+\lambda _{n}. \end{aligned}$$
(33)

It follows from (33) and (14) that \(\lim _{k \rightarrow \infty } \tilde{\alpha }_{k}=1/\lambda _{n}\). When k is even, we get the desired result in the same manner as above. This completes our proof. \(\square\)

Fig. 1  Problem (34) with \(n = 1000\): convergence history of the sequence \(\{|\tilde{\alpha }_{k}-1/\lambda _{n}|\}\) for the first 100 iterations of the MG and Dai–Yang methods

From (33), we see that the limits of \(\phi _{1}/\phi _{3}\) and \(\phi _{2}/\phi _{3}\) are independent of \(\psi (A)\). The following example shows that \(\tilde{\alpha }_{k}\) converges to \(1 / \lambda _{n}\) under both the MG and Dai–Yang methods. In particular, we applied the MG and Dai–Yang methods to the quadratic function (4) with

$$\begin{aligned} A=\mathrm {diag}\left\{ a_{1}, a_{2}, \ldots , a_{n}\right\} \quad \text{ and } \quad b=0, \end{aligned}$$
(34)

where \(a_{1}=1,~a_{n}=n\) and \(a_{i}\) was randomly generated in (1, n), \(i=2, \ldots , n-1\). The starting point was set to the vector of all ones. From Fig. 1, we see that \(\tilde{\alpha }_{k}\) approximates \(1 / \lambda _{n}\) with satisfactory accuracy after a small number of iterations under both the MG and Dai–Yang methods.
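A rough reproduction of this experiment (with \(\psi (A)=I\) in (13)–(14), the random diagonal data described above, and the all-ones starting point) might look as follows; it prints \(|\tilde{\alpha }_{k}-1/\lambda _{n}|\) at the first and last of 100 iterations for both methods.

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)
a = np.sort(np.concatenate(([1.0], rng.uniform(1.0, n, n - 2), [float(n)])))  # diag of A
x0 = np.ones(n)                                   # starting point: all ones
g0 = a * x0                                       # b = 0 and A diagonal, so g = a .* x

def alpha_tilde(g):
    """Stepsize (14) with psi(A) = I for the diagonal matrix diag(a); a sketch."""
    c = [np.sum(a**j * g**2) for j in range(5)]
    phi1, phi2, phi3 = c[1]*c[4] - c[2]*c[3], c[0]*c[4] - c[2]**2, c[0]*c[3] - c[1]*c[2]
    return (phi2 - np.sqrt(phi2**2 - 4*phi1*phi3)) / (2*phi1)

for method in ("MG", "Dai-Yang"):
    g, err = g0.copy(), []
    for k in range(100):
        err.append(abs(alpha_tilde(g) - 1.0 / a[-1]))
        if method == "MG":
            alpha = np.sum(a * g**2) / np.sum(a**2 * g**2)      # alpha_k^MG
        else:
            alpha = np.linalg.norm(g) / np.linalg.norm(a * g)   # stepsize (8)
        g = (1.0 - alpha * a) * g
    print(method, err[0], err[-1])    # the error |alpha_tilde_k - 1/lambda_n| shrinks
```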

3 A new gradient method

In this section, we propose a new algorithm for unconstrained quadratic optimization and present its R-linear convergence result.

Extensive studies indicate that adaptively choosing between a short stepsize and \(\alpha _{k}^{BB1}\) at each iteration performs numerically better than the original BB method; see for example [31, 32, 33, 28, 34, 35]. We now show that the new stepsize \(\tilde{\alpha }_{k}\) is a short one.

Theorem 4

Under the conditions of Lemma 1, it holds that \(\tilde{\alpha }_{k} < c_{2}/c_{3}\).

Proof

According to (13) and (14), we get

$$\begin{aligned} \frac{c_{2}}{c_{3}} - \tilde{\alpha }_{k} = \frac{c_{2} \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}} + c_{2}\phi _{2} - 2 c_{3} \phi _{3}}{c_{3}\left( \phi _{2} + \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) }. \end{aligned}$$
(35)

If \(c_{2}\phi _{2} - 2 c_{3} \phi _{3} > 0\), we have \(c_{2}/c_{3} - \tilde{\alpha }_{k} > 0\); otherwise, it follows from (35) that

$$\begin{aligned} \begin{aligned} \frac{c_{2}}{c_{3}} - \tilde{\alpha }_{k}&=\frac{4\left( c_{2}c_{3}\phi _{2}\phi _{3}-(c_{3}\phi _{3})^2-c_{2}^2\phi _{1}\phi _{3}\right) }{c_{3}\Delta \left( \phi _{2} + \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) }\\&=\frac{4\phi _{3}^{2}\left( c_{2}c_{4}-c_{3}^2\right) }{c_{3}\Delta \left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) }, \end{aligned} \end{aligned}$$

where the last equality is due to \(c_{3}\phi _{2} = c_{2}\phi _{1} +c_{4} \phi _{3}\) and \(\Delta = c_{2} \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}} -\left( c_{2}\phi _{2} - 2 c_{3}\phi _{3}\right) >0\). From \(\phi _{1}, \phi _{3}>0\) and \(c_{2}c_{4}-c_{3}^2 > 0\), we know that \(\tilde{\alpha }_{k}<c_{2}/c_{3}\) holds. This completes our proof.\(\square\)

Based on the above analysis, we can develop a gradient method that uses \(\tilde{\alpha }_{k}\) and \(\alpha _{k}^{BB1}\) in an adaptive way. Notice that reusing a retarded short stepsize for several iterations can reduce the computational cost and yield better performance, see [29, 36, 25, 13, 7]. So we suggest combining the adaptive and cyclic schemes with \(\alpha _{k}^{BB1}\) and \(\tilde{\alpha }_{k-1}\). In particular, our method reuses \(\tilde{\alpha }_{k-1}\) for r iterations when \(\alpha _{k}^{\mathrm {BB} 2} / \alpha _{k}^{\mathrm {BB} 1}<\tau\) for some \(\tau \in (0,1)\); otherwise, we set \(\alpha _{k} = \alpha _{k}^{BB1}\). We use t as an index to keep track of the number of short stepsizes chosen during the iterative process. The stepsize for our algorithm is summarized as

$$\begin{aligned} \alpha _{k}= {\left\{ \begin{array}{ll} \tilde{\alpha }_{k-1}, &{} \text{ if } \bmod (t, r)=0~\text {and} ~\alpha _{k}^{\mathrm {BB} 2} / \alpha _{k}^{\mathrm {BB} 1}<\tau ; \\ \alpha _{k}^{\mathrm {BB} 1}, &{} \text{ if } \bmod (t, r)=0~\text {and} ~\alpha _{k}^{\mathrm {BB} 2} / \alpha _{k}^{\mathrm {BB} 1} \ge \tau ; \\ \alpha _{k-1}, &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$
(36)

We mention that the parameter \(\tau\) can be chosen dynamically as in [24]. However, in our tests the dynamic scheme does not show a clear advantage over the fixed one above. Our method is formally presented in Algorithm 1.

Algorithm 1 (pseudocode figure)
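Since the pseudocode is available only as a figure, the following Python sketch shows one way the iteration with the stepsize rule (36) might be organized for \(\psi (A)=I\). The first stepsize (a plain SD step) and the exact update of the counter t are assumptions based on the surrounding text rather than on the figure itself.

```python
import numpy as np

def algorithm1_sketch(A, b, x0, tau=0.3, r=5, tol=1e-6, max_iter=20000):
    """Sketch of Algorithm 1 with psi(A) = I: adaptive/cyclic switching between
    alpha^BB1 and the retarded short stepsize alpha_tilde according to (36)."""
    x = x0.copy()
    g = A @ x - b
    g0_norm = np.linalg.norm(g)
    alpha = (g @ g) / (g @ (A @ g))       # first stepsize: SD (an assumed initialization)
    t = 0                                  # counts short stepsizes chosen so far
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * g0_norm:
            return x, k
        # alpha_tilde_k from (13)-(14); two extra products here, which the
        # recurrences for c_3 and c_4 given in Sect. 4 avoid.
        Ag = A @ g
        A2g = A @ Ag
        c0, c1, c2, c3, c4 = g @ g, g @ Ag, Ag @ Ag, Ag @ A2g, A2g @ A2g
        phi1, phi2, phi3 = c1*c4 - c2*c3, c0*c4 - c2**2, c0*c3 - c1*c2
        alpha_tilde = (phi2 - np.sqrt(phi2**2 - 4*phi1*phi3)) / (2*phi1)
        # take the step with the current stepsize
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        bb1, bb2 = (s @ s) / (s @ y), (s @ y) / (y @ y)
        # choose the stepsize for the next iteration according to (36)
        if t % r == 0:
            if bb2 / bb1 < tau:
                alpha = alpha_tilde        # the retarded short stepsize
                t += 1
            else:
                alpha = bb1                # t unchanged: the test is repeated next time
        else:
            t += 1                         # keep the previous short stepsize (alpha unchanged)
        x, g = x_new, g_new
    return x, max_iter
```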

To establish R-linear convergence of Algorithm 1, we first show the boundedness of \(\tilde{\alpha }_{k}\).

Let us denote

$$\begin{aligned} f_{1}(\alpha )= \frac{g_{k}^{T} \left( I-\alpha A\right) ^{T}\psi ^{2}(A)\left( I-\alpha A\right) g_{k}}{g_{k}^{T}\left( I-\alpha A\right) ^{T} \psi ^{2}(A) A \left( I-\alpha A\right) g_{k}} \end{aligned}$$
(37)

and

$$\begin{aligned} f_{2}(\alpha ) = \frac{g_{k}^{T} \left( I-\alpha A\right) ^{T}\psi ^{2}(A) A \left( I-\alpha A\right) g_{k}}{g_{k}^{T}\left( I-\alpha A\right) ^{T} \psi ^{2}(A) A^{2} \left( I-\alpha A\right) g_{k}}. \end{aligned}$$
(38)

Using the same arguments as for (12) and Lemma 1, we know that both \(f_{1}'(\alpha )=0\) and \(f_{2}'(\alpha )=0\) have two distinct real roots. In addition, the roots of \(f_{1}'(\alpha )=0\), say \(\tilde{\beta }_{k}\) and \(\hat{\beta }_{k}\) with \(\tilde{\beta }_{k}\le \hat{\beta }_{k}\), are the solutions of

$$\begin{aligned} \phi _{4}\beta ^2-\phi _{3}\beta +\phi _{5}=0, \end{aligned}$$
(39)

and the roots of \(f_{2}'(\alpha )=0\), say \(\tilde{\gamma }_{k}\) and \(\hat{\gamma }_{k}\) with \(\tilde{\gamma }_{k}\le \hat{\gamma }_{k}\), satisfy

$$\begin{aligned} \phi _{6}\gamma ^2-\phi _{1}\gamma +\phi _{4}=0, \end{aligned}$$
(40)

where \(\phi _{1}\), \(\phi _{3}\) are defined in the previous section, and \(\phi _{4}=c_{1} c_{3}-c_{2}^{2}\), \(\phi _{5}=c_{0} c_{2}-c_{1}^{2}\) and \(\phi _{6}=c_{2} c_{4}-c_{3}^{2}\). Note that \(F(\alpha )=f_{1}(\alpha ) f_{2}(\alpha )\). Since \(f_{1}(\alpha )>0\) and \(f_{2}(\alpha )>0\), any root of \(F'(\alpha ) = f_{1}'(\alpha )f_{2}(\alpha ) + f_{1}(\alpha )f_{2}'(\alpha )=0\) yields

$$\begin{aligned} f_{1}'(\alpha )=f_{2}'(\alpha )=0\quad \text {or} \quad f_{1}'(\alpha ) f_{2}'(\alpha )<0. \end{aligned}$$

If the root is such that \(f_{1}'(\alpha )\le 0\) and \(f_{2}'(\alpha )\ge 0\), we must have

$$\begin{aligned} \alpha \in [\tilde{\beta }_{k},\hat{\beta }_{k}] \quad \text {and} \quad \alpha \in (-\infty ,\tilde{\gamma }_{k}]\cup [\hat{\gamma }_{k},+\infty ). \end{aligned}$$
(41)

Similarly, for the case \(f_{1}'(\alpha )\ge 0\) and \(f_{2}'(\alpha )\le 0\), we get

$$\begin{aligned} \alpha \in [\tilde{\gamma }_{k},\hat{\gamma }_{k}] \quad \text {and} \quad \alpha \in (-\infty ,\tilde{\beta }_{k}]\cup [\hat{\beta }_{k},+\infty ). \end{aligned}$$
(42)

Now we are ready to prove that \(\tilde{\alpha }_{k}\) is bounded between \(1/\lambda _{n}\) and \(1/\lambda _{1}\).

Theorem 5

Under the conditions of Lemma 1, it holds that \(1/\lambda _{n}\le \tilde{\alpha }_{k} \le 1/\lambda _{1}\).

Proof

Since \(F'(\tilde{\alpha }_{k}) =0\), when \(f_{1}'(\tilde{\alpha }_{k})\le 0\), it follows from (41) that

$$\begin{aligned} \tilde{\beta }_{k}\le \tilde{\alpha }_{k} \le \min \{\tilde{\gamma }_{k},\hat{\beta }_{k}\} \quad \text {or}\quad \max \{\hat{\gamma }_{k},\tilde{\beta }_{k}\} \le \tilde{\alpha }_{k} \le \hat{\beta }_{k}. \end{aligned}$$

For the case \(f_{1}'(\tilde{\alpha }_{k})\ge 0\), by (42), we get

$$\begin{aligned} \tilde{\gamma }_{k} \le \tilde{\alpha }_{k} \le \min \{\tilde{\beta }_{k},\hat{\gamma }_{k}\} \quad \text {or}\quad \max \{\hat{\beta }_{k},\tilde{\gamma }_{k}\} \le \tilde{\alpha }_{k} \le \hat{\gamma }_{k}. \end{aligned}$$

Thus, we only need to prove that \(1/\lambda _{n}\le \hat{\beta }_{k},~ \tilde{\beta }_{k},~\hat{\gamma }_{k},~ \tilde{\gamma }_{k} \le 1/\lambda _{1}\).

To proceed, we first show that

$$\begin{aligned} \hat{\beta }_{k}=f_{1}(\tilde{\beta }_{k}). \end{aligned}$$
(43)

Since \(f_{1}'(\tilde{\beta }_{k}) = 0\), we have

$$\begin{aligned} (c_{1}-2 c_{2}\tilde{\beta }_{k}+ c_{3}\tilde{\beta }_{k}^{2})(-c_{1}+ c_{2}\tilde{\beta }_{k})-(c_{0}-2 c_{1}\tilde{\beta }_{k}+ c_{2}\tilde{\beta }_{k}^{2})(-c_{2}+ c_{3}\tilde{\beta }_{k})=0, \end{aligned}$$

which implies that

$$\begin{aligned} f_{1}(\tilde{\beta }_{k})&=\frac{c_{0}-2 c_{1}\tilde{\beta }_{k}+c_{2}\tilde{\beta }_{k}^{2} }{c_{1}-2 c_{2} \tilde{\beta }_{k}+ c_{3}\tilde{\beta }_{k}^{2}}=\frac{c_{2} \tilde{\beta }_{k}-c_{1}}{c_{3} \tilde{\beta }_{k}-c_{2}}\\&=\hat{\beta }_{k} \frac{c_{2} \tilde{\beta }_{k}-c_{1}}{c_{3} \hat{\beta }_{k} \tilde{\beta }_{k}-c_{2} \hat{\beta }_{k}}=\hat{\beta }_{k} \frac{c_{2} \tilde{\beta }_{k}-c_{1}}{c_{3} \frac{\phi _{5}}{\phi _{4}}-c_{2} \hat{\beta }_{k}}, \end{aligned}$$

where the last equality comes from the fact that \(\tilde{\beta }_{k}\hat{\beta }_{k}=\phi _{5}/\phi _{4}\).

In order to prove (43), it suffices to show that

$$\begin{aligned} c_{2} \tilde{\beta }_{k}-c_{1} = c_{3} \frac{\phi _{5}}{\phi _{4}}-c_{2} \hat{\beta }_{k}, \end{aligned}$$

that is,

$$\begin{aligned} c_{2}\phi _{4} (\tilde{\beta }_{k}+\hat{\beta }_{k}) =c_{1}\phi _{4}+ c_{3}\phi _{5}, \end{aligned}$$

which holds due to \(\tilde{\beta }_{k}+\hat{\beta }_{k}=\phi _{3}/\phi _{4}\) and the definitions of \(c_{1}, c_{2}, c_{3}, \phi _{3}, \phi _{4}, \phi _{5}\).

Thus, (43) is true. It follows from the definition of \(f_1\) and the Rayleigh quotient property that \(1/\lambda _{n} \le \hat{\beta }_{k}=f_{1}(\tilde{\beta }_{k})\le 1/\lambda _{1}\).

Using the same arguments as above, we get \(1/\lambda _{n} \le \tilde{\beta }_{k},~\tilde{\gamma }_{k},~\hat{\gamma }_{k} \le 1/\lambda _{1}\). This completes the proof. \(\square\)

In [37], the authors prove that any gradient method with stepsizes satisfying the following Property B has an R-linear convergence rate of \(1-\lambda _{1}/M_1\), which implies a \(1-1/\kappa\) rate when \(M_1\le \lambda _n\). Similar results for gradient methods satisfying Property A of [38] can be found in [39]. However, a stepsize satisfying Property B may not meet the conditions of Property A.

Property B: We say that the stepsize \(\alpha _{k}\) has Property B if there exist an integer m and a positive constant \(M_{1} \ge \lambda _{1}\) such that

(i) \(\lambda _{1} \le \alpha _{k}^{-1} \le M_{1}\);

(ii) for some real analytic function \(\psi\) defined as in (7) and \(v(k) \in \{k,k-1,\ldots ,\max \{k-m+1,0\}\}\),

$$\begin{aligned} \alpha _{k} \le \frac{g_{v(k)}^{T} \psi ^{2}(A) g_{v(k)}}{g_{v(k)}^{T} A\psi ^{2}(A) g_{v(k)}}. \end{aligned}$$
(44)

Clearly, \(\alpha _k^{BB1}\) satisfies Property B with \(M_1=\lambda _n\), \(m=2\) and \(\psi ^{2}(A)=I\); by Theorems 4 and 5, the new stepsize \(\tilde{\alpha }_{k}\) also satisfies Property B with \(M_1=\lambda _n\). So we are able to show R-linear convergence of Algorithm 1 with the \(1-1/\kappa\) rate, as in Theorem 2 of [37]. For completeness, we include the proof here.

Theorem 6

Suppose that the sequence \(\left\{ g_{k}\right\}\) is generated by Algorithm 1. Then either \(g_{k} = 0\) for some finite k or the sequence \(\left\{ \left\| g_{k}\right\| _{2}\right\}\) converges to zero R-linearly in the sense that

$$\begin{aligned} |g_{k}^{(i)}| \le C_{i}\theta ^{k}, \qquad i=1,2, \ldots , n, \end{aligned}$$
(45)

where \({\theta }=1-1/\kappa\) and

$$\begin{aligned} \left\{ \begin{array}{l} C_{1}=|g_{0}^{(1)}|; \\ C_{i}=\displaystyle \max \left\{ |g_{0}^{(i)}|,\frac{|g_{1}^{(i)}|}{\theta }, \ldots , \frac{|g_{r-1}^{(i)}|}{\theta ^{r-1}},\frac{\max \{\sigma _{i}, \sigma _{i}^{r}\}}{\theta ^{r}\psi (\lambda _{i})}\sqrt{\sum _{j=1}^{i-1}\psi ^{2}(\lambda _{j}) C_{j}^{2}}\right\} ,\\ \quad i = 2,3, \ldots n, \end{array}\right. \end{aligned}$$

with \(\sigma _{i}=\max \left\{ \frac{\lambda _{i}}{\lambda _{1}}-1,1-\frac{\lambda _{i}}{\lambda _{n}}\right\}\).

Proof

It follows from \(g_{k+1}^{(i)}=\left( 1-\alpha _{k} \lambda _{i}\right) g_{k}^{(i)}\) and (i) of Property B that

$$\begin{aligned} \left| g_{k+1}^{(i)}\right| \le \sigma _{i}\left| g_{k}^{(i)}\right| , \quad i=1,2, \ldots , n. \end{aligned}$$
(46)

Clearly, (45) holds for \(i=1\).

In what follows, we prove (45) by induction on i. Assume that (45) holds for all \(1 \le i \le L-1\) with \(L \in \{2, \ldots , n\}\). When \(i=L\), it follows from the definition of \(C_{L}\) that (45) holds for \(k = 0, 1,\ldots , r-1\). Assume by contradiction that (45) does not hold for some \(k\ge r\) when \(i=L\). Let \(\hat{k}\ge r\) be the minimal index such that \(\left| g_{\hat{k}}^{(L)}\right| > C_{L} \theta ^{\hat{k}}\). Then, if \(\alpha _{\hat{k}-1} \lambda _{L} \le 1\), we get

$$\begin{aligned} \left| g_{\hat{k}}^{(L)}\right| =\left( 1-\alpha _{\hat{k}-1} \lambda _{L}\right) \left| g_{\hat{k}-1}^{(L)}\right| \le \theta \left| g_{\hat{k}-1}^{(L)}\right| \le C_{L} \theta ^{\hat{k}}, \end{aligned}$$

which contradicts our assumption. Thus we must have \(\alpha _{\hat{k}-1} \lambda _{L} > 1\), which combined with (46) and \(\theta <1\) gives

$$\begin{aligned} \psi (\lambda _{L})|g_{\hat{k}-j}^{(L)}|&\ge \frac{\psi (\lambda _{L})|g_{\hat{k}}^{(L)}|}{{\sigma }_{L}^{j}}>\frac{\psi (\lambda _{L})C_{L} \theta ^{\hat{k}}}{{\sigma }_{L}^{j}} \ge \frac{\psi (\lambda _{L})C_{L} \theta ^{\hat{k}}}{\max \{\sigma _{L}, \sigma _{L}^{r}\}}\\&\ge \theta ^{\hat{k}-r}\sqrt{\sum _{i=1}^{L-1}\psi ^{2}(\lambda _{i}) C_{i}^{2}} \ge \theta ^{\hat{k}-r}\sqrt{ \sum _{i=1}^{L-1}\left( \psi (\lambda _{i})g_{\hat{k}-j}^{(i)}\right) ^{2} \theta ^{2(j-\hat{k})}} \\&=\theta ^{j-r} \sqrt{G(\hat{k}-j, L-1)} \ge \sqrt{G(\hat{k}-j, L-1)},\quad j \in [1, r], \end{aligned}$$

where \(G(k, l)=\sum _{i=1}^{l}\left( \psi \left( \lambda _{i}\right) g_{k}^{(i)}\right) ^{2}\). This together with Theorem 4 yields that

$$\begin{aligned} \left| 1-\tilde{\alpha }_{\hat{k}-1}\lambda _{L}\right|&=\tilde{\alpha }_{\hat{k}-1} \lambda _{L}-1 \le \frac{\left( \psi (A) g_{\hat{k}-1}\right) ^{T}A^2\left( \psi (A) g_{\hat{k}-1}\right) }{\left( \psi (A) g_{\hat{k}-1}\right) ^{T} A^3\left( \psi (A) g_{\hat{k}-1}\right) } \lambda _{L}-1 \\&=\frac{\sum _{i=1}^{n}\left( \lambda _{L}-\lambda _{i}\right) \lambda _{i}^2\left( \psi \left( \lambda _{i}\right) g_{\hat{k}-1}^{(i)}\right) ^{2}}{\sum _{i=1}^{n} \lambda _{i}^3\left( \psi \left( \lambda _{i}\right) g_{\hat{k}-1}^{(i)}\right) ^{2}} \\&\le \frac{\left( \lambda _{L}-\lambda _{1}\right) \lambda _{1}^2 G(\hat{k}-1, L-1)}{\lambda _{1}^3 G(\hat{k}-1, L-1)+\lambda _{L}^3\left( \psi \left( \lambda _{L}\right) g_{\hat{k}-1}^{(L)}\right) ^{2}} \\&\le \frac{\lambda _{L}-\lambda _{1}}{\lambda _{L}+\lambda _{1}} < \theta . \end{aligned}$$

Using similar arguments, we can show that \(\left| 1-\alpha _{\hat{k}}^{BB1}\lambda _{L}\right| \le \theta\). Thus, \(\left| g_{\hat{k}}^{(L)}\right| \le \theta \left| g_{\hat{k}-1}^{(L)}\right| \le C_{L} \theta ^{\hat{k}}\), which contradicts our assumption. Hence (45) holds for all i. This completes the proof. \(\square\)

4 Numerical experiments

In this section, we provide numerical experiments of Algorithm 1 on solving quadratic problems. All our codes are written in MATLAB R2016b and run on a PC with an AMD Ryzen 5 PRO 2500U 2.00 GHz processor and 8 GB of RAM under Windows 10.

For Algorithm 1, we use \(\tilde{\alpha }_{k}\) with \(\psi (A) =I\). If we compute the vector \(z = Ag_{k}\) at each iteration and keep \(w = Ag_{k-1}\) in memory, the quantities needed for \(\tilde{\alpha }_{k-1}\) can be computed as

$$\begin{aligned} c_{0}&=g_{k-1}^{T} g_{k-1}, \quad c_{1}=g_{k-1}^{T} w, \quad c_{2}=w^{T} w,\\ c_{3}&=\frac{g_{k}^{T} z-c_{1}+2 \alpha _{k-1} c_{2}}{\alpha _{k-1}^{2}}, \quad c_{4}=\frac{z^{T} z-c_{2}+2 \alpha _{k-1} c_{3}}{\alpha _{k-1}^{2}}. \end{aligned}$$

Hence only one matrix-vector product per iteration is required for Algorithm 1.
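In code, the recurrences above might be organized as follows (in Python rather than the MATLAB implementation used for the experiments); the function assumes the surrounding iteration supplies \(g_{k-1}\), \(g_{k}\), \(w = Ag_{k-1}\), \(z = Ag_{k}\) and \(\alpha _{k-1}\).

```python
import numpy as np

def tilde_alpha_retarded(g_prev, g, w, z, alpha_prev):
    """alpha_tilde_{k-1} from (14) with psi(A) = I, using only quantities that
    are already available: g_{k-1}, g_k, w = A g_{k-1}, z = A g_k, alpha_{k-1}.
    A sketch of the recurrences given in the text."""
    c0 = g_prev @ g_prev
    c1 = g_prev @ w
    c2 = w @ w
    c3 = (g @ z - c1 + 2.0 * alpha_prev * c2) / alpha_prev**2
    c4 = (z @ z - c2 + 2.0 * alpha_prev * c3) / alpha_prev**2
    phi1 = c1 * c4 - c2 * c3
    phi2 = c0 * c4 - c2**2
    phi3 = c0 * c3 - c1 * c2
    return (phi2 - np.sqrt(phi2**2 - 4.0 * phi1 * phi3)) / (2.0 * phi1)
```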

First, we compare Algorithm 1 with the BB1 [8], DY [23], ABBmin2 [28], SDC [29] and MGC [7] methods and the Alg.1 method in [25] (denoted by SL) on the following quadratic problem [22]:

$$\begin{aligned} f(x)=\left( x-x^{*}\right) ^{T}V\left( x-x^{*}\right) , \end{aligned}$$
(47)

where \(V = \mathrm {diag}\left\{ v_{1}, \ldots , v_{n}\right\}\) is a diagonal matrix and \(x^{*}\) is randomly generated with components in \([-10,10]\). Five different distributions of the diagonal elements \(v_{j}, j=1,2, \ldots , n\), are generated, see Table 1.

The parameter m in SL is set to 5 for the second problem set and to 6 for the other sets to obtain good performance. For ABBmin2, \(\tau\) is set to 0.9 as suggested in [28]. For SDC and MGC, the parameter pairs (h, s) and \((d_{1}, d_{2})\) are both set to (8, 6), which is more efficient than other choices on this test. The parameters in Algorithm 1 are set to \(\tau = 0.3\) and \(r = 5\).

For all comparison methods, the stopping condition is \(\left\| g_{k}\right\| _{2} \le \epsilon \left\| g_{0}\right\| _{2}\), where \(\epsilon >0\) is a given tolerance. The problem dimension is set to \(n = 1000\). For each problem set, three different tolerance parameters \(\epsilon = 10^{-6}, 10^{-9}, 10^{-12}\) and condition numbers \(\kappa = 10^{4}\), \(10^{5}, 10^{6}\) are tested. For each value of \(\kappa\) or \(\epsilon\), 10 starting points are randomly generated in \([-10, 10]\) and the average results are presented in Table 2.
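Since Table 1 itself is not reproduced here, the sketch below generates one illustrative instance of problem (47) in the described setting; the log-uniform spectral distribution is an assumption made for this example and is not necessarily one of the distributions listed in Table 1.

```python
import numpy as np

def make_problem_47(n=1000, kappa=1e4, seed=0):
    """One instance of (47): f(x) = (x - x*)' V (x - x*) with diagonal V.
    v_1 = 1, v_n = kappa, remaining v_j log-uniform in [1, kappa] (assumed)."""
    rng = np.random.default_rng(seed)
    v = np.exp(rng.uniform(0.0, np.log(kappa), n))
    v[0], v[-1] = 1.0, kappa
    x_star = rng.uniform(-10.0, 10.0, n)       # components of x* in [-10, 10]
    x0 = rng.uniform(-10.0, 10.0, n)           # random starting point in [-10, 10]
    return v, x_star, x0

v, x_star, x0 = make_problem_47()
g0 = 2.0 * v * (x0 - x_star)                   # gradient of (47) at x0
print(v.max() / v.min(), np.linalg.norm(g0))   # condition number and ||g_0||
```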

We see that Algorithm 1 clearly outperforms the BB1, DY, SDC and MGC methods. Compared with the SL method, Algorithm 1 is usually faster when high accuracy is required. Moreover, Algorithm 1 often performs better than ABBmin2 on the second to fourth problem sets and is very competitive with ABBmin2 on the first and last problem sets.

Table 1 Distributions of \(v_{j}\)
Table 2 The average number of iterations required by Algorithm 1, the BB1, DY, SL, SDC, MGC and ABBmin2 methods on quadratic problem (47) with spectral distributions in Table 1
Table 3 Test problems from Suitesparse Matrix Collection
Table 4 The average number of iterations required by Algorithm 1, the BB1, DY, SL, SDC, MGC and ABBmin2 methods on problem (4) with A given by Table 3
Table 5 The CPU time required by Algorithm 1, the BB1, DY, SL, SDC, MGC and ABBmin2 methods on problem (4) with A given by Table 3

Next, we compare the above methods on problem (4), where \(b = Ax^{*}\) and \(x^{*}\) is a random vector as before. We test 14 sparse matrices from the SuiteSparse Matrix Collection [30] listed in Table 3. The iteration stops when \(\left\| g_{k}\right\| _{2} \le 10^{-6}\left\| g_{0}\right\| _{2}\). In our test, we choose the parameters so that each method achieves the best performance. Specifically, we set \(m = 4\) for SL, \((h, s) = (3, 4)\) for SDC, \((d_{1},d_{2}) = (4,4)\) for MGC, \(\tau = 0.3\) for ABBmin2 and \(\tau = 0.1~\text { and }~r = 5\) for Algorithm 1 .

For each matrix, ten initial points with components randomly generated in \([-10, 10]\) are used. The average numbers of iterations of the compared methods are listed in Table 4. We see that, for most matrices, Algorithm 1 performs better than BB1, DY, SL, MGC and SDC and is competitive with ABBmin2 in terms of the number of iterations. Table 5 lists the average CPU time in seconds for those methods. We observe that Algorithm 1 takes less CPU time than BB1, DY, MGC and ABBmin2, and is as fast as SL and SDC. One important reason for this is that our method reuses the stepsize \(\tilde{\alpha }_{k-1}\) for r iterations, which reduces the computational cost.

5 Conclusions and discussions

For a uniform analysis, we considered a family of gradient methods whose stepsize is provided by \(\alpha _{k}\) in (9), which includes the Dai–Yang method (8) as a special case. It is proved that each method in the family asymptotically zigzags between two directions in a subspace spanned by the two eigenvectors corresponding to the smallest and largest eigenvalues of the Hessian. In order to achieve the two dimensional quadratic termination of the family, we derived a short stepsize \(\tilde{\alpha }_{k}\) (14) that converges to \(1/\lambda _{n}\). By using the long BB stepsize and \(\tilde{\alpha }_{k-1}\) in a new adaptive cyclic manner, we designed Algorithm 1 for unconstrained quadratic optimization. We proved that Algorithm 1 converges R-linearly at a rate of \(1-1/\kappa\). Our numerical results on minimizing quadratic functions indicate the efficiency of Algorithm 1 over other recent successful gradient methods.

By using the same arguments as in the proof of Theorem 1, we find that the stepsizes \(\tilde{\beta }_{k}\) and \(\tilde{\gamma }_{k}\) equip (37) and (38), respectively, with the two dimensional quadratic termination property. For the n dimensional quadratic problem, by Theorem 1 in [5], we obtain

$$\begin{aligned} \lim _{k \rightarrow \infty } \tilde{\beta }_{k}=\frac{1}{\lambda _{n}},~\lim _{k \rightarrow \infty } \tilde{\gamma }_{k}=\frac{1}{\lambda _{n}} \end{aligned}$$

and

$$\begin{aligned} \lim _{k \rightarrow \infty } \hat{\beta }_{k}=\frac{1}{\lambda _{1}},~\lim _{k \rightarrow \infty } \hat{\gamma }_{k}=\frac{1}{\lambda _{1}}. \end{aligned}$$

In addition, we have \(\lim _{k \rightarrow \infty } \hat{\alpha }_{k}=1/\lambda _{1}\). However, \(\hat{\alpha }_{k}\), \(\hat{\beta }_{k}\) and \(\hat{\gamma }_{k}\) would not be good approximations of \(1/\lambda _{1}\); see [36] for more details. It is worth noting that the stepsize proposed in [28] is a special case of \(\tilde{\beta }_{k}\) with \(\psi (A)=I\). Moreover, it is not difficult to prove that \(\tilde{\beta }_{k}\) and \(\tilde{\gamma }_{k}\) are short stepsizes in the sense that \(\tilde{\beta }_{k}<c_{2}/c_{3}\) and \(\tilde{\gamma }_{k}<c_{1}/c_{2}\). Hence, we can replace \(\tilde{\alpha }_{k-1}\) in Algorithm 1 by \(\tilde{\beta }_{k-1}\) or \(\tilde{\gamma }_{k-1}\), which leads to two variants of Algorithm 1. Preliminary experimental results show that the two variants are competitive with Algorithm 1. Furthermore, we can obtain the same convergence results as Theorem 6.

The results of this paper show that the two dimensional quadratic termination property is useful for designing efficient gradient methods. It would be interesting to develop new algorithms for solving general unconstrained optimization problems based on such a property. We leave this as our future work.

6 Declarations

All data generated or analyzed during this study are included in this published article and are also available from the corresponding author on reasonable request.