Abstract
The quadratic termination property is important to the efficiency of gradient methods. We consider equipping a family of gradient methods, where the stepsize is given by the ratio of two norms, with two dimensional quadratic termination. Such a desired property is achieved by cooperating with a new stepsize which is derived by maximizing the stepsize of the considered family in the next iteration. It is proved that each method in the family will asymptotically alternate in a two dimensional subspace spanned by the eigenvectors corresponding to the largest and smallest eigenvalues. Based on this asymptotic behavior, we show that the new stepsize converges to the reciprocal of the largest eigenvalue of the Hessian. Furthermore, by adaptively taking the long Barzilai–Borwein stepsize and reusing the new stepsize with retard, we propose an efficient gradient method for unconstrained quadratic optimization. We prove that the new method is R-linearly convergent with a rate of \(1-1/\kappa\), where \(\kappa\) is the condition number of Hessian. Numerical experiments show the efficiency of our proposed method.
1 Introduction
The gradient method is well known for minimizing a smooth function \(f(x): \mathbb {R}^{n} \rightarrow \mathbb {R}\); it updates the iterates by
$$\begin{aligned} x_{k+1}=x_{k}-\alpha _{k} g_{k}, \end{aligned}$$(1)
where \(g_{k} = \nabla f\left( x_{k}\right)\) and the stepsize \(\alpha _{k}>0\) depends on the method under consideration. The classic steepest descent (SD) [1] and minimal gradient (MG) [2] methods determine \(\alpha _{k}\) by minimizing \(f\left( x_{k}-\alpha g_{k}\right)\) and \(\left\| g\left( x_{k}-\alpha g_{k}\right) \right\| _2\), respectively. Theoretically, the SD and MG methods will asymptotically zigzag between two directions, which often yields poor performance on many problems [3,4,5,6,7].
In 1988, Barzilai and Borwein [8] proposed the following two novel stepsizes from the view of quasi-Newton methods,
$$\begin{aligned} \alpha _{k}^{BB1}=\frac{s_{k-1}^{T} s_{k-1}}{s_{k-1}^{T} y_{k-1}} \end{aligned}$$(2)
and
$$\begin{aligned} \alpha _{k}^{BB2}=\frac{s_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}, \end{aligned}$$(3)
where \(s_{k-1}=x_{k}-x_{k-1}\) and \(y_{k-1}=g_{k}-g_{k-1}\). Apparently, \(\alpha _{k}^{BB1} \ge \alpha _{k}^{BB2}\) follows from the Cauchy–Schwarz inequality if \(s_{k-1}^{T}y_{k-1} > 0\). When f(x) is a quadratic function
$$\begin{aligned} f(x)=\frac{1}{2} x^{T} A x-b^{T} x, \end{aligned}$$(4)
where \(A\in \mathbb {R}^{n\times n}\) is a real symmetric positive definite matrix and \(b\in \mathbb {R}^{n}\), \(\alpha _{k}^{B B 1}\) and \(\alpha _{k}^{B B 2}\) can be regarded as the SD and MG stepsizes with retard, respectively, i.e.
$$\begin{aligned} \alpha _{k}^{BB1}=\frac{g_{k-1}^{T} g_{k-1}}{g_{k-1}^{T} A g_{k-1}}=\alpha _{k-1}^{SD} \quad \text{ and } \quad \alpha _{k}^{BB2}=\frac{g_{k-1}^{T} A g_{k-1}}{g_{k-1}^{T} A^{2} g_{k-1}}=\alpha _{k-1}^{MG}. \end{aligned}$$
Barzilai and Borwein [8] proved that the BB method is R-superlinearly convergent for two dimensional strictly convex quadratic functions. It has been shown that the BB method is globally convergent [9] with an R-linear rate [10] in the n dimensional case. Although the BB method is nonmonotone, extensive numerical results indicate that it performs much better than the SD method [11,12,13]. See [14,15,16,17,18,19,20,21] for more BB-like methods.
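As a concrete illustration, the following sketch (our own code, not the paper's) applies the BB1 method to a diagonal quadratic; for quadratics the BB1 stepsize equals the SD stepsize evaluated at the previous gradient, which is what the code uses.

```python
# Illustrative sketch (not the paper's code): the BB1 method on a diagonal
# quadratic f(x) = 0.5 x^T A x with A = diag(diag_a); the minimizer is x* = 0.

def bb1_quadratic(diag_a, x0, tol=1e-8, max_iter=1000):
    x = list(x0)
    g = [a * xi for a, xi in zip(diag_a, x)]          # gradient g = A x
    g0_norm = sum(gi * gi for gi in g) ** 0.5
    g_prev = None
    for k in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 <= tol * g0_norm:
            return x, k
        if g_prev is None:
            # first iteration: exact steepest-descent stepsize g^T g / g^T A g
            ref = g
        else:
            # BB1 = s^T s / s^T y, which for quadratics equals the SD stepsize
            # with retard: g_{k-1}^T g_{k-1} / g_{k-1}^T A g_{k-1}
            ref = g_prev
        alpha = sum(gi * gi for gi in ref) / sum(a * gi * gi for a, gi in zip(diag_a, ref))
        g_prev = g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        g = [a * xi for a, xi in zip(diag_a, x)]
    return x, max_iter

x_min, iters = bb1_quadratic([float(i) for i in range(1, 11)], [1.0] * 10)
print(iters)   # converges well before max_iter
```

Despite the nonmonotone behavior of \(\Vert g_k\Vert\), the iteration count is typically far smaller than that of the SD method on the same problem.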
In [22], Yuan derived a new stepsize, which together with the SD method produces the minimizer of a two dimensional strictly convex quadratic function in three iterations. In what follows, if a method can give the exact minimizer of a two dimensional convex quadratic function within finitely many iterations, we say it has the property of two dimensional quadratic termination. Based on the following variant of the Yuan stepsize,
Dai and Yuan [23] suggested the so-called Dai–Yuan (DY) gradient method with
Clearly, \(\alpha _{k}^{DY}\le \min \{\alpha _{k}^{SD},\alpha _{k-1}^{SD}\}\), which implies that the DY method is monotone. Interestingly, the DY method can even outperform the nonmonotone BB method. Recently, Huang et al. [5] derived a new stepsize, say \(\alpha _{k}^{H}\), such that the gradient method
together with \(\alpha _{k}^{H}\) achieves the two dimensional quadratic termination, where \(\psi\) is a real analytic function on \([\lambda _{1},\lambda _{n}]\) and can be expressed by Laurent series
such that \(0<\sum _{k=-\infty }^{\infty } d_{k} z^{k}<+\infty\) for all \(z \in \left[ \lambda _{1}, \lambda _{n}\right]\). Here, \(\lambda _{1}\) and \(\lambda _{n}\) are the smallest and largest eigenvalues of A, respectively. Furthermore, \(\alpha _{k}^{H}\) reduces to \(\alpha _{k}^{DY}\) when \(\psi (A)=I\). The property of two dimensional quadratic termination has shown great potential in improving the performance of gradient methods, see [5, 24, 25] for example.
To our knowledge, there is still a lack of theoretical analysis for the two dimensional quadratic termination of the Dai–Yang method [26], whose stepsize is given by
$$\begin{aligned} \alpha _{k}^{P}=\frac{\left\| g_{k}\right\| _{2}}{\left\| Ag_{k}\right\| _{2}}. \end{aligned}$$
A remarkable property of the Dai–Yang method is that \(\alpha _{k}^{P}\) converges to the optimal stepsize \(\frac{2}{\lambda _{1}+\lambda _{n}}\) (in the sense that it minimizes the modulus \(\Vert I-\alpha A\Vert _{2}\), see [26, 27]). Moreover, the Dai–Yang method is able to find the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\).
In this paper, for a uniform analysis, we consider equipping the family
$$\begin{aligned} \alpha _{k}=\frac{\left\| \psi (A) g_{k}\right\| _{2}}{\left\| A \psi (A) g_{k}\right\| _{2}} \end{aligned}$$(9)
with the two dimensional quadratic termination property, which will be achieved by incorporating the new stepsize
$$\begin{aligned} \tilde{\alpha }_{k}=\arg \max _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}. \end{aligned}$$(10)
The above strategy of maximizing the stepsize value in the next iteration has been employed in [28] for the SD method. However, the analysis in [28] cannot be directly applied to the family (9). Clearly, \(\alpha _{k}^{P}\) corresponds to the case \(\psi (A) = I\) in (9). We prove that each method in the family (9) will asymptotically alternate in a two dimensional subspace associated with the two eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\). In addition, for any given \(\psi\), the stepsize (9) tends to the above optimal stepsize as \(k\rightarrow \infty\), and the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\) can be obtained. Then, we show that \(\lim _{k \rightarrow \infty }\tilde{\alpha }_{k} =1 /\lambda _{n}\) for n dimensional strictly convex quadratics. By adaptively taking \(\alpha _{k}^{BB1}\) and reusing \(\tilde{\alpha }_{k-1}\) for some iterations, we propose a new method for quadratic minimization problems. It is proved that the proposed method is R-linearly convergent at the rate \(1-1/\kappa\), where \(\kappa =\lambda _{n} / \lambda _{1}\) is the condition number of A. Our numerical comparisons with the BB1 [8], DY [23], SL (Alg.1 in [25]), ABBmin2 [28], SDC [29], and MGC [7] methods for solving unconstrained random and non-random quadratic optimization problems demonstrate that the proposed method is very efficient. Further, numerical experiments on quadratic problems whose Hessians are chosen from the SuiteSparse Matrix Collection [30] suggest that the proposed method is very competitive with the above methods.
The paper is organized as follows. In Sect. 2, we derive the new stepsize \(\tilde{\alpha }_{k}\) and analyze its properties. The asymptotic behavior of the family (9) is also analyzed. Our new algorithm for quadratic minimization problems as well as its R-linear convergence are presented in Sect. 3. Section 4 presents some numerical comparisons of the proposed method and other successful gradient methods on solving quadratic problems. Finally, in Sect. 5 we give some concluding remarks.
2 A new stepsize and its properties
In this section, we derive the formula of \(\tilde{\alpha }_{k}\) and analyze its properties.
To obtain \(\tilde{\alpha }_{k}\), we consider maximizing the square of the next stepsize in (9), i.e.
$$\begin{aligned} F(\alpha _{k})=\frac{g_{k+1}^{T} \psi ^{2}(A) g_{k+1}}{g_{k+1}^{T} A^{2} \psi ^{2}(A) g_{k+1}},\quad g_{k+1}=\left( I-\alpha _{k} A\right) g_{k}. \end{aligned}$$(11)
The maximum value of \(F(\alpha _{k})\) is achieved when \(F'(\alpha _{k}) = 0\), which holds for any \({\alpha }_{k}\) satisfying
$$\begin{aligned} \phi _{1} \alpha _{k}^{2}-\phi _{2} \alpha _{k}+\phi _{3}=0, \end{aligned}$$(12)
where \(\phi _{1}=c_{1} c_{4}-c_{2} c_{3}\), \(\phi _{2}=c_{0} c_{4}-c_{2}^{2}\) and \(\phi _{3}=c_{0} c_{3}-c_{1}c_{2}\) with
$$\begin{aligned} c_{j}=g_{k}^{T} \psi ^{2}(A) A^{j} g_{k},\quad j=0,1,\ldots ,4. \end{aligned}$$(13)
The following lemma guarantees that (12) has two roots.
Lemma 1
Assume \(g_{k} \ne 0\) and \(g_{k}\) is not parallel to \(Ag_{k}\). Then, \(\phi _{1}, \phi _{2}, \phi _{3} > 0\) and \(\phi _{2}^2 - 4\phi _{1}\phi _{3} > 0\).
Proof
It follows from the Cauchy–Schwarz inequality and (13) that
$$\begin{aligned} c_{j+1}^{2}=\left( g_{k}^{T} \psi ^{2}(A) A^{j+1} g_{k}\right) ^{2}<\left( g_{k}^{T} \psi ^{2}(A) A^{j} g_{k}\right) \left( g_{k}^{T} \psi ^{2}(A) A^{j+2} g_{k}\right) =c_{j} c_{j+2} \end{aligned}$$
for \(j\ge 0\). That is, \(c_{j} / c_{j+1} > c_{j+1}/ c_{j+2}\), which implies \(\phi _{1}, \phi _{2}, \phi _{3} > 0.\)
By direct calculation, we obtain \(c_{1} \phi _{2} = c_{0} \phi _{1} + c_{2} \phi _{3}\), which implies that
$$\begin{aligned} c_{1}^{2} \phi _{2}^{2}=\left( c_{0} \phi _{1}+c_{2} \phi _{3}\right) ^{2} \ge 4 c_{0} c_{2} \phi _{1} \phi _{3}. \end{aligned}$$
Combining with \(c_{0} c_{2}>c_{1}^{2}\), we have \(\phi _{2}^2 - 4\phi _{1}\phi _{3} > 0\). This completes the proof.\(\square\)
By the quadratic formula, we get the two roots of (12) as
$$\begin{aligned} \tilde{\alpha }_{k}=\frac{\phi _{2}-\sqrt{\phi _{2}^{2}-4 \phi _{1} \phi _{3}}}{2 \phi _{1}} \end{aligned}$$(14)
and
$$\begin{aligned} \hat{\alpha }_{k}=\frac{\phi _{2}+\sqrt{\phi _{2}^{2}-4 \phi _{1} \phi _{3}}}{2 \phi _{1}}. \end{aligned}$$(15)
It is easy to see that \(\tilde{\alpha }_{k} = \arg \max _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}\) and \(\hat{\alpha }_{k} =\arg \min _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}\).
The next theorem presents the two dimensional quadratic termination of the gradient method using \(\alpha _{k}\) in (9) and \(\tilde{\alpha }_{k}\).
Theorem 1
(Two dimensional quadratic termination) Consider the gradient method (1) for minimizing the two dimensional quadratic function (4). If the stepsize \(\alpha _{k}\) is given by (9) for all \(k \ne k_{0}\) and \(\alpha _{k_{0}}=\tilde{\alpha }_{k_{0}}\) at the \(k_{0}\)-th iteration where \(k_{0} \ge 1\), it holds that \(g_{k_{0}+i}=0\) for some \(1 \le i \le 3\).
Proof
Without loss of generality, we assume that \(A={\text {diag}}\{1, \lambda \}\) with \(\lambda >0\). Let \(g_{k}^{(1)}\) and \(g_{k}^{(2)}\) be the first and second components of \(g_{k}\), respectively. Notice that
$$\begin{aligned} c_{j}=\psi ^{2}(1)\left( g_{k}^{(1)}\right) ^{2}+\lambda ^{j} \psi ^{2}(\lambda )\left( g_{k}^{(2)}\right) ^{2},\quad j=0,1,\ldots ,4. \end{aligned}$$
After direct calculation and simplification, we get
$$\begin{aligned} \phi _{1}=\lambda (\lambda +1)(\lambda -1)^{2}\omega _{k},\quad \phi _{2}=(\lambda +1)^{2}(\lambda -1)^{2}\omega _{k},\quad \phi _{3}=(\lambda +1)(\lambda -1)^{2}\omega _{k}, \end{aligned}$$
where \(\omega _{k}=\psi ^{2}(1)\psi ^{2}(\lambda )\left( g_{k}^{(1)}\right) ^{2}\left( g_{k}^{(2)}\right) ^{2}\).
Therefore, we obtain
$$\begin{aligned} \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}=(\lambda +1)(\lambda -1)^{3}\omega _{k}. \end{aligned}$$
Thus, from (14) we know that \(\tilde{\alpha }_{k} = 1/\lambda\) for all \(k\ge 1\). The conclusion follows immediately from \(\alpha _{k_{0}}=\tilde{\alpha }_{k_{0}}\) and \(g_{k+1}=\left( I-\alpha _{k}A\right) g_{k}\). We complete the proof.\(\square\)
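The identity \(\tilde{\alpha }_{k}=1/\lambda\) is easy to check numerically. The sketch below (our own code, taking \(\psi (A)=I\), i.e. the Dai–Yang case) computes the moments \(c_{j}\), the \(\phi\)'s, and the smaller root of the resulting quadratic for \(A={\text {diag}}\{1,\lambda \}\); the result equals \(1/\lambda\) regardless of the gradient:

```python
# Check of the 2D identity behind Theorem 1, with psi(A) = I (our own sketch):
# the smaller root of phi1*a^2 - phi2*a + phi3 = 0 equals 1/lam for ANY g.

def alpha_tilde_2d(g1, g2, lam):
    # moments c_j = g^T A^j g for A = diag(1, lam)
    c = [g1 * g1 + lam ** j * g2 * g2 for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] * c[2]
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = phi2 * phi2 - 4.0 * phi1 * phi3    # positive by Lemma 1
    return (phi2 - disc ** 0.5) / (2.0 * phi1)

print(alpha_tilde_2d(2.0, 3.0, 5.0))   # 0.2 = 1/5, independent of the gradient
print(alpha_tilde_2d(1.0, 7.0, 5.0))   # also 0.2
```

Changing the gradient components only rescales \(\omega _{k}\), so the two roots \(1/\lambda\) and 1 are unchanged, exactly as the proof shows.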
In what follows, we shall prove that \(\tilde{\alpha }_{k}\) converges to \(1/\lambda _{n}\) under each method in the family (9). To this aim, we have to analyze the asymptotic behavior of the family (9) first.
For convenience, we assume without loss of generality that the matrix A is diagonal with distinct eigenvalues, i.e.
$$\begin{aligned} A={\text {diag}}\left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\right\} ,\quad 0<\lambda _{1}<\lambda _{2}<\cdots <\lambda _{n}. \end{aligned}$$(16)
Let \(\left\{ \xi _{1}, \xi _{2}, \ldots , \xi _{n}\right\}\) be the set of orthogonal eigenvectors associated with the eigenvalues \(\left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\right\}\). Denote by \(\mu _{k}^{(i)}, i=1, \ldots , n\), the components of \(g_{k}\) along \(\xi _{i}\), i.e.
$$\begin{aligned} g_{k}=\sum _{i=1}^{n} \mu _{k}^{(i)} \xi _{i}. \end{aligned}$$(17)
It follows from (1) and (17) that
$$\begin{aligned} \mu _{k+1}^{(i)}=\left( 1-\alpha _{k} \lambda _{i}\right) \mu _{k}^{(i)}, \end{aligned}$$(18)
where
$$\begin{aligned} \alpha _{k}=\left( \frac{\sum _{i=1}^{n} \psi ^{2}\left( \lambda _{i}\right) \left( \mu _{k}^{(i)}\right) ^{2}}{\sum _{i=1}^{n} \lambda _{i}^{2} \psi ^{2}\left( \lambda _{i}\right) \left( \mu _{k}^{(i)}\right) ^{2}}\right) ^{\frac{1}{2}}. \end{aligned}$$(19)
The following lemma is useful in our analysis.
Lemma 2
[26] Let p be a vector in \(\mathbb {R}^{n}\) such that (i) \(p^{(1)}>0\) and \(p^{(n)}>0\); (ii) \(p^{(1)} + p^{(n)} = 1\). Further assume that \(0<\lambda _{1}<\cdots <\lambda _{n}\). Consider a transformation T such that
Then
where
Based on Lemma 2, we are able to show that each method in the family (9) will zigzag in a two dimensional subspace, which generalizes the results in [26], where the case \(\psi (A)=I\) (i.e. the Dai–Yang method) is considered.
Theorem 2
Assume that the starting point \(x_{0}\) is such that
Let \(\{x_{k} \}\) be the sequence generated by applying a method in the family (9). Then
and
where \(h_{1}\) and \(h_{2}\) are defined in (20). Further, the vectors
tend to be the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\) of A, respectively.
Proof
By (17), we get
where \(\eta _{k}^{(i)} = \psi (\lambda _{i}) \mu _{k}^{(i)}\), which together with (19) yields
Define the vector \(p_{k}=\left( p_{k}^{(i)}\right)\) with
and
Then we have from (24), (25) and (26) that
Clearly, according to the definition of \(p_{k}\), we get that \(p_{k}^{(i)}\ge 0\) for all i and
Let \(p=p_{1} \in \mathbb {R}^{n}\). Based on the above analysis and Lemma 2, we know that \(\lim _{k \rightarrow \infty } p_{k}=\left( h_{1}, 0, \ldots , 0, h_{2}\right) ^{T}\), where \(h_{1}\) and \(h_{2}\) are given in (20). It follows from (24) and \(\lambda _{n}^{-1}<\alpha _{k}<\lambda _{1}^{-1}\) that
and
Thus, by Lemma 2, (27) and (28), we know that (21) and (22) hold.
Furthermore, combining (23), (27) and (28), we find that
and
This completes our proof.\(\square\)
From Theorem 2, we have the following asymptotic result of the stepsize (9).
Corollary 1
Under the conditions of Theorem 2, for \(\alpha _{k}\) in (9) it holds that
$$\begin{aligned} \lim _{k \rightarrow \infty } \alpha _{k}=\frac{2}{\lambda _{1}+\lambda _{n}}. \end{aligned}$$(29)
Next theorem shows that \(\tilde{\alpha }_{k}\) converges to \(1/\lambda _{n}\) under each method in the family (9).
Theorem 3
Under the conditions of Theorem 2, let \(\{g_{k}\}\) be the sequence generated by applying a method in the family (9) to minimize the n-dimensional quadratic function (4). Then \(\lim _{k \rightarrow \infty }\tilde{\alpha }_{k}=1/\lambda _{n}\).
Proof
When k is odd, by the definition of \(\phi _{1}\), (22) and (29), we get
Similarly, we have
and
Combining (30), (31) and (32), we obtain
$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\phi _{1}}{\phi _{3}}=\lambda _{1} \lambda _{n}\quad \text{ and } \quad \lim _{k \rightarrow \infty } \frac{\phi _{2}}{\phi _{3}}=\lambda _{1}+\lambda _{n}. \end{aligned}$$(33)
It follows from (33) and (14) that \(\lim _{k \rightarrow \infty } \tilde{\alpha }_{k}=1/\lambda _{n}\). When k is even, we get the desired result in the same manner as above. This completes our proof. \(\square\)
From (33), we see that the limits of \(\phi _{1}/\phi _{3}\) and \(\phi _{2}/\phi _{3}\) are independent of \(\psi (A)\). The following example shows that \(\tilde{\alpha }_{k}\) converges to \(1 / \lambda _{n}\) under both the MG and Dai–Yang methods. In particular, we applied the MG and Dai–Yang methods to the quadratic function (4) with
$$\begin{aligned} A={\text {diag}}\left\{ a_{1}, a_{2}, \ldots , a_{n}\right\} , \end{aligned}$$(34)
where \(a_{1}=1,~a_{n}=n\) and \(a_{i}\) was randomly generated in (1, n), \(i=2, \ldots , n-1\). The starting point was set to the vector of all ones. From Fig. 1, we see that \(\tilde{\alpha }_{k}\) approximates \(1 / \lambda _{n}\) with satisfactory accuracy after a small number of iterations under both the MG and Dai–Yang methods.
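This behavior is easy to reproduce. The sketch below (our own small-scale reconstruction, with \(\psi (A)=I\) and a fixed diagonal \(A={\text {diag}}\{1,\ldots ,5\}\) rather than the paper's random spectrum) runs the MG method and evaluates \(\tilde{\alpha }_{k}\) at the final gradient:

```python
# Our own small-scale version of the experiment: MG iterations on
# f(x) = 0.5 x^T A x with A = diag(1,...,5); alpha~ should approach 1/5.

def alpha_tilde(diag_a, g):
    c = [sum(a ** j * gi * gi for a, gi in zip(diag_a, g)) for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] * c[2]
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = max(phi2 * phi2 - 4.0 * phi1 * phi3, 0.0)   # guard rounding
    return (phi2 - disc ** 0.5) / (2.0 * phi1)

diag_a = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0] * 5                          # starting point of all ones
for k in range(200):
    g = [a * xi for a, xi in zip(diag_a, x)]
    # MG stepsize g^T A g / g^T A^2 g
    alpha = sum(a * gi * gi for a, gi in zip(diag_a, g)) / \
            sum(a * a * gi * gi for a, gi in zip(diag_a, g))
    x = [xi - alpha * gi for xi, gi in zip(x, g)]

g = [a * xi for a, xi in zip(diag_a, x)]
print(alpha_tilde(diag_a, g))          # close to 1/lambda_n = 0.2
```

As the MG iterates align with the subspace spanned by the extreme eigenvectors, \(\tilde{\alpha }_{k}\) settles near \(1/\lambda _{n}\), mirroring the behavior reported in Fig. 1.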
3 A new gradient method
In this section, we propose a new algorithm for unconstrained quadratic optimization and present its R-linear convergence result.
Extensive studies indicate that adaptively choosing a short stepsize or \(\alpha _{k}^{BB1}\) at each iteration is numerically better than the original BB method; see for example [28, 31,32,33,34,35]. Now we show that the new stepsize \(\tilde{\alpha }_{k}\) is a short one.
Theorem 4
Under the conditions of Lemma 1, it holds that \(\tilde{\alpha }_{k} < c_{2}/c_{3}\).
Proof
According to (13) and (14), we get
$$\begin{aligned} \frac{c_{2}}{c_{3}}-\tilde{\alpha }_{k}=\frac{c_{2}\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}+\left( c_{2}\phi _{2}-2c_{3}\phi _{3}\right) }{c_{3}\left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) }. \end{aligned}$$(35)
If \(c_{2}\phi _{2} - 2 c_{3} \phi _{3} > 0\), we have \(c_{2}/c_{3} - \tilde{\alpha }_{k} > 0\); otherwise, it follows from (35) that
$$\begin{aligned} \frac{c_{2}}{c_{3}}-\tilde{\alpha }_{k} =\frac{c_{2}^{2}\left( \phi _{2}^{2}-4\phi _{1}\phi _{3}\right) -\left( c_{2}\phi _{2}-2c_{3}\phi _{3}\right) ^{2}}{c_{3}\left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) \Delta } =\frac{4\phi _{3}^{2}\left( c_{2}c_{4}-c_{3}^{2}\right) }{c_{3}\left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) \Delta }, \end{aligned}$$(36)
where the last equality is due to \(c_{3}\phi _{2} = c_{2}\phi _{1} +c_{4} \phi _{3}\) and \(\Delta = c_{2} \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}} -\left( c_{2}\phi _{2} - 2 c_{3}\phi _{3}\right) >0\). From \(\phi _{1}, \phi _{3}>0\) and \(c_{2}c_{4}-c_{3}^2 > 0\), we know that \(\tilde{\alpha }_{k}<c_{2}/c_{3}\) holds. This completes our proof.\(\square\)
Based on the above analysis, we can develop a gradient method using \(\tilde{\alpha }_{k}\) and \(\alpha _{k}^{BB1}\) in an adaptive way. Notice that reusing a retarded short stepsize for some iterations can reduce the computational cost and yield better performance, see [29, 36, 25, 13, 7]. So, we suggest combining the adaptive and cyclic schemes with \(\alpha _{k}^{BB1}\) and \(\tilde{\alpha }_{k-1}\). In particular, our method reuses \(\tilde{\alpha }_{k-1}\) for r iterations when \(\alpha _{k}^{\mathrm {BB} 2} / \alpha _{k}^{\mathrm {BB} 1}<\tau\) for some \(\tau \in (0,1)\); otherwise, we set \(\alpha _{k} = \alpha _{k}^{BB1}\). We use t as the index to keep track of the number of short stepsizes chosen during the iterative process. The stepsize for our algorithm is summarized as
We mention that the parameter \(\tau\) can be chosen dynamically as in [24]. However, in our tests the dynamic scheme does not show much advantage over the above fixed one. Our method is formally presented in Algorithm 1.
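To make the adaptive cyclic scheme concrete, here is a simplified sketch in our own notation (with \(\psi (A)=I\) and a diagonal A; the counter t and other bookkeeping of the actual Algorithm 1 are reduced to a plain reuse counter, so details may differ from the paper's pseudocode):

```python
# Simplified sketch of Algorithm 1 (our own reading, psi(A) = I, diagonal A):
# take alpha_BB1 by default; when alpha_BB2/alpha_BB1 < tau, compute the short
# stepsize alpha~_{k-1} from the previous gradient and reuse it for r steps.

def algorithm1(diag_a, x0, tau=0.3, r=5, tol=1e-9, max_iter=5000):
    x = list(x0)
    g = [a * xi for a, xi in zip(diag_a, x)]
    g0_norm = sum(gi * gi for gi in g) ** 0.5
    g_prev, reuse_left, alpha_short = None, 0, None
    for k in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 <= tol * g0_norm:
            return x, k
        if g_prev is None:
            # first step: exact steepest-descent stepsize
            alpha = sum(gi * gi for gi in g) / sum(a * gi * gi for a, gi in zip(diag_a, g))
        elif reuse_left > 0:
            alpha, reuse_left = alpha_short, reuse_left - 1
        else:
            # moments of the previous gradient; BB1 = c0/c1, BB2 = c1/c2
            c = [sum(a ** j * gi * gi for a, gi in zip(diag_a, g_prev)) for j in range(5)]
            bb1, bb2 = c[0] / c[1], c[1] / c[2]
            phi1 = c[1] * c[4] - c[2] * c[3]
            if bb2 / bb1 < tau and phi1 > 0.0:
                phi2 = c[0] * c[4] - c[2] * c[2]
                phi3 = c[0] * c[3] - c[1] * c[2]
                disc = max(phi2 * phi2 - 4.0 * phi1 * phi3, 0.0)
                alpha_short = (phi2 - disc ** 0.5) / (2.0 * phi1)
                alpha, reuse_left = alpha_short, r - 1
            else:
                alpha = bb1
        g_prev = g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        g = [a * xi for a, xi in zip(diag_a, x)]
    return x, max_iter

sol, iters = algorithm1([float(i) for i in range(1, 21)], [1.0] * 20)
print(iters)
```

The guard on \(\phi _{1}\) simply falls back to the BB1 stepsize in the degenerate case where the gradient is (numerically) an eigenvector.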
To establish R-linear convergence of Algorithm 1, we first show the boundedness of \(\tilde{\alpha }_{k}\).
Let us denote
$$\begin{aligned} f_{1}(\alpha )=\frac{c_{0}-2 c_{1} \alpha +c_{2} \alpha ^{2}}{c_{1}-2 c_{2} \alpha +c_{3} \alpha ^{2}} \end{aligned}$$
and
$$\begin{aligned} f_{2}(\alpha )=\frac{c_{1}-2 c_{2} \alpha +c_{3} \alpha ^{2}}{c_{2}-2 c_{3} \alpha +c_{4} \alpha ^{2}}. \end{aligned}$$
Using the same arguments as for (12) and Lemma 1, we know that both \(f_{1}'(\alpha )=0\) and \(f_{2}'(\alpha )=0\) have two roots. In addition, the roots of \(f_{1}'(\alpha )=0\), say \(\tilde{\beta }_{k}\) and \(\hat{\beta }_{k}\) with \(\tilde{\beta }_{k}\le \hat{\beta }_{k}\), are the solutions of
$$\begin{aligned} \phi _{4} \alpha ^{2}-\phi _{3} \alpha +\phi _{5}=0, \end{aligned}$$(37)
and the roots of \(f_{2}'(\alpha )=0\), say \(\tilde{\gamma }_{k}\) and \(\hat{\gamma }_{k}\) with \(\tilde{\gamma }_{k}\le \hat{\gamma }_{k}\), satisfy
$$\begin{aligned} \phi _{6} \alpha ^{2}-\phi _{1} \alpha +\phi _{4}=0, \end{aligned}$$(38)
where \(\phi _{1}\), \(\phi _{3}\) are defined in the previous section, and \(\phi _{4}=c_{1} c_{3}-c_{2}^{2}\), \(\phi _{5}=c_{0} c_{2}-c_{1}^{2}\) and \(\phi _{6}=c_{2} c_{4}-c_{3}^{2}\). Recall that \(F(\alpha )=f_{1}(\alpha ) f_{2}(\alpha )\). Since \(f_{1}(\alpha )>0\) and \(f_{2}(\alpha )>0\), any root of \(F'(\alpha ) = f_{1}'(\alpha )f_{2}(\alpha ) + f_{1}(\alpha )f_{2}'(\alpha )=0\) yields
$$\begin{aligned} \frac{f_{1}'(\alpha )}{f_{1}(\alpha )}=-\frac{f_{2}'(\alpha )}{f_{2}(\alpha )}. \end{aligned}$$(40)
If the root is such that \(f_{1}'(\alpha )\le 0\) and \(f_{2}'(\alpha )\ge 0\), we must have
$$\begin{aligned} \tilde{\beta }_{k} \le \alpha \le \hat{\beta }_{k}. \end{aligned}$$(41)
Similarly, for the case \(f_{1}'(\alpha )\ge 0\) and \(f_{2}'(\alpha )\le 0\), we get
$$\begin{aligned} \tilde{\gamma }_{k} \le \alpha \le \hat{\gamma }_{k}. \end{aligned}$$(42)
Now we are ready to prove that \(\tilde{\alpha }_{k}\) is bounded between \(1/\lambda _{n}\) and \(1/\lambda _{1}\).
Theorem 5
Under the conditions of Lemma 1, it holds that \(1/\lambda _{n}\le \tilde{\alpha }_{k} \le 1/\lambda _{1}\).
Proof
Since \(F'(\tilde{\alpha }_{k}) =0\), when \(f_{1}'(\tilde{\alpha }_{k})\le 0\), it follows from (41) that
$$\begin{aligned} \tilde{\beta }_{k} \le \tilde{\alpha }_{k} \le \hat{\beta }_{k}. \end{aligned}$$
For the case \(f_{1}'(\tilde{\alpha }_{k})\ge 0\), by (42), we get
$$\begin{aligned} \tilde{\gamma }_{k} \le \tilde{\alpha }_{k} \le \hat{\gamma }_{k}. \end{aligned}$$
Thus, we only need to prove that \(1/\lambda _{n}\le \hat{\beta }_{k},~ \tilde{\beta }_{k},~\hat{\gamma }_{k},~ \tilde{\gamma }_{k} \le 1/\lambda _{1}\).
To proceed, we first show that
$$\begin{aligned} \hat{\beta }_{k}=f_{1}(\tilde{\beta }_{k}). \end{aligned}$$(43)
Since \(f_{1}'(\tilde{\beta }_{k}) = 0\), we have
which implies that
where the last equality comes from the fact that \(\tilde{\beta }_{k}\hat{\beta }_{k}=\phi _{5}/\phi _{4}\).
In order to prove (43), it suffices to show that
that is,
which holds due to \(\tilde{\beta }_{k}+\hat{\beta }_{k}=\phi _{3}/\phi _{4}\) and the definitions of \(c_{1}, c_{2}, c_{3}, \phi _{3}, \phi _{4}, \phi _{5}\).
Thus, (43) is true. It follows from the definition of \(f_1\) and the Rayleigh quotient property that \(1/\lambda _{n} \le \hat{\beta }_{k}=f_{1}(\tilde{\beta }_{k})\le 1/\lambda _{1}\).
Using the same arguments as above, we get \(1/\lambda _{n} \le \tilde{\beta }_{k},~\tilde{\gamma }_{k},~\hat{\gamma }_{k} \le 1/\lambda _{1}\). This completes the proof. \(\square\)
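Theorems 4 and 5 can be illustrated numerically. The sketch below (our own, with \(\psi (A)=I\) and a diagonal A, without loss of generality) checks that \(\tilde{\alpha }_{k}\) lies in \([1/\lambda _{n}, 1/\lambda _{1}]\) and below \(c_{2}/c_{3}\) for several gradients:

```python
# Numerical illustration of Theorems 4 and 5 with psi(A) = I (our own sketch).

def moments_and_alpha(diag_a, g):
    c = [sum(a ** j * gi * gi for a, gi in zip(diag_a, g)) for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] * c[2]
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = phi2 * phi2 - 4.0 * phi1 * phi3
    return c, (phi2 - disc ** 0.5) / (2.0 * phi1)

diag_a = [1.0, 3.0, 7.0, 10.0]          # lambda_1 = 1, lambda_n = 10
for g in ([1.0, 2.0, -1.0, 0.5], [4.0, -3.0, 2.0, 1.0], [0.1, 1.0, 1.0, 0.1]):
    c, at = moments_and_alpha(diag_a, g)
    assert 0.1 - 1e-12 <= at <= 1.0 + 1e-12   # Theorem 5
    assert at < c[2] / c[3]                    # Theorem 4
    print(round(at, 4))
```

The gradients here are arbitrary test vectors; any gradient not parallel to \(Ag\) satisfies the hypotheses of Lemma 1, so the two bounds hold throughout.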
In [37], the authors prove that any gradient method with stepsizes satisfying the following Property B converges R-linearly with rate \(1-\lambda _{1}/M_1\), which implies a \(1-1/\kappa\) rate when \(M_1\le \lambda _n\). Similar results for gradient methods satisfying Property A in [38] can be found in [39]. However, a stepsize satisfying Property B may not meet the conditions of Property A.
Property B: We say that the stepsize \(\alpha _{k}\) has Property B if there exist an integer m and a positive constant \(M_{1} \ge \lambda _{1}\) such that
- (i) \(\lambda _{1} \le \alpha _{k}^{-1} \le M_{1}\);
- (ii) for some real analytic function \(\psi\) defined as in (7) and \(v(k) \in \{k,k-1,\ldots ,\max \{k-m+1,0\}\}\),
$$\begin{aligned} \alpha _{k} \le \frac{g_{v(k)}^{T} \psi ^{2}(A) g_{v(k)}}{g_{v(k)}^{T} A\psi ^{2}(A) g_{v(k)}}. \end{aligned}$$(44)
Clearly, \(\alpha _k^{BB1}\) satisfies Property B with \(M_1=\lambda _n\), \(m=2\) and \(\psi ^{2}(A)=I\), and the new stepsize \(\tilde{\alpha }_{k}\) satisfies Property B with \(M_1=\lambda _n\). So, we are able to show R-linear convergence of Algorithm 1 with the \(1-1/\kappa\) rate, following Theorem 2 in [37]. For completeness, we include the proof here.
Theorem 6
Suppose that the sequence \(\left\{ g_{k}\right\}\) is generated by Algorithm 1. Then either \(g_{k} = 0\) for some finite k or the sequence \(\left\{ \left\| g_{k}\right\| _{2}\right\}\) converges to zero R-linearly in the sense that
where \({\theta }=1-1/\kappa\) and
with \(\sigma _{i}=\max \left\{ \frac{\lambda _{i}}{\lambda _{1}}-1,1-\frac{\lambda _{i}}{\lambda _{n}}\right\}\).
Proof
It follows from \(g_{k+1}^{(i)}=\left( 1-\alpha _{k} \lambda _{i}\right) g_{k}^{(i)}\) and (i) of Property B that
$$\begin{aligned} \left| g_{k+1}^{(i)}\right| \le \sigma _{i}\left| g_{k}^{(i)}\right| . \end{aligned}$$(46)
Clearly, (45) holds for \(i=1\).
In what follows, we prove (45) by induction on i. Assume that (45) holds for all \(1 \le i \le L-1\) with \(L \in \{2, \ldots , n\}\). When \(i=L\), it follows from the definition of \(C_{L}\) that (45) holds for \(k = 0, 1,\ldots , r-1\). Assume by contradiction that (45) does not hold for \(k\ge r\) when \(i=L\). Let \(\hat{k}\ge r\) be the minimal index such that \(\left| g_{\hat{k}}^{(L)}\right| > C_{L} \theta ^{\hat{k}}\). Then, if \(\alpha _{\hat{k}-1} \lambda _{L} \le 1\), we get
which contradicts our assumption. Thus we must have \(\alpha _{\hat{k}-1} \lambda _{L} > 1\), which combines with (46) and \(\theta <1\) gives that
where \(G(k, l)=\sum _{i=1}^{l}\left( \psi \left( \lambda _{i}\right) g_{k}^{(i)}\right) ^{2}\). This together with Theorem 4 yields that
Using similar arguments, we can show \(\left| 1-\alpha _{\hat{k}}^{BB1}\lambda _{L}\right| \le \theta\). Thus, \(\left| g_{\hat{k}}^{(L)}\right| \le \theta \left| g_{\hat{k}-1}^{(L)}\right| \le C_{L} \theta ^{\hat{k}}\) which contradicts our assumption. Hence (45) holds for all i. We complete the proof. \(\square\)
4 Numerical experiments
In this section, we provide numerical experiments of Algorithm 1 on solving quadratic problems. All our codes are written in MATLAB R2016b and carried out on a PC with an AMD Ryzen 5 PRO 2500U, 2.00 GHz processor and 8 GB of RAM running Windows 10 system.
For Algorithm 1, we use \(\tilde{\alpha }_{k}\) with \(\psi (A) =I\). If we compute the vector \(z = Ag_{k}\) at each iteration and keep \(w = Ag_{k-1}\) in memory, then, since \(g_{k}=g_{k-1}-\alpha _{k-1}w\) implies \(Aw=(w-z)/\alpha _{k-1}\), the quantities needed by \(\tilde{\alpha }_{k-1}\) can be expressed as
$$\begin{aligned} c_{0}=g_{k-1}^{T} g_{k-1},\quad c_{1}=g_{k-1}^{T} w,\quad c_{2}=w^{T} w,\quad c_{3}=\frac{w^{T}(w-z)}{\alpha _{k-1}},\quad c_{4}=\frac{\Vert w-z\Vert _{2}^{2}}{\alpha _{k-1}^{2}}. \end{aligned}$$
Hence only one matrix-vector product per iteration is required in Algorithm 1.
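This saving can be verified directly. In the sketch below (our own code, with a small dense SPD matrix and hypothetical values), the identity \(Aw=(w-z)/\alpha _{k-1}\) recovers \(Aw\), and hence the moments \(c_{0},\ldots ,c_{4}\) of \(g_{k-1}\), without a second product with A:

```python
# Our own check of the single matrix-vector-product remark: with z = A g_k and
# w = A g_{k-1}, the update g_k = g_{k-1} - alpha_{k-1} w gives
# A w = (w - z)/alpha_{k-1}, so the moments c_j of g_{k-1} need no extra product.

def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]              # symmetric positive definite (example)
g_prev = [1.0, -2.0, 0.5]          # g_{k-1} (hypothetical values)
alpha_prev = 0.3                   # alpha_{k-1} (any nonzero value works)

w = matvec(A, g_prev)              # A g_{k-1}, kept in memory
g = [gi - alpha_prev * wi for gi, wi in zip(g_prev, w)]
z = matvec(A, g)                   # A g_k, computed anyway at iteration k

Aw = [(wi - zi) / alpha_prev for wi, zi in zip(w, z)]   # no extra product
c = [dot(g_prev, g_prev), dot(g_prev, w), dot(w, w), dot(w, Aw), dot(Aw, Aw)]
print(c)

# direct computation agrees
assert all(abs(a - b) < 1e-10 for a, b in zip(Aw, matvec(A, w)))
```

Here \(c_{3}=g_{k-1}^{T}A^{3}g_{k-1}=w^{T}(Aw)\) and \(c_{4}=(Aw)^{T}(Aw)\), so all five moments come from inner products of vectors already at hand.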
Firstly, we compare Algorithm 1 with the BB1 [8], DY [23], ABBmin2 [28], SDC [29] and MGC [7] methods and the Alg.1 method in [25] (we denote this method by SL) on solving the following quadratic problem [22]:
where \(V = \mathrm {diag}\left\{ v_{1}, \ldots , v_{n}\right\}\) is a diagonal matrix and \(x^{*}\) is randomly generated with components in \([-10,10]\). Five different distributions of the diagonal elements \(v_{j}, j=1,2, \ldots , n\), are generated, see Table 1.
The parameter m in SL uses the value \(m=5\) for the second problem set and \(m = 6\) for other sets to get good performance. For ABBmin2, \(\tau\) is set to 0.9 as suggested in [28]. For SDC and MGC, the parameter pairs (h, s) and \((d_{1}, d_{2})\) are all set to (8, 6), which are more efficient than other choices for this test. The parameters in Algorithm 1 are set to \(\tau = 0.3\) and \(r = 5\).
For all comparison methods, the stopping condition is \(\left\| g_{k}\right\| _{2} \le \epsilon \left\| g_{0}\right\| _{2}\), where \(\epsilon >0\) is a given tolerance. The problem dimension is set to \(n = 1000\). For each problem set, three different tolerance parameters \(\epsilon = 10^{-6}, 10^{-9}, 10^{-12}\) and condition numbers \(\kappa = 10^{4}\), \(10^{5}, 10^{6}\) are tested. For each value of \(\kappa\) or \(\epsilon\), 10 starting points are randomly generated in \([-10, 10]\) and the average results are presented in Table 2.
We see that Algorithm 1 clearly outperforms the BB1, DY, SDC and MGC methods. As compared with the SL method, Algorithm 1 is usually faster when a high accuracy is required. Moreover, Algorithm 1 often performs better than ABBmin2 on the second to fourth problem sets and is very competitive with ABBmin2 on the first and last problem sets.
Next, we compare the above methods on problem (4), where \(b = Ax^{*}\) and \(x^{*}\) is a random vector as before. We test 14 sparse matrices from the SuiteSparse Matrix Collection [30] listed in Table 3. The iteration stops when \(\left\| g_{k}\right\| _{2} \le 10^{-6}\left\| g_{0}\right\| _{2}\). In our test, we choose the parameters so that each method achieves the best performance. Specifically, we set \(m = 4\) for SL, \((h, s) = (3, 4)\) for SDC, \((d_{1},d_{2}) = (4,4)\) for MGC, \(\tau = 0.3\) for ABBmin2, and \(\tau = 0.1\) and \(r = 5\) for Algorithm 1.
For each matrix, ten initial points with entries between \(-10\) and 10 are randomly generated. The average numbers of iterations of the compared methods are listed in Table 4. We see that for most matrices Algorithm 1 performs better than BB1, DY, SL, MGC and SDC, and is competitive with ABBmin2 in terms of the number of iterations. Table 5 lists the average CPU time in seconds for those methods. We observe that Algorithm 1 takes less CPU time than BB1, DY, MGC and ABBmin2, and is as fast as SL and SDC. One important reason for this is that our method reuses the stepsize \(\tilde{\alpha }_{k-1}\) for r iterations, which reduces the computational cost.
5 Conclusions and discussions
For a uniform analysis, we considered a family of gradient methods whose stepsize is provided by \(\alpha _{k}\) in (9), which includes the Dai–Yang method (5) as a special case. It is proved that each method in the family zigzags between two directions in a subspace spanned by the two eigenvectors corresponding to the smallest and largest eigenvalues of the Hessian. In order to achieve the two dimensional quadratic termination of the family, we derived a short stepsize \(\tilde{\alpha }_{k}\) in (14) that converges to \(1/\lambda _{n}\). By using the long BB stepsize and \(\tilde{\alpha }_{k-1}\) in a new adaptive cyclic manner, we designed Algorithm 1 for unconstrained quadratic optimization. We proved that Algorithm 1 converges R-linearly at a rate of \(1-1/\kappa\). Our numerical results on minimizing quadratic functions indicate the efficiency of Algorithm 1 over other recent successful gradient methods.
By using the same arguments as those in the proof of Theorem 1, we find that the stepsizes \(\tilde{\beta }_{k}\) and \(\tilde{\gamma }_{k}\) determined by (37) and (38) achieve the two dimensional quadratic termination, respectively. For the n dimensional quadratic problem, by Theorem 1 in [5], we obtain
$$\begin{aligned} \lim _{k \rightarrow \infty } \tilde{\beta }_{k}=1/\lambda _{n} \end{aligned}$$
and
$$\begin{aligned} \lim _{k \rightarrow \infty } \tilde{\gamma }_{k}=1/\lambda _{n}. \end{aligned}$$
In addition, we have \(\lim _{k \rightarrow \infty } \hat{\alpha }_{k}=1/\lambda _{1}\). However, \(\hat{\alpha }_{k}\), \(\hat{\beta }_{k}\) and \(\hat{\gamma }_{k}\) would not be good approximations of \(1/\lambda _{1}\); see [36] for more details. It is worth noting that the stepsize proposed in [28] is a special case of \(\tilde{\beta }_{k}\) with \(\psi (A)=I\). Moreover, it is not difficult to prove that \(\tilde{\beta }_{k}\) and \(\tilde{\gamma }_{k}\) are short stepsizes in the sense that \(\tilde{\beta }_{k}<c_{2}/c_{3}\) and \(\tilde{\gamma }_{k}<c_{1}/c_{2}\). Hence, we can replace \(\tilde{\alpha }_{k-1}\) in Algorithm 1 by \(\tilde{\beta }_{k-1}\) or \(\tilde{\gamma }_{k-1}\), which leads to two variants of Algorithm 1. Preliminary experimental results show that the two variants are competitive with Algorithm 1. Furthermore, we can obtain the same convergence results as in Theorem 6.
The results of this paper show that the two dimensional quadratic termination property is useful for designing efficient gradient methods. It would be interesting to develop new algorithms for solving general unconstrained optimization problems based on such a property. We leave this as our future work.
6 Declarations
All data generated or analyzed during this study are included in this published article and are also available from the corresponding author on reasonable request.
References
Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
Forsythe, G.E.: On the asymptotic directions of the \(s\)-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 7 (2022)
Nocedal, J., Sartenaer, A., Zhu, C.: On the behavior of the gradient norm in the steepest descent method. Comput. Optim. Appl. 22(1), 5–35 (2002)
Zou, Q., Magoulès, F.: Fast gradient methods with alignment for symmetric linear systems without using Cauchy step. J. Comput. Appl. Math. 381, 113033 (2021)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Raydan, M.: On the Barzilai-Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Dai, Y.H., Liao, L.Z.: \(R\)-linear convergence of the Barzilai-Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
Fletcher, R.: On the Barzilai–Borwein method. In: Optimization and control with applications, pp. 235–256. Springer, New York (2005)
Raydan, M.: The Barzilai-Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Yuan, Y.X.: Step-sizes for the gradient method. AMS/IP Stud. Adv. Math. 42(2), 785–796 (2008)
Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
Birgin, E.G., Martínez, J.M., Raydan, M., et al.: Spectral projected gradient methods: review and perspectives. J. Stat. Softw. 60(3), 539–559 (2014)
Dai, Y.H., Huang, Y.K., Liu, X.W.: A family of spectral gradient methods for optimization. Comput. Optim. Appl. 74(1), 43–65 (2019)
Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton’s method. SIAM J. Numer. Anal. 23(4), 707–716 (1986)
Huang, Y.K., Liu, H.: Smoothing projected Barzilai-Borwein method for constrained non-lipschitz optimization. Comput. Optim. Appl. 65(3), 671–698 (2016)
Huang, Y.K., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai-Borwein method for nonnegative matrix factorization. Data Min. Knowl. Discov. 29(6), 1665–1684 (2015)
Jiang, B., Dai, Y.H.: Feasible Barzilai-Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Method Softw. 28(4), 756–784 (2013)
Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
Huang, Y.K., Dai, Y.H., Liu, X.W.: Equipping the Barzilai-Borwein method with the two dimensional quadratic termination property. SIAM J. Optim. 31(4), 3068–3096 (2021)
Sun, C., Liu, J.P.: New stepsizes for the gradient method. Optim. Lett. 14(7), 1943–1955 (2020)
Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comput. Optim. Appl. 33(1), 73–88 (2006)
Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6), 1645–1661 (1994)
Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299–312 (2008)
De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comput. Optim. Appl. 59(3), 541–563 (2014)
Davis, T.A., Hu, Y.: The university of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1–25 (2011)
Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai-Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30(2), 1300–1326 (2020)
Dai, Y.H., Fletcher, R.: Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)
Dai, Y.H., Hager, W.W., Schittkowski, K., Zhang, H.: The cyclic Barzilai-Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)
Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: On the acceleration of the Barzilai-Borwein method. Comput. Optim. Appl. 81(3), 717–740 (2022)
Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comput. Optim. Appl. 35(1), 69–86 (2006)
Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. Optim. Method Softw. 35(4), 681–705 (2020)
Li, X., Huang, Y.K.: A note on the \(R\)-linear convergence of nonmonotone gradient methods. arXiv preprint, arXiv:2207.05912 (2022)
Dai, Y.H.: Alternate step gradient method. Optimization 52(4–5), 395–415 (2003)
Huang, N.: On \(R\)-linear convergence analysis for a class of gradient methods. Comput. Optim. Appl. 81(1), 161–177 (2022)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 11701137) and Natural Science Foundation of Hebei Province (Grant No. A2021202010).
Cite this article
Li, X., Huang, Y. A gradient method exploiting the two dimensional quadratic termination property. Optim Lett 17, 1413–1434 (2023). https://doi.org/10.1007/s11590-022-01936-z