Abstract
We consider the asymptotic behavior of a family of gradient methods, which includes the steepest descent and minimal gradient methods as special instances. It is proved that each method in the family will asymptotically zigzag between two directions. Asymptotic convergence results for the objective value, gradient norm, and stepsize are presented as well. To accelerate the family of gradient methods, we further exploit spectral properties of stepsizes to break the zigzagging pattern. In particular, a new stepsize is derived by imposing finite termination on minimizing a two-dimensional strictly convex quadratic function. It is shown that, for a general quadratic function, the proposed stepsize asymptotically converges to the reciprocal of the largest eigenvalue of the Hessian. Furthermore, based on this spectral property, we propose a periodic gradient method by incorporating the Barzilai-Borwein method. Numerical comparisons with some recent successful gradient methods show that our new method is very promising.
1 Introduction
The gradient method is well known for solving the following unconstrained optimization problem, especially when the dimension n is large:
where \(f: \mathbb {R}^n \rightarrow \mathbb {R}\) is continuously differentiable. In particular, at the k-th iteration, gradient methods update the iterate by
where \(g_k=\nabla f(x_k)\) and \(\alpha _k>0\) is the stepsize determined by the method.
One of the simplest nontrivial nonlinear instances of (1) is the quadratic optimization problem
where \(b\in \mathbb {R}^n\) and \(A\in \mathbb {R}^{n\times n}\) is symmetric and positive definite. Solving (3) efficiently is usually a prerequisite for a method to be generalized to broader classes of problems. In addition, by Taylor's expansion, a general smooth function can be approximated by a quadratic function near a minimizer, so the local convergence behavior of gradient methods is often revealed by their behavior on (3). Hence, in this paper, we study the convergence behavior of gradient methods and propose efficient methods for solving (3).
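For concreteness, the following sketch (in Python, with illustrative names) implements the quadratic model and the generic gradient iteration (2), assuming the standard form \(f(x)=\frac{1}{2}x^TAx-b^Tx\) with gradient \(g(x)=Ax-b\):

```python
import numpy as np

def quadratic(A, b, x):
    """Objective f(x) = 0.5 * x^T A x - b^T x of the model problem (3)."""
    return 0.5 * x @ (A @ x) - b @ x

def gradient(A, b, x):
    """Gradient g(x) = A x - b."""
    return A @ x - b

def gradient_method(A, b, x0, stepsize, tol=1e-10, max_iter=10000):
    """Generic update (2): x_{k+1} = x_k - alpha_k * g_k, where the stepsize
    rule alpha_k = stepsize(A, g_k) is supplied by the particular method."""
    x = x0.copy()
    for _ in range(max_iter):
        g = gradient(A, b, x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - stepsize(A, g) * g
    return x
```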
The classic steepest descent (SD) method proposed by Cauchy [4] solves (3) by using the exact stepsize
Although \(\alpha _k^{SD}\) minimizes f along the steepest descent direction, the SD method often performs poorly in practice and has a linear convergence rate [1, 18] as
where \(f^*\) is the optimal function value of (3) and \(\kappa =\lambda _n/\lambda _1\) is the condition number of A, with \(\lambda _1\) and \(\lambda _n\) being the smallest and largest eigenvalues of A, respectively. Thus, if \(\kappa \) is large, the SD method may converge very slowly. In addition, Akaike [1] proved that the gradients asymptotically alternate between two directions in the subspace spanned by the two eigenvectors corresponding to \(\lambda _1\) and \(\lambda _n\). Consequently, the SD method often exhibits a zigzagging phenomenon near the solution. Forsythe [18] generalized Akaike's results to the so-called optimum s-gradient method, and Pronzato et al. [27] further generalized them to the so-called P-gradient method in Hilbert space. Recently, by employing Akaike's results, Nocedal et al. [26] presented some insights into the asymptotic behavior of the SD method in terms of function values, stepsizes, and gradient norms.
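For reference, a classical statement of this linear rate under the exact line search is the bound
\[
f(x_{k+1})-f^* \le \left(\frac{\kappa -1}{\kappa +1}\right)^2\bigl(f(x_k)-f^*\bigr),
\]
which follows from the Kantorovich inequality.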
Contrary to the SD method, the minimal gradient (MG) method (see Dai and Yuan [10]) computes its stepsize by minimizing the gradient norm,
where \(\Vert \cdot \Vert \) is the 2-norm. It is widely accepted that the MG method can also perform poorly and has asymptotic behavior similar to that of the SD method, i.e., it will asymptotically zigzag in a two-dimensional subspace. Zhou et al. [33] provided some interesting analyses of \(\alpha _k^{MG}\) for minimizing two-dimensional quadratics. However, rigorous asymptotic convergence results for the MG method on general quadratic functions are very limited in the literature.
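A minimal sketch of these two exact stepsizes, using the standard closed forms \(\alpha _k^{SD}=\frac{g_k^Tg_k}{g_k^TAg_k}\) and \(\alpha _k^{MG}=\frac{g_k^TAg_k}{g_k^TA^2g_k}\), is given below; either function can be passed as the `stepsize` argument of the `gradient_method` sketch above.

```python
def sd_stepsize(A, g):
    """Cauchy's exact stepsize: minimizes f along -g."""
    Ag = A @ g
    return (g @ g) / (g @ Ag)

def mg_stepsize(A, g):
    """Minimal gradient stepsize: minimizes ||g_{k+1}|| along -g."""
    Ag = A @ g
    return (g @ Ag) / (Ag @ Ag)
```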
To avoid the zigzagging pattern, it is useful to determine the stepsize without performing an exact line search, since the exact stepsize yields a new gradient perpendicular to the current one. Barzilai and Borwein [2] proposed the following two novel stepsizes:
where \(s_{k-1}=x_k-x_{k-1}\) and \(y_{k-1}=g_k-g_{k-1}\). The Barzilai-Borwein (BB) method (7) performs quite well in practice, even though it generates a nonmonotone sequence of objective values. Due to its simplicity and efficiency, the BB method has been widely studied [6,7,8, 17, 28] and extended to general problems and various applications [3, 22,23,24,25, 29]. Another line of research for breaking the zigzagging pattern and accelerating convergence is to occasionally apply short stepsizes that approximate \(1/\lambda _n\) in order to eliminate the corresponding component of the gradient. One seminal work is due to Yuan [31, 32], who derived the following stepsize:
Dai and Yuan [11] further suggested a new gradient method with
The Dai-Yuan (DY) method (9) is a monotone method and appears very competitive with the nonmonotone BB method. Recently, by employing the results in [1] and [26], De Asmundis et al. [12] showed that the stepsize \(\alpha _k^{Y}\) converges to \(1/\lambda _n\) when the SD method is applied to problem (3). This spectral property is the key to breaking the zigzagging pattern.
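Returning to the BB stepsizes in (7), they can be computed directly from the differences \(s_{k-1}\) and \(y_{k-1}\) defined above; a brief sketch:

```python
def bb_stepsizes(s, y):
    """BB stepsizes from (7), with s = x_k - x_{k-1} and y = g_k - g_{k-1}:
    BB1 = s^T s / (s^T y) and BB2 = s^T y / (y^T y)."""
    return (s @ s) / (s @ y), (s @ y) / (y @ y)
```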
In [9], Dai and Yang developed the asymptotic optimal gradient (AOPT) method whose stepsize is given by
Unlike the DY method, the AOPT method uses only one stepsize. In addition, it is shown in [9] that \(\alpha _k^{AOPT}\) asymptotically converges to \(\frac{2}{\lambda _1+\lambda _n}\), which is in some sense an optimal stepsize since it minimizes \(\Vert I-\alpha A\Vert \) over \(\alpha \) [9, 16]. However, the AOPT method also asymptotically alternates between two directions. To accelerate the AOPT method, Huang et al. [21] derived a new stepsize that converges to \(1/\lambda _n\) along the AOPT iterates and further suggested a gradient method that exploits spectral properties of the stepsizes. For recent developments on exploiting spectral properties to accelerate gradient methods, see [12,13,14, 20, 21] and references therein.
In this paper, we analyze the asymptotic behavior of gradient methods and develop techniques for breaking the zigzagging pattern. For a unified analysis, we consider the following stepsize
where \(\varPsi \) is a real analytic function on \([\lambda _1,\lambda _n]\) that can be expressed by the Laurent series
such that \(0<\sum _{k=-\infty }^\infty c_kz^k<+\infty \) for all \(z\in [\lambda _1,\lambda _n]\). Clearly, (11) defines a family of stepsizes and hence a family of gradient methods. When \(\varPsi (A)=A^u\) for some nonnegative integer u, we obtain the following stepsize
The stepsizes \(\alpha _k^{SD}\) and \(\alpha _k^{MG}\) correspond to the cases \(u=0\) and \(u=1\), respectively.
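Assuming the form \(\alpha _k=\frac{g_k^T\varPsi (A)g_k}{g_k^T\varPsi (A)Ag_k}\) for (11), which is consistent with the SD and MG special cases above, a sketch of the family and of its monomial sub-family \(\varPsi (A)=A^u\) is:

```python
import numpy as np

def family_stepsize(A, g, psi):
    """Stepsize of the family: g^T Psi(A) g / (g^T Psi(A) A g).
    psi maps the matrix A to Psi(A); Psi(A) = I gives SD, Psi(A) = A gives MG."""
    Pg = psi(A) @ g                      # Psi(A) g_k
    return (g @ Pg) / (Pg @ (A @ g))

def monomial_stepsize(A, g, u):
    """Monomial sub-family Psi(A) = A^u: g^T A^u g / (g^T A^{u+1} g)."""
    Aug = np.linalg.matrix_power(A, u) @ g
    return (g @ Aug) / (Aug @ (A @ g))
```

For instance, `monomial_stepsize(A, g, 0)` reproduces `sd_stepsize` and `monomial_stepsize(A, g, 1)` reproduces `mg_stepsize` from the earlier sketch.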
We present a theoretical analysis of the asymptotic convergence of the family of gradient methods whose stepsize can be written in the form (11), which justifies the zigzagging behavior of all these methods, including the SD and MG methods. In particular, we show that each method in the family (11) will asymptotically alternate between two directions associated with the two eigenvectors corresponding to \(\lambda _1\) and \(\lambda _n\). Moreover, we analyze the asymptotic behaviors of the objective value, gradient norm, and stepsize. It is shown that, when \(\varPsi (A)\ne I\), the two sequences \(\Big \{\frac{\varDelta _{2k+1}}{\varDelta _{2k}} \Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}} \Big \}\) may converge at different speeds, while the odd and even subsequences \(\Big \{\frac{\varDelta _{2k+3}}{\varDelta _{2k+1}}\Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k}}\Big \}\) converge at the same rate, where \(\varDelta _k = f(x_k) - f^* \). A similar property holds for the gradient norm sequence. In addition, each method in the family (11) has the same worst asymptotic convergence rate.
To accelerate the gradient methods (11), we investigate techniques for breaking the zigzagging pattern. We derive a new stepsize \(\tilde{\alpha }_k\) based on finite termination for minimizing two-dimensional strictly convex quadratic functions. For the n-dimensional case, we prove that \(\tilde{\alpha }_k\) converges to \(1/\lambda _n\) when the gradient methods (11) are applied to problem (3). Furthermore, based on this spectral property, we propose a periodic gradient method, which, in a periodic mode, alternately uses the BB stepsize, the stepsize (11), and our new stepsize \(\tilde{\alpha }_k\). Numerical comparisons of the proposed method with the BB [2], DY [11], ABBmin2 [19], and SDC [12] methods and Alg. 1 in [30] show that the new gradient method is very promising. Our theoretical results also significantly improve and generalize those in [1] and [26], where only the SD method (i.e., \(\varPsi (A)=I\)) is considered. We point out that [27] does not analyze the asymptotic behaviors of the objective value, gradient norm, and stepsize, although (11) is similar to their P-gradient method. Moreover, we develop techniques for accelerating these zigzagging methods with a simpler analysis. Notice that \(\alpha _k^{AOPT}\) cannot be written in the form (11); thus our results are not applicable to the AOPT method. On the other hand, the analysis of the AOPT method presented in [9] cannot be applied directly to the family of methods (11).
The paper is organized as follows. In Sect. 2, we analyze the asymptotic behaviors of the family of gradient methods (11). In Sect. 3, we accelerate the gradient methods (11) by developing techniques to break the zigzagging pattern and propose a new periodic gradient method. Numerical experiments are presented in Sect. 4. Finally, some conclusions and discussions are made in Sect. 5.
2 Asymptotic Behavior of the Family (11)
In this section, we present a unified analysis of the asymptotic behavior of the family of gradient methods (11) for general n-dimensional strictly convex quadratics.
Let \(\{\lambda _1,\lambda _2,\cdots ,\lambda _n\}\) be the eigenvalues of A, and \(\{\xi _1,\xi _2,\ldots ,\xi _n\}\) the associated orthonormal eigenvectors. Note that the gradient method is invariant under translations and rotations when applied to a quadratic function. As pointed out by Fletcher [17], gradient components corresponding to a multiple eigenvalue can be combined. Thus, for the theoretical analysis, we assume without loss of generality that
Denote the components of \(g_k\) along the eigenvectors \(\xi _i\) by \(\mu _k^{(i)}\), \(i=1,\ldots ,n\), i.e.,
The above decomposition of gradient \(g_k\) together with the update rule (2) gives that
where
Defining the vector \(q_k=\left( q_k^{(i)}\right) \) with
and
we have from (16), (17) and (18) that
In addition, by the definition of \(q_k\), we know that \(q_k^{(i)}\ge 0\) for all i and
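To make the preceding derivation concrete, recall the standard relations behind it: writing \(g_k=\sum _{i=1}^n\mu _k^{(i)}\xi _i\) and combining the update rule (2) with \(g_{k+1}=g_k-\alpha _kAg_k\), each gradient component obeys
\[
\mu _{k+1}^{(i)}=(1-\alpha _k\lambda _i)\,\mu _k^{(i)},\qquad i=1,\ldots ,n,
\]
so a component vanishes permanently once some stepsize equals \(1/\lambda _i\); this observation underlies the acceleration strategy of Sect. 3.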
Before establishing the asymptotic convergence of the family of gradient methods (11), we first give some lemmas on the properties of the sequence \(\{q_k\}\).
Lemma 1
Suppose \(p\in \mathbb {R}^n\) satisfies (i) \(p^{(i)}\ge 0\) for all \(i=1,2,\ldots ,n\); (ii) there exist at least two indices i with \(p^{(i)}>0\); and (iii) \(\sum _{i=1}^np^{(i)}=1\). Define \(T: \mathbb {R}^n\rightarrow \mathbb {R}^n\) to be the following transformation:
where
Then we have
where
In addition, (22) holds with equality if and only if there are two indices, say \(i_1\) and \(i_2\), such that \(p^{(i)}=0\) for all \(i\notin \{i_1,i_2\}\) and
Proof
It follows from the definition of Tp that
Let us define two vectors \(w = (w_i) \in \mathbb {R}^n\) and \(z = (z_i) \in \mathbb {R}^n\) by
and
Then, we have from the Cauchy-Schwarz inequality that
Using the definition of \(\gamma (p)\), we can obtain that
which together with (26) gives
Then, the inequality (22) follows immediately.
The equality in (26) holds if and only if
for some nonzero scalar C. Clearly, (29) holds when \(p^{(i)}=0\). Suppose that there exist two indices \(i_1\) and \(i_2\) such that \(p^{(i_1)},p^{(i_2)}>0\). It follows from (29) that
So, by the assumption (13), we have
which, again by assumption (13), implies that (29) holds if and only if p has only two nonzero components and (24) holds. \(\square \)
Lemma 2
Let \(p_*\in \mathbb {R}^n\) satisfy the conditions of Lemma 1 and T be the transformation (20). If \(p_*\) has only two nonzero components \(p_*^{(i_1)}\) and \(p_*^{(i_2)}\), we have
and
where the function \(\gamma \) is defined in (21). Moreover, \(p_*=Tp_*\) if and only if
Proof
By the definition of \(\gamma (p)\), we have
which indicates that
and
Then, it follows from the definition of transformation T that
This gives (30). Equation (31) can be proved similarly. By (30) and (31), we have
The result of \((T^2p_*)^{(i_2)}\) follows similarly. This proves (32).
Again by (30), (31) and the definition of function \(\gamma \) in (21), we have
Then, the equality (33) follows from (35) and (36). For (34), let
Rearranging terms and using \(p_*^{(i_1)}+p_*^{(i_2)}=1\), we have
which implies that
This together with the fact \(p_*^{(i_1)}+p_*^{(i_2)}=1\) yields (34). \(\square \)
Lemma 3
Let \(p\in \mathbb {R}^n\) satisfy the conditions of Lemma 1 and T be the transformation (20). Then, there exists a \(p_*\) satisfying
where \(p_*\) and \(Tp_*\) have only two nonzero components satisfying
for some \(i_1,i_2\in \{1,\ldots ,n\}\). Hence, (30), (31), (32) and (33) hold.
Proof
Let \(p_0=T^0p=p\) and \(p_{k}=T{p_{k-1}}=T^{k}{p_0}\). Obviously, for all \(k\ge 0\), \(p_{k}\) satisfies (i) and (iii) of Lemma 1. Let \(i_{\min } = \min \{i \in \mathcal {N}: p_0^{(i)}>0\}\) and \(i_{\max } = \max \{i \in \mathcal {N}: p_0^{(i)}>0\}\), where \(\mathcal {N} = \{1, \ldots , n\}\). From the definition of \(\gamma \), we know \(\lambda _{i_{\min }}< \gamma (p)<\lambda _{i_{\max }}\). Thus, by the definition of T, we have \(p_1^{({i_{\min }})}>0\) and \(p_1^{({i_{\max }})}>0\). Then, by induction, for all \(k\ge 0\), \(p_{k}\) satisfies (ii) of Lemma 1. So, by Lemma 1, \(\{\varTheta (p_k)\}\) is a monotonically increasing sequence. Since \(\lambda _1\le \gamma (p)\le \lambda _n\), we have \((\lambda _i-\gamma (p))^2\le (\lambda _n-\lambda _1)^2\). Hence, we have from the definition of \(\varTheta \) that \(\varTheta (p_k)\le (\lambda _n-\lambda _1)^2 \). Thus, \(\{\varTheta (p_k)\}\) is convergent. Let \(\varTheta _*=\lim _{k\rightarrow \infty }\varTheta (p_k) >0\).
Denote the set of all limit points of \(\{p_k\}\) by \(P_*\) with cardinality \(|P_*|\). Since \(\{p_k\}\) is bounded, \(|P_*|\ge 1\). For any subsequence \(\{p_{k_j}\}\) converging to some \(p_*\in P_*\), we have
by the continuity of \(\varTheta \) and T. Noticing that \(p_{k_j+1}=Tp_{k_j}\), we have \( \varTheta _*=\varTheta (p_*)=\varTheta (Tp_*)\).
Since \(p_{k}\) satisfies (i)–(iii) of Lemma 1 for all \(k\ge 0\), \(p_*\) must satisfy (i) and (iii). If \(p_*\) has only one positive component, we have \(\varTheta (p_*)=0\) which contradicts \(\varTheta (p_*) = \varTheta _* > 0 \). Hence, by Lemmas 1, 2 and \(\varTheta (p_*)=\varTheta (Tp_*)\), \(p_*\) has only two nonzero components, say \(p_*^{(i_1)}\) and \(p_*^{(i_2)}\), and their values are uniquely determined by the indices \(i_1\), \(i_2\) and the eigenvalues \(\lambda _{i_1}\) and \(\lambda _{i_2}\). This implies \(|P_*| < \infty \). Furthermore, by Lemma 2, for any \(p_*\in P_*\), \(Tp_*\) is given by (30) and (31), and \(Tp_*\in P_*\).
We now show that \(|P_*|\le 2\) by way of contradiction. Suppose \(|P_*| \ge 3 \). For any \(p_*\in P_*\) and \(Tp_*\in P_*\), let \(\delta _1\) and \(\delta _2\) be the distances from \(p_*\) to \(P_*\setminus \{p_*\}\) and from \(Tp_*\) to \(P_*\setminus \{Tp_*\}\), respectively. Since \( 3 \le |P_*| < \infty \), we have \(\delta _1>0\), \(\delta _2>0\), and there exists an infinite subsequence \(\{p_{k_j}\}\) such that
but \(p_{k_j+2}\notin \mathcal {B}\left( p_*,\frac{1}{2}\delta \right) \cup \mathcal {B}\left( Tp_*,\frac{1}{2}\delta \right) \), where \(\delta =\min \{\delta _1,\delta _2\}\) and \(\mathcal {B}(p_*,r)=\{p: \Vert p-p_*\Vert \le r\}\). However, by (32) we have \(T^2p_*=p_*\). Hence, by continuity of T,
which contradicts the choice of \(p_{k_j+2}\notin \mathcal {B}\left( p_*,\frac{1}{2}\delta \right) \). Thus, \(\{p_k\}\) has at most two limit points \(p_*\) and \(Tp_*\), and both have only two nonzero components.
Now, we assume that \(p_*\) is a limit point of \(\{p_{2k}\}\). Since \(T^2p_*=p_*\), all subsequences of \(\{p_{2k}\}\) have the same limit point, i.e., \(p_{2k}=T^{2k}p\rightarrow p_*\). Similarly, we have \(T^{2k+1}p\rightarrow Tp_*\). Then, (38) and (39) follow directly from the analysis. \(\square \)
Based on the above analysis, we can show that each gradient method in (11) will asymptotically reduce its search to the two-dimensional subspace spanned by the two eigenvectors \(\xi _1\) and \(\xi _n\).
Theorem 1
Assume that the starting point \(x_0\) has the property that
Let \(\{x_k\}\) be the iterates generated by applying a method in (11) to solve problem (3). Then
and
where c is a nonzero constant that can be determined by the limit in (48).
Proof
By the assumption (40), we know that \(q_0\) satisfies (i)–(iii) of Lemma 1. Notice that \(q_k = T^k q_0\). Then, by Lemma 3, there exists a \(p_*\) such that the sequences \(\{q_{2k}\}\) and \(\{q_{2k+1}\}\) converge to \(p_*\) and \(Tp_*\), respectively, which have only two nonzero components satisfying (38), (39) for some \(i_1,i_2\in \{1,\ldots ,n\}\), and (32) holds. Hence, if \(1< i_1<i_2<n\), we have
and
In addition, since \(q_0^{(1)} > 0 \) and \(q_0^{(n)} > 0\) by (40), we can see from the proof of Lemma 3 that \(q_k^{(1)}>0\), \(q_k^{(n)}>0\) for all \(k\ge 0\). Thus, we have
where the inequality is due to \(\lambda _{i_1}< \gamma (p_*),\gamma (Tp_*) < \lambda _{i_2}\) and \(\lambda _n > \lambda _{i_2}\). So, \(q_{2k}^{(n)}\rightarrow +\infty \), which contradicts (43). Then, we must have \(i_2=n\). In a similar way, we can show that \(i_1=1\). Finally, the equalities in (41) and (42) follow directly from Lemma 2. \(\square \)
In the following, we refer to c as the constant defined in Theorem 1. By Theorem 1 we can directly obtain the asymptotic behavior of the stepsize.
Corollary 1
Under the conditions of Theorem 1, we have
and
where \(\alpha _k\) is defined in (11) and \(\kappa = \lambda _n/\lambda _1\) is the condition number of A. Moreover,
The next corollary interprets the constant c. A special result for the case \(\varPsi (A)=I\) (i.e., the SD method) can be found in Lemma 3.4 of [26].
Corollary 2
Under the conditions of Theorem 1, we have
Proof
It follows from Theorem 1 that
Note that \(1/\lambda _n<\alpha _k<1/\lambda _1\) by assumption (40). Moreover, we have by (16) that
and
Thus, the sequence \(\Big \{\frac{\mu _{2k}^{(n)}}{\mu _{2k}^{(1)}}\Big \}\), and similarly \(\Big \{\frac{\mu _{2k+1}^{(1)}}{\mu _{2k+1}^{(n)}}\Big \}\), does not change sign. Hence, without loss of generality, we can assume by (49) that
Then, by (16), (45) and (50), we have
which gives (48). \(\square \)
We have the following results on the asymptotic convergence of the function value.
Theorem 2
Under the conditions of Theorem 1, we have
where
In addition, if \(\varPsi (\lambda _{n})=\varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\), then \(R_f^1=R_f^2\).
Proof
Let \(\epsilon _k=x_k-x^*\). Since \(g_k=A\epsilon _k\), by (14), we have
By Theorem 1, we only need to consider the case \(\mu _k^{(i)}=0\), \(i=2,\ldots ,n-1\), that is,
Thus,
Since
by the definition of \(\epsilon _k\) and the update rule (2), we further have that
Hence, we obtain
Combining (54) with (55) yields that
which gives (51) by substituting the limits of \((\mu _k^{(1)})^2\) and \(\left( \mu _k^{(n)}\right) ^2\) in Theorem 1.
Notice \(\kappa >1\) by our assumption. So, \(R_f^1=R_f^2\) is equivalent to
which by rearranging terms gives
Hence, \(R_f^1=R_f^2\) holds if \(\varPsi (\lambda _{n})=\varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\). \(\square \)
Remark 1
Theorem 2 indicates that, when \(\varPsi (A)=I\) (i.e., the SD method), the two sequences \(\Big \{\frac{\varDelta _{2k+1}}{\varDelta _{2k}} \Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}} \Big \}\) converge at the same speed, where \(\varDelta _k = f(x_k) - f^*\). Otherwise, the two sequences may converge at different rates.
To illustrate the results in Theorem 2, we applied the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to an instance of (3), where the initial point is the vector of all ones and the matrix A is diagonal with
and \(b=0\). Figure 1 clearly shows the difference between \(R_f^1\) and \(R_f^2\).
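A sketch of this experiment is given below; the eigenvalue spectrum used here is only an assumed stand-in for the diagonal in (56), so the numbers differ from Fig. 1, but the gap between the odd and even ratios shows up in the same way.

```python
import numpy as np

n = 100
lam = np.linspace(1.0, 1e3, n)        # assumed eigenvalues; the true diagonal is (56)
A, b = np.diag(lam), np.zeros(n)
x = np.ones(n)                         # initial point: vector of all ones
f = lambda x: 0.5 * x @ (A @ x)        # f^* = 0 since b = 0

ratios, f_prev = [], f(x)
for k in range(200):
    g = A @ x - b
    Ag = A @ g
    x = x - (g @ Ag) / (Ag @ Ag) * g   # MG stepsize, i.e. Psi(A) = A in (11)
    f_new = f(x)
    ratios.append(f_new / f_prev)      # Delta_{k+1} / Delta_k
    f_prev = f_new

# Asymptotically, ratios[2k] and ratios[2k+1] approach the two limits
# R_f^1 and R_f^2 of Theorem 2.
print(ratios[-2], ratios[-1])
```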
The next theorem shows the asymptotic convergence of the gradient norm.
Theorem 3
Under the conditions of Theorem 1, the following limits hold,
where
In addition, if \(\varPsi (\lambda _{n})=\kappa \varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\), then \(R_g^1=R_g^2\).
Proof
Using the same arguments as in Theorem 2, we have
and
which give that
Thus, (57) follows by substituting the limits of \((\mu _k^{(1)})^2\) and \((\mu _k^{(n)})^2\) in Theorem 1.
Notice \(\kappa >1\) by our assumption. So, \(R_g^1=R_g^2\) is equivalent to
which by rearranging terms gives
Hence, \(R_g^1=R_g^2\) holds if \(\varPsi (\lambda _{n})=\kappa \varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\). \(\square \)
Remark 2
Theorem 3 indicates that the two sequences \(\Big \{\frac{\Vert g_{2k+1}\Vert ^2}{\Vert g_{2k}\Vert ^2}\Big \}\) and \(\Big \{\frac{\Vert g_{2k+2}\Vert ^2}{\Vert g_{2k+1}\Vert ^2}\Big \}\) generated by the method (11) with \(\varPsi (A)=A\) (i.e., the MG method) converge at the same rate. Otherwise, the two sequences may converge at different rates.
By Theorems 2 and 3, we can obtain the following corollary.
Corollary 3
Under the conditions of Theorem 1, we have
In addition,
Remark 3
Corollary 3 shows that the odd and even subsequences of objective values and gradient norms converge at the same rate. Moreover, we have
where \(t= c^2 \varPsi (\lambda _{n})/\varPsi (\lambda _{1})\). Notice that the right-hand side of (63) depends only on \(\kappa \), which implies that the odd and even subsequences generated by all the gradient methods (11) have the same worst asymptotic rate, independent of \(\varPsi \).
Now, as in [26], we define the minimum deviation
where
Clearly, \(\sigma \in (0,1)\). We now close this section by deriving a bound on the constant c defined in Theorem 1. The following theorem generalizes the results in [1] and [26], where only the case \(\varPsi (A)=I\) (i.e., the SD method) is considered.
Theorem 4
Under the conditions of Theorem 1, and assuming that \(\mathcal {I}\) is nonempty, we have
where
Proof
Let \(p=q_0\). By the definition of T, we have that
It follows from Theorem 1 and Lemma 3 that
By the continuity of T and (37) in Lemma 3, we always have that
which together with (67) and (68) implies that
where \(p_*\) is the same vector as in Lemma 3. Clearly, (69) also holds for \(i=1\). As for \(i=n\), it follows from (33) in Lemma 2 and Theorem 1 that
which yields that
Thus, (69) holds for \(i=1,\ldots ,n\). Hence, we have
where \(\delta =\frac{\lambda _1+\lambda _n}{2}\). By (70) and (71), we obtain
which implies that
By Lemma 2 and Theorem 1, we have that
Substituting \(\gamma (p_*)\) into (72), we obtain
which gives
where \(\sigma _i=\frac{2 \lambda _i-(\lambda _1+\lambda _n)}{\lambda _n-\lambda _1}\). Note that (73) holds for all \(i\in \mathcal {I}\). Thus, we have
which implies (65). This completes the proof. \(\square \)
3 Techniques for Breaking the Zigzagging Pattern
As shown in the previous section, all the gradient methods (11) asymptotically confine their searches to the two-dimensional subspace spanned by \(\xi _1\) and \(\xi _n\). By (16), if either \(\mu _k^{(1)}\) or \(\mu _k^{(n)}\) equals zero, the corresponding component vanishes at all subsequent iterations. Hence, in order to break the undesired zigzagging pattern, a good strategy is to employ a stepsize approximating \(1/\lambda _1\) or \(1/\lambda _n\). In this section, we derive a new stepsize converging to \(1/\lambda _n\) and propose a periodic gradient method that uses this new stepsize.
3.1 A New Stepsize
Our new stepsize is derived by imposing finite termination on the minimization of a two-dimensional strictly convex quadratic function; see [31] for the case \(\varPsi (A)=I\) (i.e., the SD method).
To proceed, consider the quadratic optimization (3) with \(n=2\). Suppose \(x_3\) is the minimizer after the following 3 iterations:
where \(g_i \ne 0\), \(i=0,1,2\), \(\alpha _0\) and \(\alpha _2\) are stepsizes given by (11), and \(\alpha _1\) is a stepsize to be derived.
By the stepsize definition (11), we have
Hence, for any given \(r \in \mathbb {R}\), \(\Big \{\frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }, \frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert }\Big \}\) forms an orthonormal basis of \(\mathbb {R}^2\). Then, for all \(x\in \mathbb {R}^2\), \(x-x_1\) can be expressed by this basis. Suppose there exist \(t,l\in \mathbb {R}\) such that
We expand f(x) at \(x_1\) and obtain
where
and
where the last equality is due to \(g_0 ^T \varPsi (A)Ag_1=-\frac{g_1 ^T \varPsi (A)g_1}{\alpha _0}\), which follows from \(g_1=g_0-\alpha _0Ag_0\) and (75). Note that \(B ^T B = B B ^T = I\). Then H is positive definite and has the same eigenvalues as A. Consequently, the minimizer of \(\varphi \) is given by \(\begin{pmatrix} t^* \\ l^* \\ \end{pmatrix} =-H^{-1}\vartheta \).
When \(x_3\) is the solution, we must have
Due to the fact that \(x_3 = x_2 - \alpha _2 g_2\), we know \(x_3-x_2\) is parallel to \(g_2\). It follows from (79) and \(x_2 = x_1 - \alpha _1 g_1\) that
which, by noting \(g_2 = g_1 - \alpha _1 Ag_1\) and expressing the two vectors in (80) by the above basis, is equivalent to
Denote the components of \(\vartheta \) by \(\vartheta _i\) and the components of H by \(H_{ij}\), \(i,j=1,2\). By (81), the two vectors
are parallel, where \(\zeta =\text {det}(H) = \text {det}(A) >0\). It follows that
which gives
where
On the other hand, if (82) holds, we have (80) holds, which by (77), \(H^{-1} = B A^{-1} B ^T\) and \(B ^T B=I\) implies that
is parallel to \(g_2\). Hence, \(g_2\) is an eigenvector of A, i.e. \(Ag_2 = \lambda g_2\) for some \(\lambda > 0\), since \(g_2 \ne 0\). So, by (11), we get \(\alpha _2 = \varPsi (\lambda ) g_2 ^T g_2 /(\lambda \varPsi (\lambda ) g_2 ^T g_2 ) = 1/\lambda \). Therefore, \(g_3 = g_2 - \alpha _2 A g_2 = g_2 - \alpha _2 \lambda g_2 = 0\), which implies \(x_3\) is the solution. Thus, (82) guarantees \(x_3\) is the minimizer.
Based on the above analysis, to ensure \(x_3\) is the minimizer, we only need to choose \(\alpha _1\) satisfying
whose two positive roots are
These two roots are \(1/\lambda _1\) and \(1/\lambda _2\), where \(0< \lambda _1 < \lambda _2\) are the two eigenvalues of A. For numerical reasons (see the next subsection), we choose \(\alpha _1\) to be the smaller root \(1/\lambda _2\), which can be calculated as
For the general case, since \(g_{k-1} ^T \varPsi (A)g_{k}=0\), we propose our new stepsize at the k-th iteration as
where \(H_{ij}^k\) denotes the (i, j)-th component of \(H^k\):
and \(\alpha _{k-1}\) is given by (11). Clearly, \(\alpha _k^Y\) in (8) can be obtained by setting \(\varPsi (A)=I\) in (86). In addition, by (85) we have that
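For the special case \(\varPsi (A)=I\), \(\tilde{\alpha }_k\) is exactly Yuan's steplength \(\alpha _k^Y\), which can be assembled from two consecutive SD stepsizes and gradient norms. The sketch below uses the closed form of \(\alpha _k^Y\) as stated in [31] (this explicit form is quoted from that reference, not from the displays above):

```python
import numpy as np

def yuan_stepsize(a_prev, a_curr, gnorm_prev, gnorm_curr):
    """Yuan's steplength [31]: the new stepsize with Psi(A) = I, built from
    two consecutive exact SD stepsizes a_prev, a_curr and the corresponding
    gradient norms gnorm_prev, gnorm_curr."""
    p, c = 1.0 / a_prev, 1.0 / a_curr
    root = np.sqrt((p - c) ** 2 + 4.0 * gnorm_curr ** 2 / (a_prev * gnorm_prev) ** 2)
    return 2.0 / (root + p + c)
```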
3.2 Spectral Property of the New Stepsize
The next theorem shows that the stepsize \(\tilde{\alpha }_k\) enjoys a desirable spectral property.
Theorem 5
Suppose that the conditions of Theorem 1 hold. Let \(\{x_k\}\) be the iterates generated by any gradient method in (11) applied to problem (3). Then
Proof
It follows from (41) and (42) of Theorem 1 that
and
which give
and
Then, by the definition of \(\alpha _k\), we have
which together with (42) in Theorem 1 and (46) in Corollary 1 yields that
Then, from the above equality and (90), we obtain that
Combining (89) and (91), we have that
This completes the proof. \(\square \)
Remark 4
When \(r=1\), we have from (87) that \(\tilde{\alpha }_k \le 1/H_{22}^k = \alpha _k^{SD}\). Hence, using the stepsize \(\tilde{\alpha }_k\) gives a monotone gradient method. Theorem 5 indicates that the general \(\tilde{\alpha }_k\) has the asymptotic spectral property (88) and hence is asymptotically smaller than \(\alpha _k^{SD}\), independent of r. However, a proper choice of r facilitates the calculation of \(\tilde{\alpha }_k\); this will become clearer in the next section.
Using similar arguments, we can also show that the larger stepsize derived in Sect. 3.1 converges to \(1/\lambda _1\).
Theorem 6
Let
Under the conditions of Theorem 5, we have
To present an intuitive illustration of the asymptotic behaviors of \(\tilde{\alpha }_k\) and \(\bar{\alpha }_k\), we applied the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to minimize the quadratic function (3) with
where \(a_1=1\), \(a_n=n\), and \(a_i\) is randomly generated between 1 and n for \(i=2,\ldots ,n-1\). From Fig. 2, we can see that \(\tilde{\alpha }_k\) approximates \(1/\lambda _n\) with satisfactory accuracy within a few iterations. However, \(\bar{\alpha }_{k}\) converges to \(1/\lambda _1\) even more slowly than the gradient norm decreases. This, to some extent, explains why we prefer \(\tilde{\alpha }_k\) to be the short stepsize rather than the long one \(\bar{\alpha }_{k}\).
3.3 A Periodic Gradient Method
A method that alternately uses \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\) to minimize a two-dimensional quadratic function will monotonically decrease the objective value and terminate in 3 iterations. However, for minimizing a general n-dimensional quadratic function, this alternating scheme may not be efficient for the purpose of eliminating the component \(\mu _k^{(n)}\). One possible reason is that, as shown in Fig. 2, it takes tens of iterations for \(\tilde{\alpha }_k\) to approximate \(1/\lambda _n\) with satisfactory accuracy. In what follows, by incorporating the BB method, we develop an efficient periodic gradient method using \(\tilde{\alpha }_k\).
Figure 3 compares the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to a method that first takes 20 BB2 steps and then switches to MG steps on problem (56). We can see that, by taking some BB2 steps, the modified MG method is accelerated and the stepsize \(\tilde{\alpha }_k\) approximates \(1/\lambda _n\) with better accuracy. Thus, our method runs some BB steps first. Next, we investigate the effect of reusing a short stepsize on the performance of the gradient method (11). Suppose that we have a good approximation of \(1/\lambda _n\), say \(\alpha =\frac{1}{\lambda _n+10^{-6}}\). We compare the MG method with two variants obtained by applying (i) \(\alpha _0=\alpha \) or (ii) \(\alpha _0=\ldots =\alpha _9=\alpha \) before using the MG stepsize. Figure 4 shows that reusing \(\alpha \) accelerates the MG method. Hence, we prefer to reuse \(\tilde{\alpha }_k\) for several consecutive steps once \(\tilde{\alpha }_k\) is a good approximation of \(1/\lambda _n\). Finally, our new method is summarized in Algorithm 1, which periodically applies the BB stepsize, \(\alpha _k\) in (11), and \(\tilde{\alpha }_k\). The R-linear global convergence of Algorithm 1 for solving (3) can be established by showing that it satisfies the property in [5]; see the proof of Theorem 3 in [7], for example.
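The sketch below illustrates one plausible realization of this periodic scheme for the \(\varPsi (A)=I\) variant with the BB1 stepsize: a cycle of \(K_b\) BB1 steps, \(K_m\) SD steps, and \(K_s\) reuses of the most recent Yuan-type short stepsize. The exact switching rule and safeguards of Algorithm 1 are not reproduced here; the short stepsize uses the closed form of \(\alpha _k^Y\) quoted above in the role of \(\tilde{\alpha }_k\).

```python
import numpy as np

def periodic_gradient(A, b, x0, Kb=60, Km=60, Ks=40, tol=1e-9, max_iter=20000):
    """Sketch of one periodic cycle: Kb BB1 steps, Km SD steps (stepsize (11)
    with Psi(A) = I), then Ks reuses of the most recent Yuan-type short step."""
    x = np.asarray(x0, dtype=float).copy()
    g = A @ x - b
    x_old = g_old = None
    a_sd_prev = gn_prev = short = None
    phase, count = 'bb', 0
    for _ in range(max_iter):
        gn = np.linalg.norm(g)
        if gn <= tol:
            break
        a_sd = (g @ g) / (g @ (A @ g))          # exact SD stepsize at x_k
        if phase == 'bb' and g_old is not None:
            s, y = x - x_old, g - g_old
            alpha = (s @ s) / (s @ y)           # BB1 stepsize
        elif phase == 'short' and short is not None:
            alpha = short                       # reuse the short stepsize
        else:
            alpha = a_sd                        # SD step, i.e. stepsize (11) with Psi = I
            if phase == 'sd' and count >= 1:    # previous step was also an SD step
                p, c = 1.0 / a_sd_prev, 1.0 / a_sd
                short = 2.0 / (np.sqrt((p - c) ** 2 + 4.0 * gn ** 2 /
                                       (a_sd_prev * gn_prev) ** 2) + p + c)
        x_old, g_old = x, g
        x = x - alpha * g
        g = A @ x - b
        a_sd_prev, gn_prev = a_sd, gn
        count += 1
        if count >= {'bb': Kb, 'sd': Km, 'short': Ks}[phase]:
            phase = {'bb': 'sd', 'sd': 'short', 'short': 'bb'}[phase]
            count = 0
    return x
```

As a usage note, `periodic_gradient(A, b, x0, Kb=50, Km=60, Ks=10)` corresponds to the parameter triple (50, 60, 10) discussed in the example below.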
Remark 5
The BB steps in Algorithm 1 can employ either the BB1 or the BB2 stepsize in (7). The idea of using short stepsizes to eliminate the component \(\mu _k^{(n)}\) has been investigated in [12, 13, 20]. However, these methods are based on the SD method, i.e., they occasionally apply short steps during the iterates of the SD method. One exception is [21], where a method is developed by employing new stepsizes during the iterates of the AOPT method. In contrast, our method periodically uses three different stepsizes: the BB stepsize, the stepsize (11), and the new stepsize \(\tilde{\alpha }_k\).
The following example shows that both the BB steps and the \(\tilde{\alpha }_k\)-steps are important to the efficiency of our method. In particular, we consider \(\varPsi (A)=I\) for \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\). We applied Algorithm 1 to a 1000-dimensional quadratic problem of the form (92) with \(a_i=11i-10\) (see [11]). The iteration was stopped once the gradient norm was reduced by a factor of \(\epsilon \). Five different values of the parameter triple \((K_b,K_m,K_s)\) were tested. Table 1 presents the average number of iterations, over ten different initial points randomly chosen from \([-10,10]\), required by the method to meet a given tolerance. We see that, when Algorithm 1 takes no BB steps, i.e., with (0, 60, 10), it performs better than with (10, 0, 0) for a tight tolerance. Notice that Algorithm 1 with (10, 0, 0) reduces to the BB method with \(\alpha _k^{BB1}\). If Algorithm 1 takes no \(\tilde{\alpha }_k\)-steps, i.e., with (50, 60, 0), it is comparable to the case (0, 60, 10). However, when both BB steps and \(\tilde{\alpha }_k\)-steps are taken, i.e., with (50, 60, 10), Algorithm 1 clearly outperforms the other choices, especially for a tight tolerance.
4 Numerical Experiments
In this section, we present numerical comparisons of Algorithm 1 and the following methods: BB with \(\alpha _k^{BB1}\) [2], DY [11], ABBmin2 [19], SDC [12], and Alg. 1 in [30] (A1 for short).
Notice that the stepsize rule for Algorithm 1 can be written as
where \(\alpha _k^{BB}\) can be either \(\alpha _k^{BB1}\) or \(\alpha _k^{BB2}\), and \(\alpha _{k}(\varPsi (A))\) and \(\tilde{\alpha }_{k}(\varPsi (A))\) are the stepsizes given by (11) and (85), respectively. We tested the following four variants of Algorithm 1, obtained by combining the two BB stepsizes with \(\varPsi (A)=I\) or \(\varPsi (A)=A\):
- BB1SD: \(\alpha _k^{BB1}\) and \(\varPsi (A)=I\) in (93)
- BB1MG: \(\alpha _k^{BB1}\) and \(\varPsi (A)=A\) in (93)
- BB2SD: \(\alpha _k^{BB2}\) and \(\varPsi (A)=I\) in (93)
- BB2MG: \(\alpha _k^{BB2}\) and \(\varPsi (A)=A\) in (93)
Now we derive a formula for the case \(\varPsi (A)=A\), i.e., \(\alpha _{k}(\varPsi (A))=\alpha _{k}^{MG}\). If we set \(r=0\), by (85), we have
which is expensive to compute directly. However, if we set \(r=1/2\), we get
This formula can be computed without additional cost because \(g_{k-1} ^T A g_{k-1}\) and \(g_k ^T A g_k\) have been obtained when computing the stepsizes \(\alpha _{k-1}^{MG}\) and \(\alpha _k^{MG}\).
All the methods under consideration were implemented in Matlab (v.9.0-R2016a) and run on a PC with an Intel Core i7 2.9 GHz processor and 8 GB of RAM under Windows 10. We stopped an algorithm when the number of iterations exceeded 20,000 or the gradient norm was reduced by a factor of \(\epsilon \).
We tested five different sets of problems of the form (92) with the spectrum distributions described in Table 2; see [21, 30], for example. The rand function in Matlab was used to generate the Hessians of the first three problem sets. To investigate the performance of the compared methods for different values of the Hessian condition number \(\kappa \) and the tolerance factor \(\epsilon \), we set \(\kappa \) to \(10^4, 10^5, 10^6\) and \(\epsilon \) to \(10^{-6}, 10^{-9}, 10^{-12}\), respectively. For each value of \(\kappa \) or \(\epsilon \), average results over 10 different starting points with entries randomly generated in \([-10,10]\) are presented in the following tables.
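As an illustration, one diagonal test problem in the spirit of (92) could be generated as follows (the endpoint values and the scaling to a prescribed \(\kappa \) are assumptions made here for the sketch; the actual spectrum distributions are those listed in Table 2):

```python
import numpy as np

def make_problem(n=1000, kappa=1e4, seed=0):
    """Diagonal test problem: lambda_1 = 1, lambda_n = kappa, interior
    eigenvalues drawn uniformly at random, and a starting point in [-10, 10]^n."""
    rng = np.random.default_rng(seed)
    lam = np.empty(n)
    lam[0], lam[-1] = 1.0, kappa
    lam[1:-1] = rng.uniform(1.0, kappa, n - 2)
    A = np.diag(lam)
    b = np.zeros(n)
    x0 = rng.uniform(-10.0, 10.0, n)
    return A, b, x0
```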
The parameter triple \((K_b,K_m,K_s)\) of our four methods was set to (60, 60, 40) for the first three problem sets and to (100, 60, 50) for the last two. As in [19], the parameter \(\tau \) of the ABBmin2 method was set to 0.9. For the SDC and A1 methods, we chose the parameters to achieve their best performance. In particular, the parameter pair (h, s) of SDC was set to (8, 6) for the first three problem sets and to (30, 2) for the last two, while m of A1 was set to 6.
Table 3 reports the average number of iterations of the compared methods for the five problem sets listed in Table 2. We can see that, for the first problem set, our proposed methods perform much better than the BB, DY, SDC, and A1 methods, while the ABBmin2 method is surprisingly efficient among the compared methods. Similar results can be observed for the fourth problem set: our proposed methods are comparable to ABBmin2 and faster than the other compared methods. Results on the second problem set are slightly different: A1 becomes the winner, and our proposed methods outperform the BB, DY, ABBmin2, and SDC methods. For the third and fifth problem sets, the results show a clear advantage of our proposed methods over the other compared methods, especially when the condition number is large and a tight tolerance is required.
5 Conclusions and Discussions
We have presented a theoretical analysis of the asymptotic behavior of a family of gradient methods whose stepsize is given by (11), which includes the steepest descent and minimal gradient methods as special cases. It is shown that each method in this family asymptotically zigzags in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian. To accelerate the gradient methods, we exploit the spectral property of a new stepsize to break the zigzagging pattern. This new stepsize is derived by imposing finite termination on minimizing two-dimensional strictly convex quadratics and is proved to converge to the reciprocal of the largest eigenvalue of the Hessian in the general n-dimensional case. Finally, we propose a very efficient periodic gradient method that alternately uses the BB stepsize, \(\alpha _k\) in (11), and our new stepsize. Our numerical results indicate that, by exploiting the asymptotic behavior and spectral properties of stepsizes, gradient methods can be greatly accelerated so as to outperform the BB method and other recently developed state-of-the-art gradient methods.
One may also break the zigzagging pattern by employing the spectral property in (47). In particular, one could use the following stepsize
which, by (47), satisfies that
Hence, \( \hat{\alpha }_k\) is also a good approximation of \(1/\lambda _n\) when the condition number \(\kappa =\lambda _n/\lambda _1\) is large. See [13] for the case of the SD method.
References
Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
Dai, Y.H.: Alternate step gradient method. Optimization 52(4–5), 395–415 (2003)
Dai, Y.H., Fletcher, R.: On the asymptotic behaviour of some new gradient methods. Math. Program. 103(3), 541–559 (2005)
Dai, Y.H., Huang, Y., Liu, X.W.: A family of spectral gradient methods for optimization. Comput. Optim. Appl. 74(1), 43–65 (2019)
Dai, Y.H., Liao, L.Z.: \(R\)-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comput. Optim. Appl. 33(1), 73–88 (2006)
Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comput. Optim. Appl. 59(3), 541–563 (2014)
De Asmundis, R., di Serafino, D., Riccio, F., Toraldo, G.: On spectral properties of steepest descent methods. IMA J. Numer. Anal. 33(4), 1416–1435 (2013)
Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6), 1645–1661 (1994)
Fletcher, R.: On the Barzilai-Borwein method. In: Optimization and Control with Applications, pp. 235–256. Springer, Boston (2005)
Forsythe, G.E.: On the asymptotic directions of the s-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299 (2008)
Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comput. Optim. Appl. 63(2), 523–542 (2016)
Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. Optim. Method Softw. 35(4), 681–705 (2020)
Huang, Y., Liu, H.: Smoothing projected Barzilai-Borwein method for constrained non-Lipschitz optimization. Comput. Optim. Appl. 65(3), 671–698 (2016)
Huang, Y., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai-Borwein method for nonnegative matrix factorization. Data Min. Knowl. Disc. 29(6), 1665–1684 (2015)
Jiang, B., Dai, Y.H.: Feasible Barzilai-Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Method Softw. 28(4), 756–784 (2013)
Liu, Y.F., Dai, Y.H., Luo, Z.Q.: Coordinated beamforming for MISO interference channel: complexity analysis and efficient algorithms. IEEE Trans. Signal Process. 59(3), 1142–1157 (2011)
Nocedal, J., Sartenaer, A., Zhu, C.: On the behavior of the gradient norm in the steepest descent method. Comput. Optim. Appl. 22(1), 5–35 (2002)
Pronzato, L., Wynn, H.P., Zhigljavsky, A.A.: Asymptotic behaviour of a family of gradient algorithms in \(R^d\) and Hilbert spaces. Math. Program. 107(3), 409–438 (2006)
Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Sun, C., Liu, J.P.: New stepsizes for the gradient method. Optim. Lett. 14, 1943–1955 (2020)
Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
Yuan, Y.X.: Step-sizes for the gradient method. AMS IP Stud. Adv. Math. 42(2), 785–796 (2008)
Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comput. Optim. Appl. 35(1), 69–86 (2006)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 11701137, 11631013, 12071108, 11671116, 11991021, 12021001), the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA27000000), Beijing Academy of Artificial Intelligence (BAAI), Natural Science Foundation of Hebei Province (Grant No. A2021202010), China Scholarship Council (Grant No. 201806705007), and USA National Science Foundation (Grant Nos. DMS-1819161, DMS-2110722).