1 Introduction

The gradient method is well known for solving the following unconstrained optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}~f(x), \end{aligned}$$
(1)

where \(f: \mathbb {R}^n \rightarrow \mathbb {R}\) is continuously differentiable; the method is especially attractive when the dimension n is large. At the k-th iteration, gradient methods update the iterate by

$$\begin{aligned} x_{k+1}=x_k-\alpha _kg_k, \end{aligned}$$
(2)

where \(g_k=\nabla f(x_k)\) and \(\alpha _k>0\) is the stepsize determined by the method.

The simplest nontrivial nonlinear instance of (1) is the quadratic optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^n}~f(x)=\frac{1}{2}x ^T Ax-b ^T x, \end{aligned}$$
(3)

where \(b\in \mathbb {R}^n\) and \(A\in \mathbb {R}^{n\times n}\) is symmetric and positive definite. Solving (3) efficiently is usually a prerequisite for a method to be generalized to more general optimization problems. In addition, by Taylor’s expansion, a general smooth function can be approximated by a quadratic function near a minimizer, so the local convergence behavior of gradient methods is often reflected in their performance on (3). Hence, in this paper, we study the convergence behavior of gradient methods and propose efficient gradient methods for solving (3).

The classic steepest descent (SD) method proposed by Cauchy [4] solves (3) by using the exact stepsize

$$\begin{aligned} \alpha _k^{SD}=\arg \min _{\alpha }~f(x_k-\alpha g_k)=\frac{g_{k} ^T g_{k}}{g_{k} ^T Ag_{k}}. \end{aligned}$$
(4)

Although \(\alpha _k^{SD}\) minimizes f along the steepest descent direction, the SD method often performs poorly in practice and converges only linearly [1, 18], with rate

$$\begin{aligned} \frac{f(x_{k+1})-f^*}{f(x_{k})-f^*}\le \left( \frac{\kappa -1}{\kappa +1}\right) ^2, \end{aligned}$$
(5)

where \(f^*\) is the optimal function value of (3) and \(\kappa =\lambda _n/\lambda _1\) is the condition number of A, with \(\lambda _1\) and \(\lambda _n\) being the smallest and largest eigenvalues of A, respectively. Thus, if \(\kappa \) is large, the SD method may converge very slowly. In addition, Akaike [1] proved that the gradients asymptotically alternate between two directions in the subspace spanned by the two eigenvectors corresponding to \(\lambda _1\) and \(\lambda _n\); consequently, the SD method often exhibits a zigzag pattern near the solution. Forsythe [18] generalized Akaike’s results to the so-called optimum s-gradient method, and Pronzato et al. [27] further generalized them to the so-called P-gradient method in Hilbert space. Recently, by employing Akaike’s results, Nocedal et al. [26] presented insights into the asymptotic behavior of the SD method in terms of function values, stepsizes and gradient norms.
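
To make this zigzag behavior concrete, the following minimal Python sketch (our own illustration, not part of the original analysis; the matrix and starting point are arbitrary choices) applies the SD method (2) with stepsize (4) to a small ill-conditioned quadratic and prints the normalized gradients, which quickly settle into alternating between two directions.

```python
import numpy as np

# Minimal sketch (ours): steepest descent (4) on a 2-D quadratic f(x) = 0.5 x'Ax - b'x.
# It only illustrates the zigzag behavior; the data below are arbitrary.
A = np.diag([1.0, 20.0])                      # kappa = 20
b = np.zeros(2)
x = np.array([1.0, 1.0])

for k in range(12):
    g = A @ x - b                             # gradient g_k
    alpha = (g @ g) / (g @ (A @ g))           # exact stepsize alpha_k^{SD}
    print(f"k={k:2d}  f={0.5 * x @ A @ x - b @ x: .3e}  g/||g|| = {g / np.linalg.norm(g)}")
    x = x - alpha * g
```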

Contrary to the SD method, the minimal gradient (MG) method (see Dai and Yuan [10]) computes its stepsize by minimizing the gradient norm,

$$\begin{aligned} \alpha _k^{MG}=\arg \min _{\alpha }~\Vert \nabla f(x_k-\alpha g_k)\Vert =\frac{g_{k} ^T Ag_{k}}{g_{k} ^T A^2g_{k}}, \end{aligned}$$
(6)

where \(\Vert \cdot \Vert \) is the 2-norm. It is widely accepted that the MG method can also perform poorly and has a similar asymptotic behavior to the SD method, i.e., it asymptotically zigzags in a two-dimensional subspace. Zhou et al. [33] provided some interesting analyses of \(\alpha _k^{MG}\) for minimizing two-dimensional quadratics. However, rigorous asymptotic convergence results of the MG method for minimizing general quadratic functions are very limited in the literature.

In order to avoid the zigzagging pattern, it is useful to determine the stepsize without exact line search, since an exact stepsize yields a new gradient perpendicular to the current one. Barzilai and Borwein [2] proposed the following two novel stepsizes:

$$\begin{aligned} \alpha _k^{BB1}=\frac{s_{k-1} ^T s_{k-1}}{s_{k-1} ^T y_{k-1}}~~\text {and}~~ \alpha _k^{BB2}=\frac{s_{k-1} ^T y_{k-1}}{y_{k-1} ^T y_{k-1}}, \end{aligned}$$
(7)

where \(s_{k-1}=x_k-x_{k-1}\) and \(y_{k-1}=g_k-g_{k-1}\). The Barzilai-Borwein (BB) method (7) performs quite well in practice, though it generates a nonmonotone sequence of objective values. Due to its simplicity and efficiency, the BB method has been widely studied [6,7,8, 17, 28] and extended to general problems and various applications [3, 22,23,24,25, 29]. Another line of research for breaking the zigzagging pattern and accelerating convergence is to occasionally apply short stepsizes that approximate \(1/\lambda _n\) in order to eliminate the corresponding component of the gradient. One seminal work is due to Yuan [31, 32], who derived the following stepsize:

$$\begin{aligned} \alpha _k^{Y}=\frac{2}{\frac{1}{\alpha _{k-1}^{SD}}+\frac{1}{\alpha _{k}^{SD}}+ \sqrt{\left( \frac{1}{\alpha _{k-1}^{SD}}-\frac{1}{\alpha _{k}^{SD}}\right) ^2+ \frac{4\Vert g_k\Vert ^2}{(\alpha _{k-1}^{SD}\Vert g_{k-1}\Vert )^2}}}. \end{aligned}$$
(8)

Dai and Yuan [11] further suggested a new gradient method with

$$\begin{aligned} \alpha _k^{DY}=\left\{ \begin{array}{ll} \alpha _k^{SD}, &{}\quad \hbox {if mod}(k,4)<2; \\ \alpha _k^{Y}~~, &{}\quad \hbox {otherwise.} \end{array}\right. \end{aligned}$$
(9)

The Dai-Yuan (DY) method (9) is a monotone method and appears very competitive with the nonmonotone BB method. Recently, by employing the results in [1] and [26], De Asmundis et al. [12] showed that the stepsize \(\alpha _k^{Y}\) converges to \(1/\lambda _n\) when the SD method is applied to problem (3). This spectral property is the key to breaking the zigzagging pattern.
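
As a reading aid, here is a hedged Python sketch of the DY scheme (9); the helper name dy_method and the test problem are ours, and the code is only meant to show how (8) is assembled from two consecutive SD stepsizes and gradient norms.

```python
import numpy as np

# Hedged sketch of the Dai-Yuan scheme (9): alternate the exact stepsize (4) with
# Yuan's stepsize (8) in a cycle of length four.  Function name and test data are ours.
def dy_method(A, b, x0, iters=60):
    x = x0.astype(float).copy()
    g = A @ x - b
    sd_prev, g_prev_norm = None, None
    for k in range(iters):
        sd = (g @ g) / (g @ (A @ g))                     # alpha_k^{SD} from (4)
        if k % 4 < 2 or sd_prev is None:
            alpha = sd
        else:                                            # Yuan's stepsize (8)
            root = np.sqrt((1.0 / sd_prev - 1.0 / sd) ** 2
                           + 4.0 * (g @ g) / (sd_prev * g_prev_norm) ** 2)
            alpha = 2.0 / (1.0 / sd_prev + 1.0 / sd + root)
        sd_prev, g_prev_norm = sd, np.linalg.norm(g)
        x = x - alpha * g
        g = A @ x - b
    return x

A = np.diag([1.0, 5.0, 25.0, 100.0])
x_out = dy_method(A, b=np.ones(4), x0=np.zeros(4))
print(np.linalg.norm(A @ x_out - np.ones(4)))            # gradient norm at the output
```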

In [9], Dai and Yang developed the asymptotic optimal gradient (AOPT) method whose stepsize is given by

$$\begin{aligned} \alpha _k^{AOPT}=\frac{\Vert g_k\Vert }{\Vert Ag_k\Vert }. \end{aligned}$$
(10)

Unlike the DY method, the AOPT method uses only one stepsize. In addition, they showed that \(\alpha _k^{AOPT}\) asymptotically converges to \(\frac{2}{\lambda _1+\lambda _n}\), which is in some sense an optimal stepsize since it minimizes \(\Vert I-\alpha A\Vert \) over \(\alpha \) [9, 16]. However, the AOPT method also asymptotically alternates between two directions. To accelerate the AOPT method, Huang et al. [21] derived a new stepsize that converges to \(1/\lambda _n\) during the AOPT iterations and further suggested a gradient method that exploits spectral properties of the stepsizes. For the latest developments on exploiting spectral properties to accelerate gradient methods, see [12,13,14, 20, 21] and references therein.

In this paper, we analyze the asymptotic behavior of gradient methods and develop techniques for breaking the zigzagging pattern. For a uniform analysis, we consider the following stepsize

$$\begin{aligned} \alpha _k=\frac{g_k ^T \varPsi (A) g_k}{g_k ^T \varPsi (A)A g_k}, \end{aligned}$$
(11)

where \(\varPsi \) is a real analytic function on \([\lambda _1,\lambda _n]\) that can be expressed by the Laurent series

$$\begin{aligned} \varPsi (z)=\sum _{k=-\infty }^\infty c_kz^k,~~c_k\in \mathbb {R}, \end{aligned}$$

such that \(0<\sum _{k=-\infty }^\infty c_kz^k<+\infty \) for all \(z\in [\lambda _1,\lambda _n]\). Clearly, (11) defines a family of stepsizes and hence a family of gradient methods. When \(\varPsi (A)=A^u\) for some nonnegative integer u, we obtain the stepsize

$$\begin{aligned} \alpha _k=\frac{g_k ^T A^{u} g_k}{g_k ^T A^{u+1} g_k}. \end{aligned}$$
(12)

The stepsizes \(\alpha _k^{SD}\) and \(\alpha _k^{MG}\) simply correspond to the cases \(u=0\) and \(u=1\), respectively.
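
For concreteness, a stepsize of the form (12) can be computed as in the following sketch (the function name and test data are ours); u = 0 recovers \(\alpha _k^{SD}\) and u = 1 recovers \(\alpha _k^{MG}\).

```python
import numpy as np

# Sketch of the stepsize (12) for Psi(A) = A^u (function name and data are ours):
# u = 0 gives the SD stepsize (4) and u = 1 gives the MG stepsize (6).
def power_stepsize(A, g, u):
    Aug = np.linalg.matrix_power(A, u) @ g        # A^u g
    return (g @ Aug) / (g @ (A @ Aug))            # (g' A^u g) / (g' A^{u+1} g)

A = np.diag([1.0, 4.0, 9.0])
g = np.array([1.0, 1.0, 1.0])
print(power_stepsize(A, g, 0), power_stepsize(A, g, 1))
```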

We will present a theoretical analysis of the asymptotic convergence of the family of gradient methods whose stepsize can be written in the form (11), which provides justification for the zigzagging behavior of all these methods, including the SD and MG methods. In particular, we show that each method in the family (11) asymptotically alternates between two directions associated with the eigenvectors corresponding to \(\lambda _1\) and \(\lambda _n\). Moreover, we analyze the asymptotic behaviors of the objective value, the gradient norm, and the stepsize. It is shown that, when \(\varPsi (A)\ne I\), the two sequences \(\Big \{\frac{\varDelta _{2k+1}}{\varDelta _{2k}} \Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}} \Big \}\) may converge at different speeds, while the odd and even subsequences \(\Big \{\frac{\varDelta _{2k+3}}{\varDelta _{2k+1}}\Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k}}\Big \}\) converge at the same rate, where \(\varDelta _k = f(x_k) - f^* \). A similar property holds for the gradient norm sequence. In addition, each method in the family (11) has the same worst-case asymptotic convergence rate.

In order to accelerate the gradient methods (11), we investigate techniques for breaking the zigzagging pattern. We derive a new stepsize \(\tilde{\alpha }_k\) based on finite termination for minimizing two-dimensional strictly convex quadratic functions. For the n-dimensional case, we prove that \(\tilde{\alpha }_k\) converges to \(1/\lambda _n\) when a gradient method of the form (11) is applied to problem (3). Furthermore, based on this spectral property, we propose a periodic gradient method, which, in a periodic mode, alternately uses the BB stepsize, the stepsize (11) and our new stepsize \(\tilde{\alpha }_k\). Numerical comparisons of the proposed method with the BB [2], DY [11], ABBmin2 [19], SDC [12] methods and Alg. 1 in [30] show that the new gradient method is very promising. Our theoretical results also significantly improve and generalize those in [1] and [26], where only the SD method (i.e., \(\varPsi (A)=I\)) is considered. We point out that [27] does not analyze the asymptotic behaviors of the objective value, gradient norm, and stepsize, although (11) is similar to their P-gradient method. Moreover, we develop techniques for accelerating these zigzagging methods with simpler analysis. Notice that \(\alpha _k^{AOPT}\) cannot be written in the form (11), so our results are not applicable to the AOPT method. On the other hand, the analysis of the AOPT method presented in [9] cannot be applied directly to the family of methods (11).

The paper is organized as follows. In Sect. 2, we analyze the asymptotic behaviors of the family of gradient methods (11). In Sect. 3, we accelerate the gradient methods (11) by developing techniques to break the zigzagging pattern and propose a new periodic gradient method. Numerical experiments are presented in Sect. 4. Finally, conclusions and discussions are given in Sect. 5.

2 Asymptotic Behavior of the Family (11)

In this section, we present a uniform analysis on the asymptotic behavior of the family of gradient methods (11) for general n-dimensional strictly convex quadratics.

Let \(\{\lambda _1,\lambda _2,\cdots ,\lambda _n\}\) be the eigenvalues of A, and \(\{\xi _1,\xi _2,\ldots ,\xi _n\}\) the associated orthonormal eigenvectors. Note that the gradient method is invariant under translations and rotations when applied to a quadratic function. As pointed out by Fletcher in [17], gradient components corresponding to multiple eigenvalues can be combined. Thus, for theoretical analysis, we assume without loss of generality that

$$\begin{aligned} A=\text {diag}\{\lambda _1,\lambda _2,\cdots ,\lambda _n\},~~0<\lambda _1<\lambda _2<\cdots <\lambda _n. \end{aligned}$$
(13)

Denote the components of \(g_k\) along the eigenvectors \(\xi _i\) by \(\mu _k^{(i)}\), \(i=1,\ldots ,n\), i.e.,

$$\begin{aligned} g_k=\sum _{i=1}^n\mu _k^{(i)}\xi _i. \end{aligned}$$
(14)

The above decomposition of the gradient \(g_k\), together with the update rule (2), gives

$$\begin{aligned} g_{k+1}=g_k-\alpha _kAg_k=\prod _{j=0}^k(I-\alpha _jA)g_0=\sum _{i=1}^n\mu _{k+1}^{(i)}\xi _i, \end{aligned}$$
(15)

where

$$\begin{aligned} \mu _{k+1}^{(i)}=(1-\alpha _k\lambda _i)\mu _k^{(i)}=\mu _0^{(i)}\prod _{j=0}^k(1-\alpha _j\lambda _i). \end{aligned}$$
(16)

Defining the vector \(q_k=\left( q_k^{(i)}\right) \) with

$$\begin{aligned} q_k^{(i)}=\frac{\left( \mu _k^{(i)}\right) ^2}{\Vert \mu _k\Vert ^2} \end{aligned}$$
(17)

and

$$\begin{aligned} \gamma _k=\frac{1}{\alpha _k}=\frac{g_k ^T \varPsi (A)A g_k}{g_k ^T \varPsi (A) g_k} =\frac{\sum _{i=1}^n\varPsi (\lambda _i)\lambda _iq_k^{(i)}}{\sum _{i=1}^n\varPsi (\lambda _i)q_k^{(i)}}, \end{aligned}$$
(18)

we have from (16), (17) and (18) that

$$\begin{aligned} q_{k+1}^{(i)}=\frac{(\lambda _i-\gamma _k)^2q_k^{(i)}}{\sum _{i=1}^n(\lambda _i-\gamma _k)^2q_k^{(i)}}. \end{aligned}$$
(19)

In addition, by the definition of \(q_k\), we know that \(q_k^{(i)}\ge 0\) for all i and

$$\begin{aligned} \sum _{i=1}^nq_k^{(i)}=1,~~\forall ~~k\ge 1. \end{aligned}$$

Before establishing the asymptotic convergence of the family of gradient methods (11), we first give some lemmas on the properties of the sequence \(\{q_k\}\).

Lemma 1

Suppose \(p\in \mathbb {R}^n\) satisfies (i) \(p^{(i)}\ge 0\) for all \(i=1,2,\ldots ,n\); (ii) there exist at least two indices i with \(p^{(i)}>0\); and (iii) \(\sum _{i=1}^np^{(i)}=1\). Let \(T: \mathbb {R}^n\rightarrow \mathbb {R}^n\) be the following transformation:

$$\begin{aligned} (Tp)^{(i)}=\frac{(\lambda _i-\gamma (p))^2p^{(i)}}{\sum _{i=1}^n(\lambda _i-\gamma (p))^2p^{(i)}}, \end{aligned}$$
(20)

where

$$\begin{aligned} \gamma (p)=\frac{\sum _{i=1}^n\varPsi (\lambda _i)\lambda _ip^{(i)}}{\sum _{i=1}^n\varPsi (\lambda _i)p^{(i)}}. \end{aligned}$$
(21)

Then we have

$$\begin{aligned} \varTheta (Tp)\ge \varTheta (p), \end{aligned}$$
(22)

where

$$\begin{aligned} \varTheta (p)=\frac{\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (p))^2p^{(i)}}{\sum _{i=1}^n\varPsi (\lambda _i)p^{(i)}}. \end{aligned}$$
(23)

In addition, (22) holds with equality if and only if there are two indices, say \(i_1\) and \(i_2\), such that \(p^{(i)}=0\) for all \(i\notin \{i_1,i_2\}\) and

$$\begin{aligned} \gamma (Tp)+\gamma (p)=\lambda _{i_1}+\lambda _{i_2}. \end{aligned}$$
(24)

Proof

It follows from the definition of Tp that

$$\begin{aligned} \varTheta (Tp)&=\frac{\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (Tp))^2(Tp)^{(i)}}{\sum _{i=1}^n\varPsi (\lambda _i)(Tp)^{(i)}} \nonumber \\&=\frac{\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (Tp))^2(\lambda _i-\gamma (p))^2p^{(i)}}{\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (p))^2p^{(i)}}. \end{aligned}$$
(25)

Let us define two vectors \(w = (w_i) \in \mathbb {R}^n\) and \(z = (z_i) \in \mathbb {R}^n\) by

$$\begin{aligned} w_i=\sqrt{\varPsi (\lambda _i)}(\lambda _i-\gamma (Tp))(\lambda _i-\gamma (p))\sqrt{p^{(i)}} \end{aligned}$$

and

$$\begin{aligned} z_i=\sqrt{\varPsi (\lambda _i)}\sqrt{p^{(i)}}. \end{aligned}$$

Then, we have from the Cauchy-Schwarz inequality that

$$\begin{aligned} \Vert w\Vert ^2\Vert z\Vert ^2&=\left( \sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (Tp))^2(\lambda _i-\gamma (p))^2p^{(i)}\right) \left( \sum _{i=1}^n\varPsi (\lambda _i)p^{(i)}\right) \nonumber \\&\ge (w ^T z)^2=\left( \sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (Tp))(\lambda _i-\gamma (p))p^{(i)}\right) ^2. \end{aligned}$$
(26)

Using the definition of \(\gamma (p)\), we can obtain that

$$\begin{aligned}&\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (Tp))(\lambda _i-\gamma (p))p^{(i)} -\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (p))^2p^{(i)} \nonumber \\&\quad = (\gamma (p)-\gamma (Tp))\sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (p))p^{(i)} =0, \end{aligned}$$
(27)

which together with (26) gives

$$\begin{aligned}&\left( \sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (Tp))^2(\lambda _i-\gamma (p))^2p^{(i)}\right) \left( \sum _{i=1}^n\varPsi (\lambda _i)p^{(i)}\right) \nonumber \\&\quad \ge \left( \sum _{i=1}^n\varPsi (\lambda _i)(\lambda _i-\gamma (p))^2p^{(i)}\right) ^2. \end{aligned}$$
(28)

Then, the inequality (22) follows immediately.

The equality in (26) holds if and only if

$$\begin{aligned} \sqrt{\varPsi (\lambda _i)}(\lambda _i-\gamma (Tp))(\lambda _i-\gamma (p))\sqrt{p^{(i)}} =C\sqrt{\varPsi (\lambda _i)}\sqrt{p^{(i)}},~~i=1,\ldots ,n \end{aligned}$$
(29)

for some nonzero scalar C. Clearly, (29) holds when \(p^{(i)}=0\). Suppose that there exist two indices \(i_1\) and \(i_2\) such that \(p^{(i_1)},p^{(i_2)}>0\). It follows from (29) that

$$\begin{aligned} (\lambda _{i_1}-\gamma (Tp))(\lambda _{i_1}-\gamma (p)) =(\lambda _{i_2}-\gamma (Tp))(\lambda _{i_2}-\gamma (p)). \end{aligned}$$

So, by the assumption (13), we have

$$\begin{aligned} \lambda _{i_1}+\lambda _{i_2}=\gamma (Tp)+\gamma (p), \end{aligned}$$

which, together with assumption (13), implies that (29) holds if and only if p has only two nonzero components and (24) holds. \(\square \)

Lemma 2

Let \(p_*\in \mathbb {R}^n\) satisfy the conditions of Lemma 1 and T be the transformation (20). If \(p_*\) has only two nonzero components \(p_*^{(i_1)}\) and \(p_*^{(i_2)}\), we have

$$\begin{aligned}&(Tp_*)^{(i_1)}=\frac{\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}{\varPsi ^2(\lambda _{i_1})p_*^{(i_1)}+\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}, \end{aligned}$$
(30)
$$\begin{aligned}&(Tp_*)^{(i_2)} = \frac{\varPsi ^2(\lambda _{i_1})p_*^{(i_1)}}{\varPsi ^2(\lambda _{i_1})p_*^{(i_1)}+\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}, \end{aligned}$$
(31)
$$\begin{aligned}&(T^2p_*)^{(i_1)} = p_*^{(i_1)},~~(T^2p_*)^{(i_2)}=p_*^{(i_2)}, \end{aligned}$$
(32)

and

$$\begin{aligned} \gamma (p_*)+\gamma (Tp_*)=\lambda _{i_1}+\lambda _{i_2}, \end{aligned}$$
(33)

where the function \(\gamma \) is defined in (21). Moreover, \(p_*=Tp_*\) if and only if

$$\begin{aligned} p_*^{(i_1)}=\frac{\varPsi (\lambda _{i_2})}{\varPsi (\lambda _{i_1})+\varPsi (\lambda _{i_2})} \quad \text{ and } \quad p_*^{(i_2)}=\frac{\varPsi (\lambda _{i_1})}{\varPsi (\lambda _{i_1})+\varPsi (\lambda _{i_2})}. \end{aligned}$$
(34)

Proof

By the definition of \(\gamma (p)\), we have

$$\begin{aligned} \gamma (p_*)=\frac{\varPsi (\lambda _{i_1})\lambda _{i_1}p_*^{(i_1)} +\varPsi (\lambda _{i_2})\lambda _{i_2}p_*^{(i_2)}}{\varPsi (\lambda _{i_1})p_*^{(i_1)}+\varPsi (\lambda _{i_2})p_*^{(i_2)}}, \end{aligned}$$
(35)

which indicates that

$$\begin{aligned} \lambda _{i_1}-\gamma (p_*)=\frac{\varPsi (\lambda _{i_2})p_*^{(i_2)}(\lambda _{i_1}-\lambda _{i_2})}{\varPsi (\lambda _{i_1})p_*^{(i_1)}+\varPsi (\lambda _{i_2})p_*^{(i_2)}} \end{aligned}$$

and

$$\begin{aligned} \lambda _{i_2}-\gamma (p_*)=\frac{\varPsi (\lambda _{i_1})p_*^{(i_1)}(\lambda _{i_2}-\lambda _{i_1})}{\varPsi (\lambda _{i_1})p_*^{(i_1)}+\varPsi (\lambda _{i_2})p_*^{(i_2)}}. \end{aligned}$$

Then, it follows from the definition of transformation T that

$$\begin{aligned} (Tp_*)^{(i_1)}&=\frac{(\varPsi (\lambda _{i_2})p_*^{(i_2)})^2p_*^{(i_1)}}{(\varPsi (\lambda _{i_2})p_*^{(i_2)})^2p_*^{(i_1)}+(\varPsi (\lambda _{i_1})p_*^{(i_1)})^2p_*^{(i_2)}}\\&=\frac{\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}{\varPsi ^2(\lambda _{i_1})p_*^{(i_1)}+\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}. \end{aligned}$$

This gives (30). Equation (31) can be proved similarly. By (30) and (31), we have

$$\begin{aligned} (T^2p_*)^{(i_1)}&=\frac{\varPsi ^2(\lambda _{i_2})(Tp_*)^{(i_2)}}{\varPsi ^2(\lambda _{i_1})(Tp_*)^{(i_1)}+\varPsi ^2(\lambda _{i_2})(Tp_*)^{(i_2)}}\\&=\frac{\varPsi ^2(\lambda _{i_1})\varPsi ^2(\lambda _{i_2})p_*^{(i_1)}}{\varPsi ^2(\lambda _{i_1})\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}+ \varPsi ^2(\lambda _{i_1})\varPsi ^2(\lambda _{i_2})p_*^{(i_1)}}\\&=\frac{p_*^{(i_1)}}{p_*^{(i_1)}+p_*^{(i_2)}}=p_*^{(i_1)}. \end{aligned}$$

The result of \((T^2p_*)^{(i_2)}\) follows similarly. This proves (32).

Again by (30), (31) and the definition of function \(\gamma \) in (21), we have

$$\begin{aligned} \gamma (Tp_*)=\frac{\lambda _{i_1}\varPsi (\lambda _{i_2})p_*^{(i_2)}+ \lambda _{i_2}\varPsi (\lambda _{i_1})p_*^{(i_1)}}{\varPsi (\lambda _{i_1})p_*^{(i_1)}+\varPsi (\lambda _{i_2})p_*^{(i_2)}}. \end{aligned}$$
(36)

Then, the equality (33) follows from (35) and (36). For (34), let

$$\begin{aligned} p_*^{(i_1)}=\frac{\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}{\varPsi ^2(\lambda _{i_1})p_*^{(i_1)}+\varPsi ^2(\lambda _{i_2})p_*^{(i_2)}}. \end{aligned}$$

Rearranging terms and using \(p_*^{(i_1)}+p_*^{(i_2)}=1\), we have

$$\begin{aligned} \varPsi ^2(\lambda _{i_1})(p_*^{(i_1)})^2=\varPsi ^2(\lambda _{i_2})(p_*^{(i_2)})^2, \end{aligned}$$

which implies that

$$\begin{aligned} \varPsi (\lambda _{i_1})p_*^{(i_1)}=\varPsi (\lambda _{i_2})p_*^{(i_2)}. \end{aligned}$$

This together with the fact \(p_*^{(i_1)}+p_*^{(i_2)}=1\) yields (34). \(\square \)

Lemma 3

Let \(p\in \mathbb {R}^n\) satisfy the conditions of Lemma 1 and T be the transformation (20). Then, there exists a \(p_*\) satisfying

$$\begin{aligned} \lim _{k\rightarrow \infty }T^{2k}p=p_*~~and~~ \lim _{k\rightarrow \infty }T^{2k+1}p=Tp_*, \end{aligned}$$
(37)

where \(p_*\) and \(Tp_*\) have only two nonzero components satisfying

$$\begin{aligned}&p_*^{(i_1)}+p_*^{(i_2)}=1, ~~p_*^{(i)}=0,~~i\ne i_1,i_2, \end{aligned}$$
(38)
$$\begin{aligned}&(Tp_*)^{(i_1)}+(Tp_*)^{(i_2)}=1,~~(Tp_*)^{(i)}=0,~~i\ne i_1,i_2, \end{aligned}$$
(39)

for some \(i_1,i_2\in \{1,\ldots ,n\}\). Hence, (30), (31), (32) and (33) hold.

Proof

Let \(p_0=T^0p=p\) and \(p_{k}=T{p_{k-1}}=T^{k}{p_0}\). Obviously, for all \(k\ge 0\), \(p_{k}\) satisfies (i) and (iii) of Lemma 1. Let \(i_{\min } = \min \{i \in \mathcal {N}: p_0^{(i)}>0\}\) and \(i_{\max } = \max \{i \in \mathcal {N}: p_0^{(i)}>0\}\), where \(\mathcal {N} = \{1, \ldots , n\}\). From the definition of \(\gamma \), we know \(\lambda _{i_{\min }}< \gamma (p)<\lambda _{i_{\max }}\). Thus, by the definition of T, we have \(p_1^{({i_{\min }})}>0\) and \(p_1^{({i_{\max }})}>0\). Then, by induction, for all \(k\ge 0\), \(p_{k}\) satisfies (ii) of Lemma 1. So, by Lemma 1, \(\{\varTheta (p_k)\}\) is a monotonically nondecreasing sequence. Since \(\lambda _1\le \gamma (p)\le \lambda _n\), we have \((\lambda _i-\gamma (p))^2\le (\lambda _n-\lambda _1)^2\). Hence, we have from the definition of \(\varTheta \) that \(\varTheta (p_k)\le (\lambda _n-\lambda _1)^2 \). Thus, \(\{\varTheta (p_k)\}\) is convergent. Let \(\varTheta _*=\lim _{k\rightarrow \infty }\varTheta (p_k) >0\).

Denote the set of all limit points of \(\{p_k\}\) by \(P_*\) with cardinality \(|P_*|\). Since \(\{p_k\}\) is bounded, \(|P_*|\ge 1\). For any subsequence \(\{p_{k_j}\}\) converging to some \(p_*\in P_*\), we have

$$\begin{aligned} \lim _{j\rightarrow \infty }\varTheta (p_{k_j})=\varTheta (p_*) \quad \text{ and } \quad \lim _{j\rightarrow \infty }\varTheta (Tp_{k_j})=\varTheta (Tp_*), \end{aligned}$$

by the continuity of \(\varTheta \) and T. Since \(p_{k_j+1}=Tp_{k_j}\), we have \( \varTheta _*=\varTheta (p_*)=\varTheta (Tp_*)\).

Since \(p_{k}\) satisfies (i)–(iii) of Lemma 1 for all \(k\ge 0\), \(p_*\) must satisfy (i) and (iii). If \(p_*\) has only one positive component, we have \(\varTheta (p_*)=0\) which contradicts \(\varTheta (p_*) = \varTheta _* > 0 \). Hence, by Lemmas 1, 2 and \(\varTheta (p_*)=\varTheta (Tp_*)\), \(p_*\) has only two nonzero components, say \(p_*^{(i_1)}\) and \(p_*^{(i_2)}\), and their values are uniquely determined by the indices \(i_1\), \(i_2\) and the eigenvalues \(\lambda _{i_1}\) and \(\lambda _{i_2}\). This implies \(|P_*| < \infty \). Furthermore, by Lemma 2, for any \(p_*\in P_*\), \(Tp_*\) is given by (30) and (31), and \(Tp_*\in P_*\).

We now show that \(|P_*|\le 2\) by way of contradiction. Suppose \(|P_*| \ge 3 \). For any \(p_*\in P_*\) and \(Tp_*\in P_*\), let \(\delta _1\) and \(\delta _2\) denote the distances from \(p_*\) to \(P_*\setminus \{p_*\}\) and from \(Tp_*\) to \(P_*\setminus \{Tp_*\}\), respectively. Since \( 3 \le |P_*| < \infty \), we have \(\delta _1>0\), \(\delta _2>0\), and there exists an infinite subsequence \(\{p_{k_j}\}\) such that

$$\begin{aligned} p_{k_j}\rightarrow p_*, \quad \text{ and } \quad p_{k_j+1}=Tp_{k_j}\rightarrow Tp_*, \end{aligned}$$

but \(p_{k_j+2}\notin \mathcal {B}\left( p_*,\frac{1}{2}\delta \right) \cup \mathcal {B}\left( Tp_*,\frac{1}{2}\delta \right) \), where \(\delta =\min \{\delta _1,\delta _2\}\) and \(\mathcal {B}(p_*,r)=\{p: \Vert p-p_*\Vert \le r\}\). However, by (32) we have \(T^2p_*=p_*\). Hence, by continuity of T,

$$\begin{aligned} \lim _{j\rightarrow \infty }p_{k_j+2}=\lim _{j\rightarrow \infty }Tp_{k_j+1}=\lim _{j\rightarrow \infty }T^2p_{k_j}= p_*, \end{aligned}$$

which contradicts the choice of \(p_{k_j+2}\notin \mathcal {B}\left( p_*,\frac{1}{2}\delta \right) \). Thus, \(\{p_k\}\) has at most two limit points \(p_*\) and \(Tp_*\), and both have only two nonzero components.

Now, we assume that \(p_*\) is a limit point of \(\{p_{2k}\}\). Since \(T^2p_*=p_*\), all subsequences of \(\{p_{2k}\}\) have the same limit point, i.e., \(p_{2k}=T^{2k}p\rightarrow p_*\). Similarly, we have \(T^{2k+1}p\rightarrow Tp_*\). Then, (38) and (39) follow directly from the analysis. \(\square \)

Based on the above analysis, we can show that each gradient method in the family (11) asymptotically reduces its search to the two-dimensional subspace spanned by the eigenvectors \(\xi _1\) and \(\xi _n\).

Theorem 1

Assume that the starting point \(x_0\) has the property that

$$\begin{aligned} g_0 ^T \xi _1\ne 0~~\text {and}~~g_0 ^T \xi _n\ne 0. \end{aligned}$$
(40)

Let \(\{x_k\}\) be the iterates generated by applying a method of the family (11) to problem (3). Then

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{{(\mu _{2k}^{(i)})^2}}{\sum _{j=1}^n(\mu _{2k}^{(j)})^2}= \left\{ \begin{array}{ll} \displaystyle \frac{1}{1+c^2}, &{}\quad \hbox {if}\, i=1, \\ 0, &{}\quad \hbox {if}\, i=2,\ldots ,n-1, \\ \displaystyle \frac{c^2}{1+c^2}, &{}\quad \hbox {if}\, i=n, \end{array}\right. \end{aligned}$$
(41)

and

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{(\mu _{2k+1}^{(i)})^2}{\sum _{j=1}^n(\mu _{2k+1}^{(j)})^2}= \left\{ \begin{array}{ll} \displaystyle \frac{c^2\varPsi ^2(\lambda _{n})}{\varPsi ^2(\lambda _{1})+c^2\varPsi ^2(\lambda _{n})}, &{}\quad \hbox {if}\; i=1, \\ 0, &{}\quad \hbox {if}\;i=2,\ldots ,n-1, \\ \displaystyle \frac{\varPsi ^2(\lambda _{1})}{\varPsi ^2(\lambda _{1})+c^2\varPsi ^2(\lambda _{n})}, &{}\quad \hbox {if}\; i=n, \end{array}\right. \end{aligned}$$
(42)

where c is a nonzero constant and can be determined by the limit in (48).

Proof

By the assumption (40), we know that \(q_0\) satisfies (i)–(iii) of Lemma 1. Notice that \(q_k = T^k q_0\). Then, by Lemma 3, there exists a \(p_*\) such that the sequences \(\{q_{2k}\}\) and \(\{q_{2k+1}\}\) converge to \(p_*\) and \(Tp_*\), respectively, which have only two nonzero components satisfying (38), (39) for some \(i_1,i_2\in \{1,\ldots ,n\}\), and (32) holds. Hence, if \(1< i_1<i_2<n\), we have

$$\begin{aligned} \lim _{k\rightarrow \infty }q_{2k}^{(n)}=0, \qquad \lim _{k\rightarrow \infty }\frac{q_{2k}^{(i_2)}}{q_{2k+2}^{(i_2)}}=1, \end{aligned}$$
(43)

and

$$\begin{aligned} \lim _{k\rightarrow \infty }(\gamma (q_{2k})+\gamma (q_{2k+1}))= \gamma (p_*)+\gamma (Tp_*) = \lambda _{i_1}+\lambda _{i_2}. \end{aligned}$$

In addition, since \(q_0^{(1)} > 0 \) and \(q_0^{(n)} > 0\) by (40), we can see from the proof of Lemma 3 that \(q_k^{(1)}>0\), \(q_k^{(n)}>0\) for all \(k\ge 0\). Thus, we have

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{q_{2k+2}^{(n)}}{q_{2k}^{(n)}}&=\lim _{k\rightarrow \infty }\frac{q_{2k+2}^{(n)}}{q_{2k}^{(n)}}\frac{q_{2k}^{(i_2)}}{q_{2k+2}^{(i_2)}} =\lim _{k\rightarrow \infty }\frac{(\lambda _n-\gamma (q_{2k+1}))^2(\lambda _n-\gamma (q_{2k}))^2}{(\lambda _{i_2}-\gamma (q_{2k+1}))^2(\lambda _{i_2}-\gamma (q_{2k}))^2} \nonumber \\&=\frac{(\lambda _n-\gamma (p_*))^2(\lambda _n-\gamma (Tp_*))^2}{(\lambda _{i_2}-\gamma (p_*))^2(\lambda _{i_2}-\gamma (Tp_*))^2} >1, \end{aligned}$$
(44)

where the inequality is due to \(\lambda _{i_1}< \gamma (p_*),\gamma (Tp_*) < \lambda _{i_2}\) and \(\lambda _n > \lambda _{i_2}\). So, \(q_{2k}^{(n)}\rightarrow +\infty \), which contradicts (43). Then, we must have \(i_2=n\). In a similar way, we can show that \(i_1=1\). Finally, the equalities in (41) and (42) follow directly from Lemma 2. \(\square \)

In the following, c refers to the same constant as in Theorem 1. By Theorem 1 we can directly obtain the asymptotic behavior of the stepsize.

Corollary 1

Under the conditions of Theorem 1, we have

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha _{2k}= \frac{\varPsi (\lambda _{1})+c^2\varPsi (\lambda _{n})}{\lambda _{1}(\varPsi (\lambda _{1})+c^2\kappa \varPsi (\lambda _{n}))} \end{aligned}$$
(45)

and

$$\begin{aligned} \lim _{k\rightarrow \infty }\alpha _{2k+1}= \frac{\varPsi (\lambda _{1})+c^2\varPsi (\lambda _{n})}{\lambda _{1}(\kappa \varPsi (\lambda _{1})+c^2\varPsi (\lambda _{n}))}, \end{aligned}$$
(46)

where \(\alpha _k\) is defined in (11) and \(\kappa = \lambda _n/\lambda _1\) is the condition number of A. Moreover,

$$\begin{aligned} \lim _{k\rightarrow \infty }\left( \frac{1}{\alpha _{2k}}+\frac{1}{\alpha _{2k+1}}\right) =\lambda _{1}+\lambda _{n}. \end{aligned}$$
(47)
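
The limits in Theorem 1 and Corollary 1 are easy to observe numerically. The sketch below (our own, using \(\varPsi (A)=A\), i.e., the MG method, and an arbitrary diagonal test matrix) iterates the gradient recursion (15) and prints the vector \(q_k\) at a late even iterate together with the sum \(1/\alpha _{2k}+1/\alpha _{2k+1}\), which approaches \(\lambda _1+\lambda _n\) as predicted by (47).

```python
import numpy as np

# Sketch (ours) illustrating Theorem 1 and (47): run method (11) with Psi(A) = A (the MG
# method) on a diagonal test problem and track q_k and the stepsizes via the gradient
# recursion (15).
lam = np.array([1.0, 3.0, 7.0, 10.0])         # lambda_1 < ... < lambda_n
A = np.diag(lam)
g = np.ones(4)                                # g_0; first and last components nonzero, as in (40)
alphas = []
for k in range(200):
    Ag = A @ g
    alphas.append((g @ Ag) / (Ag @ Ag))       # stepsize (11) with Psi(A) = A, i.e. (6)
    g = g - alphas[-1] * Ag                   # g_{k+1} = (I - alpha_k A) g_k, see (15)

q = g ** 2 / (g @ g)                          # q_k at the even iterate k = 200, see (17)
print("q_200 ~", np.round(q, 4))              # ~ (1/(1+c^2), 0, 0, c^2/(1+c^2)) by (41)
print("1/alpha_198 + 1/alpha_199 =", 1.0 / alphas[-2] + 1.0 / alphas[-1],
      " vs  lambda_1 + lambda_n =", lam[0] + lam[-1])     # cf. (47)
```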

The next corollary interprets the constant c. A special result for the case \(\varPsi (A)=I\) (i.e., the SD method) can be found in Lemma 3.4 of [26].

Corollary 2

Under the conditions of Theorem 1, we have

$$\begin{aligned} c=\lim _{k\rightarrow \infty }\frac{\mu _{2k}^{(n)}}{\mu _{2k}^{(1)}} =-\frac{\varPsi (\lambda _{1})}{\varPsi (\lambda _{n})}\lim _{k\rightarrow \infty }\frac{\mu _{2k+1}^{(1)}}{\mu _{2k+1}^{(n)}}. \end{aligned}$$
(48)

Proof

It follows from Theorem 1 that

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{(\mu _{2k}^{(n)})^2}{(\mu _{2k}^{(1)})^2} =\frac{\varPsi ^2(\lambda _{1})}{\varPsi ^2(\lambda _{n})} \lim _{k\rightarrow \infty }\frac{(\mu _{2k+1}^{(1)})^2}{(\mu _{2k+1}^{(n)})^2}=c^2. \end{aligned}$$
(49)

Note that \(1/\lambda _n<\alpha _k<1/\lambda _1\) by assumption (40). Moreover, we have by (16) that

$$\begin{aligned} \mu _{2k+2}^{(1)}= \prod _{\ell =0}^1 (1-\alpha _{2k+\ell }\lambda _1)\mu _{2k}^{(1)} \end{aligned}$$

and

$$\begin{aligned} \mu _{2k+2}^{(n)}= \prod _{\ell =0}^1 (1-\alpha _{2k+\ell }\lambda _n)\mu _{2k}^{(n)}. \end{aligned}$$

Thus, the sequence \(\Big \{\frac{\mu _{2k}^{(n)}}{\mu _{2k}^{(1)}}\Big \}\) does not change sign, and similarly for \(\Big \{\frac{\mu _{2k+1}^{(1)}}{\mu _{2k+1}^{(n)}}\Big \}\). Hence, without loss of generality, we can assume by (49) that

$$\begin{aligned} c=\lim _{k\rightarrow \infty }\frac{\mu _{2k}^{(n)}}{\mu _{2k}^{(1)}}. \end{aligned}$$
(50)

Then, by (16), (45) and (50), we have

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{\mu _{2k+1}^{(1)}}{\mu _{2k+1}^{(n)}} =\lim _{k\rightarrow \infty }\frac{\mu _{2k}^{(1)}(1-\alpha _{2k}\lambda _1)}{\mu _{2k}^{(n)}(1-\alpha _{2k}\lambda _n)} =-c\frac{\varPsi (\lambda _{n})}{\varPsi (\lambda _{1})}, \end{aligned}$$

which gives (48). \(\square \)

We have the following results on the asymptotic convergence of the function value.

Theorem 2

Under the conditions of Theorem 1, we have

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{f(x_{2k+1})-f^*}{f(x_{2k})-f^*} =R_f^1 \quad \text{ and } \quad \lim _{k\rightarrow \infty }\frac{f(x_{2k+2})-f^*}{f(x_{2k+1})-f^*} =R_f^2, \end{aligned}$$
(51)

where

$$\begin{aligned} R_f^1= & {} \frac{c^2(\kappa -1)^2(\varPsi ^2(\lambda _{1})+c^2\kappa \varPsi ^2(\lambda _{n}))}{(\varPsi (\lambda _{1})+c^2\kappa \varPsi (\lambda _{n}))^2(c^2+\kappa )}, \end{aligned}$$
(52)
$$\begin{aligned} R_f^2= & {} \frac{c^2(\kappa -1)^2(c^2+\kappa )\varPsi ^2(\lambda _{1})\varPsi ^2(\lambda _{n})}{(c^2\varPsi (\lambda _{n})+\kappa \varPsi (\lambda _{1}))^2(\varPsi ^2(\lambda _{1})+c^2\kappa \varPsi ^2(\lambda _{n}))}. \end{aligned}$$
(53)

In addition, if \(\varPsi (\lambda _{n})=\varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\), then \(R_f^1=R_f^2\).

Proof

Let \(\epsilon _k=x_k-x^*\). Since \(g_k=A\epsilon _k\), by (14), we have

$$\begin{aligned} \epsilon _k=\sum _{i=1}^n\lambda _i^{-1}\mu _k^{(i)}\xi _i. \end{aligned}$$

By Theorem 1, we only need to consider the case \(\mu _k^{(i)}=0\), \(i=2,\ldots ,n-1\), that is,

$$\begin{aligned} \epsilon _k=\lambda _1^{-1}\mu _k^{(1)}\xi _1+\lambda _n^{-1}\mu _k^{(n)}\xi _n. \end{aligned}$$

Thus,

$$\begin{aligned} f(x_k)-f^*&=\frac{1}{2}\epsilon _k ^T A\epsilon _k =\frac{1}{2}\frac{\lambda _n(\mu _k^{(1)})^2+\lambda _1(\mu _k^{(n)})^2}{\lambda _1\lambda _n}. \end{aligned}$$
(54)

Since

$$\begin{aligned} g_k=\mu _k^{(1)}\xi _1+\mu _k^{(n)}\xi _n \quad \text{ and } \quad \alpha _k=\frac{\varPsi (\lambda _{1})(\mu _k^{(1)})^2+\varPsi (\lambda _{n})(\mu _k^{(n)})^2}{\lambda _1\varPsi (\lambda _{1})(\mu _k^{(1)})^2+\lambda _n\varPsi (\lambda _{n})(\mu _k^{(n)})^2}, \end{aligned}$$

by the definition of \(\epsilon _k\) and the update rule (2), we further have that

$$\begin{aligned} \epsilon _{k+1}&=\epsilon _k-\alpha _kg_k =(\lambda _1^{-1}-\alpha _k)\mu _k^{(1)}\xi _1+(\lambda _n^{-1}-\alpha _k)\mu _k^{(n)}\xi _n\\&=\frac{\varPsi \left( \lambda _{n}\right) \left( \lambda _n-\lambda _1\right) \left( \mu _k^{(n)}\right) ^2\mu _k^{(1)}}{\lambda _1\left( \lambda _1\varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\lambda _n\varPsi (\lambda _{n})\left( \mu _k^{(n)}\right) ^2\right) }\xi _1\\&\quad +\frac{\varPsi \left( \lambda _{1}\right) \left( \lambda _1-\lambda _n\right) \left( \mu _k^{(1)}\right) ^2\mu _k^{(n)}}{\lambda _n\left( \lambda _1\varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\lambda _n\varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\right) }\xi _n\\&= \frac{\left( \lambda _n-\lambda _1\right) \left( \lambda _n\varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\mu _k^{(1)}\xi _1 -\lambda _1\varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2\mu _k^{(n)}\xi _n\right) }{\lambda _1\lambda _n\left( \lambda _1\varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\lambda _n\varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\right) }. \end{aligned}$$

Hence, we obtain

$$\begin{aligned} f(x_{k+1})-f^*&=\frac{1}{2}\epsilon _{k+1} ^T A\epsilon _{k+1}\nonumber \\&= \frac{1}{2}\frac{\left( \lambda _n-\lambda _1\right) ^2\left( \mu _k^{(1)}\right) ^2\left( \mu _k^{(n)}\right) ^2 \left( \lambda _n\varPsi ^2\left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2+\lambda _1\varPsi ^2\left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2\right) }{\lambda _1\lambda _n\left( \lambda _1\varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\lambda _n\varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\right) ^2}. \end{aligned}$$
(55)

Combining (54) with (55) yields that

$$\begin{aligned} \frac{f(x_{k+1})-f^*}{f(x_{k})-f^*}&=\frac{\epsilon _{k+1} ^T A\epsilon _{k+1}}{\epsilon _k ^T A\epsilon _k}\\&= \frac{\left( \mu _k^{(1)}\right) ^2\left( \mu _k^{(n)}\right) ^2\left( \kappa -1\right) ^2 \left( \kappa \varPsi ^2\left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2+\varPsi ^2\left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2\right) }{\left( \varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\kappa \varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\right) ^2 \left( \kappa \left( \mu _k^{(1)}\right) ^2+\left( \mu _k^{(n)}\right) ^2\right) }, \end{aligned}$$

which gives (51) by substituting the limits of \((\mu _k^{(1)})^2\) and \(\left( \mu _k^{(n)}\right) ^2\) in Theorem 1.

Notice \(\kappa >1\) by our assumption. So, \(R_f^1=R_f^2\) is equivalent to

$$\begin{aligned} \frac{\varPsi ^2(\lambda _{1})+c^2\kappa \varPsi ^2(\lambda _{n})}{(\varPsi (\lambda _{1})+c^2\kappa \varPsi (\lambda _{n}))^2(c^2+\kappa )} =\frac{(c^2+\kappa )\varPsi ^2(\lambda _{1})\varPsi ^2(\lambda _{n})}{(c^2\varPsi (\lambda _{n})+\kappa \varPsi (\lambda _{1}))^2(\varPsi ^2(\lambda _{1})+c^2\kappa \varPsi ^2(\lambda _{n}))}, \end{aligned}$$

which by rearranging terms gives

$$\begin{aligned} c^4\varPsi ^2(\lambda _{n})(\varPsi (\lambda _{n})-\varPsi (\lambda _{1}))=\varPsi ^2(\lambda _{1})(\varPsi (\lambda _{n})-\varPsi (\lambda _{1})). \end{aligned}$$

Hence, \(R_f^1=R_f^2\) holds if \(\varPsi (\lambda _{n})=\varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\). \(\square \)

Remark 1

Theorem 2 indicates that, when \(\varPsi (A)=I\) (i.e., the SD method), the two sequences \(\Big \{\frac{\varDelta _{2k+1}}{\varDelta _{2k}} \Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}} \Big \}\) converge at the same speed, where \(\varDelta _k = f(x_k) - f^*\). Otherwise, the two sequences may converge at different rates.

To illustrate the results in Theorem 2, we apply the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to an instance of (3), where the initial point is the vector of all ones, the matrix A is diagonal with

$$\begin{aligned} A_{ii}=i\sqrt{i},~~i=1,\ldots ,n, \end{aligned}$$
(56)

and \(b=0\). Fig. 1 clearly shows the difference between \(R_f^1\) and \(R_f^2\).

Fig. 1 Problem (56) with \(n=10\): convergence history of the sequences \(\big \{1- \frac{\varDelta _{2k+1}}{\varDelta _{2k}}\big \}\) and \(\big \{1- \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}}\big \}\) generated by the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method)
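
A short script along the following lines (our reconstruction, not the authors' code) reproduces the experiment behind Fig. 1 and prints the tails of the two ratio sequences, whose distinct limits reflect \(R_f^1\ne R_f^2\).

```python
import numpy as np

# Reconstruction (ours, not the authors' code) of the experiment behind Fig. 1:
# the MG method on the diagonal problem (56) with n = 10, b = 0 and x_0 = (1,...,1).
n = 10
lam = np.array([i * np.sqrt(i) for i in range(1, n + 1)])
A = np.diag(lam)
x = np.ones(n)
fvals = []
for k in range(120):
    g = A @ x                                  # b = 0, so f* = 0 and f(x) = 0.5 x'Ax
    fvals.append(0.5 * x @ A @ x)
    Ag = A @ g
    x = x - ((g @ Ag) / (Ag @ Ag)) * g         # MG stepsize (6)

r = np.array(fvals[1:]) / np.array(fvals[:-1])              # Delta_{k+1} / Delta_k
print("1 - Delta_{2k+1}/Delta_{2k}   :", 1 - r[0::2][-4:])  # tends to 1 - R_f^1
print("1 - Delta_{2k+2}/Delta_{2k+1} :", 1 - r[1::2][-4:])  # tends to 1 - R_f^2
```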

The next theorem shows the asymptotic convergence of the gradient norm.

Theorem 3

Under the conditions of Theorem 1, the following limits hold,

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{\Vert g_{2k+1}\Vert ^2}{\Vert g_{2k}\Vert ^2} =R_g^1 \quad \text{ and } \quad \lim _{k\rightarrow \infty }\frac{\Vert g_{2k+2}\Vert ^2}{\Vert g_{2k+1}\Vert ^2} =R_g^2, \end{aligned}$$
(57)

where

$$\begin{aligned} R_g^1= & {} \frac{c^2(\kappa -1)^2(\varPsi ^2(\lambda _{1})+c^2\varPsi ^2(\lambda _{n}))}{(1+c^2)(\varPsi (\lambda _{1})+c^2\kappa \varPsi (\lambda _{n}))^2}, \end{aligned}$$
(58)
$$\begin{aligned} R_g^2= & {} \frac{c^2(1+c^2)(\kappa -1)^2\varPsi ^2(\lambda _{1})\varPsi ^2(\lambda _{n})}{(c^2\varPsi (\lambda _{n})+\kappa \varPsi (\lambda _{1}))^2(\varPsi ^2(\lambda _{1})+c^2\varPsi ^2(\lambda _{n}))}. \end{aligned}$$
(59)

In addition, if \(\varPsi (\lambda _{n})=\kappa \varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\), then \(R_g^1=R_g^2\).

Proof

Using the same arguments as in Theorem 2, we have

$$\begin{aligned} \Vert g_{k}\Vert ^2=(\mu _k^{(1)})^2+(\mu _k^{(n)})^2 \end{aligned}$$

and

$$\begin{aligned} \Vert g_{k+1}\Vert ^2&=\epsilon _{k+1} ^T A^2\epsilon _{k+1}\\&=\frac{\left( \lambda _n-\lambda _1\right) ^2\left( \mu _k^{(1)}\right) ^2\left( \mu _k^{(n)}\right) ^2 \left( \varPsi ^2\left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2+\varPsi ^2\left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2\right) }{\left( \lambda _1\varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\lambda _n\varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\right) ^2}, \end{aligned}$$

which together give

$$\begin{aligned} \frac{\Vert g_{k+1}\Vert ^2}{\Vert g_{k}\Vert ^2} = \frac{(\kappa -1)^2(\mu _k^{(1)})^2(\mu _k^{(n)})^2 \left( \varPsi ^2\left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2+\varPsi ^2\left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2\right) }{\left( \varPsi \left( \lambda _{1}\right) \left( \mu _k^{(1)}\right) ^2+\kappa \varPsi \left( \lambda _{n}\right) \left( \mu _k^{(n)}\right) ^2\right) ^2 \left( \left( \mu _k^{(1)}\right) ^2+\left( \mu _k^{(n)}\right) ^2\right) }. \end{aligned}$$

Thus, (57) follows by substituting the limits of \((\mu _k^{(1)})^2\) and \((\mu _k^{(n)})^2\) in Theorem 1.

Notice \(\kappa >1\) by our assumption. So, \(R_g^1=R_g^2\) is equivalent to

$$\begin{aligned} \frac{\varPsi ^2(\lambda _{1})+c^2\varPsi ^2(\lambda _{n})}{(1+c^2)(\varPsi (\lambda _{1})+c^2\kappa \varPsi (\lambda _{n}))^2} =\frac{(1+c^2)\varPsi ^2(\lambda _{1})\varPsi ^2(\lambda _{n})}{(c^2\varPsi (\lambda _{n})+\kappa \varPsi (\lambda _{1}))^2 (\varPsi ^2(\lambda _{1})+c^2\varPsi ^2(\lambda _{n}))}, \end{aligned}$$

which by rearranging terms gives

$$\begin{aligned} c^4\varPsi ^2(\lambda _{n})(\kappa \varPsi (\lambda _{1})-\varPsi (\lambda _{n}))= \varPsi ^2(\lambda _{1})(\kappa \varPsi (\lambda _{1})-\varPsi (\lambda _{n})). \end{aligned}$$

Hence, \(R_g^1=R_g^2\) holds if \(\varPsi (\lambda _{n})=\kappa \varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\). \(\square \)

Remark 2

Theorem 3 indicates that the two sequences \(\Big \{\frac{\Vert g_{2k+1}\Vert ^2}{\Vert g_{2k}\Vert ^2}\Big \}\) and \(\Big \{\frac{\Vert g_{2k+2}\Vert ^2}{\Vert g_{2k+1}\Vert ^2}\Big \}\) generated by the method (11) with \(\varPsi (A)=A\) (i.e., the MG method) converge at the same rate. Otherwise, the two sequences may converge at different rates.

By Theorems 2 and 3, we can obtain the following corollary.

Corollary 3

Under the conditions of Theorem 1, we have

$$\begin{aligned}&\lim _{k\rightarrow \infty }\frac{f(x_{2k+3})-f^*}{f(x_{2k+1})-f^*} =\lim _{k\rightarrow \infty }\frac{f(x_{2k+2})-f^*}{f(x_{2k})-f^*} =R_f^1R_f^2, \end{aligned}$$
(60)
$$\begin{aligned}&\lim _{k\rightarrow \infty }\frac{\Vert g_{2k+3}\Vert ^2}{\Vert g_{2k+1}\Vert ^2} =\lim _{k\rightarrow \infty }\frac{\Vert g_{2k+2}\Vert ^2}{\Vert g_{2k}\Vert ^2} =R_g^1R_g^2. \end{aligned}$$
(61)

In addition,

$$\begin{aligned} R_f^1R_f^2=R_g^1R_g^2=\frac{c^4(\kappa -1)^4\varPsi ^2(\lambda _{1})\varPsi ^2(\lambda _{n})}{(\varPsi (\lambda _{1})+c^2\kappa \varPsi (\lambda _{n}))^2(c^2\varPsi (\lambda _{n})+\kappa \varPsi (\lambda _{1}))^2}. \end{aligned}$$
(62)

Remark 3

Corollary 3 shows that the odd and even subsequences of objective values and gradient norms converge at the same rate. Moreover, we have

$$\begin{aligned} R_f^1R_f^2=R_g^1R_g^2 =\frac{(\kappa -1)^4}{(1+\kappa /t+t\kappa +\kappa ^2)^2} \le \left( \frac{\kappa -1}{\kappa +1}\right) ^4, \end{aligned}$$
(63)

where \(t= c^2 \varPsi (\lambda _{n})/\varPsi (\lambda _{1})\). Notice that the right-hand side of (63) depends only on \(\kappa \), which implies that the odd and even subsequences generated by all the gradient methods (11) have the same worst-case asymptotic rate, independent of \(\varPsi \).
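
The bound (63) is elementary, and it can also be checked numerically; the following sketch (ours) samples \(t>0\) and \(\kappa >1\) at random and verifies that the composite rate never exceeds \(\left( \frac{\kappa -1}{\kappa +1}\right) ^4\).

```python
import numpy as np

# Quick numerical check (ours) of the bound (63): for t > 0 and kappa > 1,
# (kappa-1)^4 / (1 + kappa/t + t*kappa + kappa^2)^2 <= ((kappa-1)/(kappa+1))^4.
rng = np.random.default_rng(0)
t = rng.uniform(1e-3, 1e3, 100000)
kappa = rng.uniform(1.0 + 1e-6, 1e4, 100000)
lhs = (kappa - 1) ** 4 / (1 + kappa / t + t * kappa + kappa ** 2) ** 2
rhs = ((kappa - 1) / (kappa + 1)) ** 4
print(bool(np.all(lhs <= rhs + 1e-12)))       # expected: True
```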

Now, as in [26], we define the minimum deviation

$$\begin{aligned} \sigma =\min _{i\in \mathcal {I}}~\left| \frac{2 \lambda _i- (\lambda _1+\lambda _n)}{ \lambda _n-\lambda _1 }\right| , \end{aligned}$$
(64)

where

$$\begin{aligned} \mathcal {I}=\{i: \lambda _1<\lambda _i<\lambda _n,~g_0 ^T \xi _i\ne 0,~\text {and}~\alpha _k\ne 1/\lambda _i~\text {for all}~k\}. \end{aligned}$$

Clearly, \(\sigma \in (0,1)\). We now close this section by deriving a bound on the constant c defined in Theorem 1. The following theorem generalizes the results in [1] and [26], where only the case \(\varPsi (A)=I\) (i.e., the SD method) is considered.

Theorem 4

Under the conditions of Theorem 1, and assuming that \(\mathcal {I}\) is nonempty, we have

$$\begin{aligned} \frac{\varPsi (\lambda _{1})}{\varPsi (\lambda _{n})}\frac{1}{\phi _\sigma }\le c^2\le \frac{\varPsi (\lambda _{1})}{\varPsi (\lambda _{n})}\phi _\sigma , \end{aligned}$$
(65)

where

$$\begin{aligned} \phi _\sigma =\frac{2+\eta _\sigma +\sqrt{\eta _\sigma ^2+4\eta _\sigma }}{2} \quad \text{ and } \quad \eta _\sigma =4\left( \frac{1+\sigma ^2}{1-\sigma ^2}\right) . \end{aligned}$$
(66)

Proof

Let \(p=q_0\). By the definition of T, we have that

$$\begin{aligned} \frac{(T^{k+2}p)^{(i)}}{(T^{k+2}p)^{(1)}}= \frac{(T^{k}p)^{(i)}}{(T^{k}p)^{(1)}} \frac{(\lambda _i-\gamma (T^{k}p))^2(\lambda _i-\gamma (T^{k+1}p))^2}{(\lambda _1-\gamma (T^{k}p))^2(\lambda _1-\gamma (T^{k+1}p))^2}. \end{aligned}$$
(67)

It follows from Theorem 1 and Lemma 3 that

$$\begin{aligned} \frac{(T^{k}p)^{(i)}}{(T^{k}p)^{(1)}}\rightarrow 0,~i=2,\ldots ,n-1. \end{aligned}$$
(68)

By the continuity of T and (37) in Lemma 3, we always have that

$$\begin{aligned} \frac{(\lambda _i-\gamma (T^{k}p))^2(\lambda _i-\gamma (T^{k+1}p))^2}{(\lambda _1-\gamma (T^{k}p))^2(\lambda _1-\gamma (T^{k+1}p))^2}\rightarrow \frac{(\lambda _i-\gamma (p_*))^2(\lambda _i-\gamma (Tp_*))^2}{(\lambda _1-\gamma (p_*))^2(\lambda _1-\gamma (Tp_*))^2}, \end{aligned}$$

which together with (67) and (68) implies that

$$\begin{aligned} \frac{(\lambda _i-\gamma (p_*))^2(\lambda _i-\gamma (Tp_*))^2}{(\lambda _1-\gamma (p_*))^2(\lambda _1-\gamma (Tp_*))^2} \le 1,~i=2,\ldots ,n-1, \end{aligned}$$
(69)

where \(p_*\) is the same vector as in Lemma 3. Clearly, (69) also holds for \(i=1\). As for \(i=n\), it follows from (33) in Lemma 2 and Theorem 1 that

$$\begin{aligned} \gamma (p_*)+\gamma (Tp_*)=\lambda _1+\lambda _n, \end{aligned}$$
(70)

which yields that

$$\begin{aligned} \frac{(\lambda _n-\gamma (p_*))^2(\lambda _n-\gamma (Tp_*))^2}{(\lambda _1-\gamma (p_*))^2(\lambda _1-\gamma (Tp_*))^2}=1. \end{aligned}$$

Thus, (69) holds for \(i=1,\ldots ,n\). Hence, we have

$$\begin{aligned}&\left( \lambda _i-\delta -\left( \gamma (p_*)-\delta \right) \right) ^2 \left( \lambda _i-\delta -\left( \gamma (Tp_*)-\delta \right) \right) ^2\nonumber \\&\quad \le \left( \lambda _1-\delta -\left( \gamma (p_*)-\delta \right) \right) ^2 \left( \lambda _1-\delta -\left( \gamma (Tp_*)-\delta \right) \right) ^2, \end{aligned}$$
(71)

where \(\delta =\frac{\lambda _1+\lambda _n}{2}\). By (70) and (71), we obtain

$$\begin{aligned}&\left( \lambda _i-\delta -\left( \gamma (p_*)-\delta \right) \right) ^2 \left( \lambda _i-\delta +\left( \gamma (p_*)-\delta \right) \right) ^2 \\&\quad \le \left( \frac{\lambda _1-\lambda _n}{2} -\left( \gamma (p_*)-\delta \right) \right) ^2 \left( \frac{\lambda _1-\lambda _n}{2} +\left( \gamma (p_*)-\delta \right) \right) ^2, \end{aligned}$$

which implies that

$$\begin{aligned} \left( \frac{\lambda _1-\lambda _n}{2}\right) ^2+ \left( \lambda _i-\delta \right) ^2 \ge 2\left( \gamma (p_*)-\delta \right) ^2. \end{aligned}$$
(72)

By Lemma 2 and Theorem 1, we have that

$$\begin{aligned} \gamma (p_*)=\frac{\lambda _1\varPsi (\lambda _{1})p_*^{(1)}+\lambda _n\varPsi (\lambda _{n})p_*^{(n)}}{\varPsi (\lambda _{1})p_*^{(1)}+\varPsi (\lambda _{n})p_*^{(n)}}. \end{aligned}$$

Substituting \(\gamma (p_*)\) into (72), we obtain

$$\begin{aligned} \left( \frac{\lambda _1-\lambda _n}{2}\right) ^2+ \left( \lambda _i-\delta \right) ^2 \ge \frac{(\lambda _n-\lambda _1)^2(\varPsi (\lambda _{n})c^2-\varPsi (\lambda _{1}))^2}{2(\varPsi (\lambda _{n})c^2+\varPsi (\lambda _{1}))^2}, \end{aligned}$$

which gives

$$\begin{aligned} 4\left( \frac{1+\sigma _i^2}{1-\sigma _i^2}\right) \ge \frac{(c^2\varPsi (\lambda _{n})-\varPsi (\lambda _{1}))^2}{c^2\varPsi (\lambda _{1})\varPsi (\lambda _{n})}, \end{aligned}$$
(73)

where \(\sigma _i=\frac{2 \lambda _i-(\lambda _1+\lambda _n)}{\lambda _n-\lambda _1}\). Since (73) holds for all \(i\in \mathcal {I}\), we have

$$\begin{aligned} \frac{(c^2\varPsi (\lambda _{n})-\varPsi (\lambda _{1}))^2}{c^2\varPsi (\lambda _{1})\varPsi (\lambda _{n})}\le \eta _\sigma , \end{aligned}$$
(74)

which implies (65). This completes the proof. \(\square \)

3 Techniques for Breaking the Zigzagging Pattern

As shown in the previous section, all the gradient methods (11) asymptotically conduct their searches in the two-dimensional subspace spanned by \(\xi _1\) and \(\xi _n\). By (16), if either \(\mu _k^{(1)}\) or \(\mu _k^{(n)}\) equals zero, the corresponding component vanishes at all subsequent iterations. Hence, in order to break the undesired zigzagging pattern, a good strategy is to employ some stepsize approximating \(1/\lambda _1\) or \(1/\lambda _n\). In this section, we derive a new stepsize converging to \(1/\lambda _n\) and propose a periodic gradient method based on it.
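
The elimination effect behind this strategy is immediate from (16); the small sketch below (ours, with an arbitrary diagonal matrix) takes one step with \(\alpha =1/\lambda _n\) and shows that the n-th gradient component becomes zero and remains zero afterwards.

```python
import numpy as np

# Sketch (ours) of the elimination effect behind (16): one step with alpha = 1/lambda_n
# zeroes the n-th gradient component, and (16) keeps it at zero afterwards.
lam = np.array([1.0, 4.0, 10.0])
A = np.diag(lam)
g = np.array([1.0, 1.0, 1.0])                 # current gradient, components mu_k^{(i)}
g = g - (1.0 / lam[-1]) * (A @ g)             # g_{k+1} = (I - alpha_k A) g_k with alpha_k = 1/lambda_n
print(g)                                      # last component is exactly 0
g = g - 0.3 * (A @ g)                         # any later stepsize leaves that component at 0
print(g)
```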

3.1 A New Stepsize

Our new stepsize will be derived by imposing finite termination on the minimization of a two-dimensional strictly convex quadratic function; see [31] for the case \(\varPsi (A)=I\) (i.e., the SD method).

To proceed, consider the quadratic optimization (3) with \(n=2\). Suppose \(x_3\) is the minimizer after the following 3 iterations:

$$\begin{aligned} x_1&= x_0 - \alpha _0g_0,\\ x_2&= x_1 - \alpha _1g_1,\\ x_3&= x_2 - \alpha _2g_2, \end{aligned}$$

where \(g_i \ne 0\), \(i=0,1,2\), \(\alpha _0\) and \(\alpha _2\) are stepsizes given by (11), and \(\alpha _1\) is a stepsize to be derived.

By the stepsize definition (11), we have

$$\begin{aligned} g_0 ^T \varPsi (A)g_{1}=g_0 ^T \varPsi (A)g_0-\alpha _0g_0 ^T \varPsi (A)Ag_0=0. \end{aligned}$$
(75)

Hence, for any given \(r \in \mathbb {R}\), \(\Big \{\frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }, \frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert }\Big \}\) forms an orthonormal basis of \(\mathbb {R}^2\). Then, for any \(x\in \mathbb {R}^2\), \(x-x_1\) can be expressed in this basis; that is, there exist \(t,l\in \mathbb {R}\) such that

$$x-x_1=t\frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }+l\frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert }.$$

We expand f(x) at \(x_1\) and obtain

$$\begin{aligned} f(x)= & {} f\left( x_1+t\frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert } +l\frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert }\right) \nonumber \\= & {} f(x_1) +\vartheta ^T \begin{pmatrix} t \\ l \\ \end{pmatrix} + \frac{1}{2} \begin{pmatrix} t \\ l \\ \end{pmatrix} ^T H \begin{pmatrix} t \\ l \\ \end{pmatrix}:=\varphi (t,l), \end{aligned}$$
(76)

where

$$\begin{aligned} \vartheta = B g_1 = \begin{pmatrix} \frac{g_1 ^T \varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }\\ \frac{g_1 ^T \varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert } \end{pmatrix} \text{ with } B=\begin{pmatrix} \frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }, \frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert } \\ \end{pmatrix} ^T \end{aligned}$$
(77)

and

$$\begin{aligned} H= B A B ^T&= \begin{pmatrix} \frac{g_0 ^T \varPsi ^{2r}(A)Ag_0}{\Vert \varPsi ^r(A)g_0\Vert ^2} &{} \frac{g_0 ^T \varPsi (A) Ag_1}{\Vert \varPsi ^r(A)g_0\Vert \Vert \varPsi ^{1-r}(A)g_1\Vert } \\ \frac{g_0 ^T \varPsi (A)Ag_1}{\Vert \varPsi ^r(A)g_0\Vert \Vert \varPsi ^{1-r}(A)g_1\Vert } &{} \frac{g_1 ^T \varPsi ^{2(1-r)}(A)Ag_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert ^2}\\ \end{pmatrix}\nonumber \\&= \begin{pmatrix} \frac{g_0 ^T \varPsi ^{2r}(A)Ag_0}{\Vert \varPsi ^r(A)g_0\Vert ^2} &{} -\frac{g_1 ^T \varPsi (A) g_1}{\alpha _0\Vert \varPsi ^r(A)g_0\Vert \Vert \varPsi ^{1-r}(A)g_1\Vert } \\ -\frac{g_1 ^T \varPsi (A) g_1}{\alpha _0\Vert \varPsi ^r(A)g_0\Vert \Vert \varPsi ^{1-r}(A)g_1\Vert } &{} \frac{g_1 ^T \varPsi ^{2(1-r)}(A)Ag_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert ^2} \\ \end{pmatrix}, \end{aligned}$$
(78)

where the last equality is due to \(g_0 ^T \varPsi (A)Ag_1=-\frac{g_1 ^T \varPsi (A)g_1}{\alpha _0}\), which follows from \(g_1=g_0-\alpha _0Ag_0\) and (75). Note that \(B ^T B = B B ^T = I\). Hence, H is positive definite and has the same eigenvalues as A. Consequently, the minimizer of \(\varphi \) is given by \(\begin{pmatrix} t^* \\ l^* \\ \end{pmatrix} =-H^{-1}\vartheta \).

When \(x_3\) is the solution, we must have

$$\begin{aligned} x_3=x_1+t^*\frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }+l^*\frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert }=x_1+B ^T \begin{pmatrix} t^* \\ l^* \\ \end{pmatrix}. \end{aligned}$$
(79)

Due to the fact that \(x_3 = x_2 - \alpha _2 g_2\), we know \(x_3-x_2\) is parallel to \(g_2\). It follows from (79) and \(x_2 = x_1 - \alpha _1 g_1\) that

$$\begin{aligned} B ^T \begin{pmatrix} t^* \\ l^* \\ \end{pmatrix} + \alpha _1 g_1 ~~\parallel ~~ g_2, \end{aligned}$$
(80)

which, by noting \(g_2 = g_1 - \alpha _1 Ag_1\) and expressing the two vectors in (80) by the above basis, is equivalent to

$$\begin{aligned} \begin{pmatrix} t^* \\ l^* \\ \end{pmatrix} -(-\alpha _1\vartheta ) = -(H^{-1}\vartheta -\alpha _1 \vartheta ) ~~\parallel ~~ \vartheta +H(-\alpha _1\vartheta ). \end{aligned}$$
(81)

Denote the components of \(\vartheta \) by \(\vartheta _i\), and the components of H by \(H_{ij}\), \(i,j=1,2\). By (81), the two vectors

$$\begin{aligned} \begin{pmatrix} H_{22}\vartheta _1-H_{12}\vartheta _2-\alpha _1\zeta \vartheta _1 \\ H_{11}\vartheta _2-H_{12}\vartheta _1-\alpha _1\zeta \vartheta _2\\ \end{pmatrix} \quad \text{ and } \quad \begin{pmatrix} \vartheta _1-\alpha _1(H_{11}\vartheta _1+H_{12}\vartheta _2) \\ \vartheta _2-\alpha _1(H_{12}\vartheta _1+H_{22}\vartheta _2)\\ \end{pmatrix} \end{aligned}$$

are parallel, where \(\zeta =\text {det}(H) = \text {det}(A) >0\). It follows that

$$\begin{aligned}&(H_{22}\vartheta _1-H_{12}\vartheta _2-\alpha _1\zeta \vartheta _1) [\vartheta _2-\alpha _1(H_{12}\vartheta _1+H_{22}\vartheta _2)]\\&\quad = (H_{11}\vartheta _2-H_{12}\vartheta _1-\alpha _1\zeta \vartheta _2) [\vartheta _1-\alpha _1(H_{11}\vartheta _1+H_{12}\vartheta _2)], \end{aligned}$$

which gives

$$\begin{aligned} \alpha _1^2\zeta \varGamma -\alpha _1(H_{11}+H_{22})\varGamma +\varGamma =0, \end{aligned}$$
(82)

where

$$\begin{aligned} \varGamma =(H_{12}\vartheta _1+H_{22}\vartheta _2)\vartheta _1-(H_{11}\vartheta _1+H_{12}\vartheta _2)\vartheta _2. \end{aligned}$$

On the other hand, if (82) holds, then (80) holds, which by (77), \(H^{-1} = B A^{-1} B ^T\) and \(B ^T B=I\) implies that

$$\begin{aligned} - B ^T H^{-1} \vartheta + \alpha _1 g_1 = - A^{-1} g_1 + \alpha _1 g_1 = - A^{-1}(g_1- \alpha _1 A g_1) = - A^{-1} g_2 \end{aligned}$$

is parallel to \(g_2\). Hence, \(g_2\) is an eigenvector of A, i.e. \(Ag_2 = \lambda g_2\) for some \(\lambda > 0\), since \(g_2 \ne 0\). So, by (11), we get \(\alpha _2 = \varPsi (\lambda ) g_2 ^T g_2 /(\lambda \varPsi (\lambda ) g_2 ^T g_2 ) = 1/\lambda \). Therefore, \(g_3 = g_2 - \alpha _2 A g_2 = g_2 - \alpha _2 \lambda g_2 = 0\), which implies \(x_3\) is the solution. Thus, (82) guarantees \(x_3\) is the minimizer.

Based on the above analysis, to ensure \(x_3\) is the minimizer, we only need to choose \(\alpha _1\) satisfying

$$\begin{aligned} \alpha _1^2\zeta -\alpha _1(H_{11}+H_{22})+1=0, \end{aligned}$$
(83)

whose two positive roots are

$$\begin{aligned} \frac{(H_{11}+H_{22})\pm \sqrt{(H_{11}+H_{22})^2-4\zeta }}{2\zeta }. \end{aligned}$$

These two roots are \(1/\lambda _1\) and \(1/\lambda _2\), where \(0< \lambda _1 < \lambda _2\) are the two eigenvalues of A. For numerical reasons (see the next subsection), we choose \(\alpha _1\) to be the smaller root \(1/\lambda _2\), which can be calculated as

$$\begin{aligned} \alpha _1&= \frac{2}{(H_{11}+H_{22})+\sqrt{(H_{11}+H_{22})^2-4\zeta }} \nonumber \\&= \frac{2}{(H_{11}+H_{22})+\sqrt{(H_{11}-H_{22})^2+4H_{12}^2}}. \end{aligned}$$
(84)
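
The finite termination property can be checked directly. The following sketch (ours) uses \(\varPsi (A)=I\) and \(r=1\) for simplicity, so that \(\alpha _0\) and \(\alpha _2\) are exact SD stepsizes and H is formed as in (78); after the three steps the gradient is zero up to rounding.

```python
import numpy as np

# Check (ours) of the finite termination argument in the 2-D case, with Psi(A) = I and
# r = 1 for simplicity: alpha_0, alpha_2 are exact SD stepsizes and alpha_1 is (84).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = np.array([5.0, -7.0])

g0 = A @ x - b
a0 = (g0 @ g0) / (g0 @ (A @ g0))                  # alpha_0 from (11) with Psi(A) = I
x = x - a0 * g0
g1 = A @ x - b

H11 = (g0 @ (A @ g0)) / (g0 @ g0)                 # entries of H in (78) with Psi(A) = I
H22 = (g1 @ (A @ g1)) / (g1 @ g1)
H12 = -(g1 @ g1) / (a0 * np.linalg.norm(g0) * np.linalg.norm(g1))
a1 = 2.0 / ((H11 + H22) + np.sqrt((H11 - H22) ** 2 + 4 * H12 ** 2))   # stepsize (84)
x = x - a1 * g1
g2 = A @ x - b

a2 = (g2 @ g2) / (g2 @ (A @ g2))                  # alpha_2 from (11) with Psi(A) = I
x = x - a2 * g2
print(np.linalg.norm(A @ x - b))                  # essentially zero: x_3 is the minimizer
```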

For the general n-dimensional case, since \(g_{k-1} ^T \varPsi (A)g_{k}=0\), we propose our new stepsize at the k-th iteration as

$$\begin{aligned} \tilde{\alpha }_{k} =\frac{2}{(H_{11}^k+H_{22}^k)+\sqrt{(H_{11}^k-H_{22}^k)^2+4(H_{12}^k)^2}}, \end{aligned}$$
(85)

where \(H_{ij}^k\) denotes the (i, j)-th component of \(H^k\):

$$\begin{aligned} H^k= \begin{pmatrix} \frac{g_{k-1} ^T \varPsi ^{2r}(A)Ag_{k-1}}{\Vert \varPsi ^r(A)g_{k-1}\Vert ^2} &{} -\frac{g_k ^T \varPsi (A) g_k}{\alpha _{k-1}\Vert \varPsi ^r(A)g_{k-1}\Vert \Vert \varPsi ^{1-r}(A)g_k\Vert } \\ -\frac{g_k ^T \varPsi (A) g_k}{\alpha _{k-1}\Vert \varPsi ^r(A)g_{k-1}\Vert \Vert \varPsi ^{1-r}(A)g_k\Vert } &{} \frac{g_k ^T \varPsi ^{2(1-r)}(A)Ag_k}{\Vert \varPsi ^{1-r}(A)g_k\Vert ^2} \\ \end{pmatrix} \end{aligned}$$
(86)

and \(\alpha _{k-1}\) is given by (11). Clearly, \(\alpha _k^Y\) in (8) can be obtained by setting \(\varPsi (A)=I\) in (86). In addition, by (85) we have that

$$\begin{aligned} \frac{1}{H_{11}^k+H_{22}^k} \le \tilde{\alpha }_k \le \frac{1}{\max \{H_{11}^k,H_{22}^k\}}. \end{aligned}$$
(87)
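
As a concrete reading of (85)–(86), the sketch below (ours) computes \(\tilde{\alpha }_k\) from two consecutive gradients for \(\varPsi (A)=A^u\) with the particular choice \(r=1\); the function name, the default value of u and the test data are ours.

```python
import numpy as np

# Sketch (ours) of the new stepsize (85)-(86) for Psi(A) = A^u with the choice r = 1;
# g_prev, g_cur are g_{k-1}, g_k and alpha_prev is the stepsize (11) used between them.
def tilde_alpha(A, g_prev, g_cur, alpha_prev, u=1):
    Pg = np.linalg.matrix_power(A, u) @ g_prev            # Psi(A) g_{k-1}
    H11 = (Pg @ (A @ Pg)) / (Pg @ Pg)                     # g' Psi^2(A) A g / ||Psi(A) g||^2
    H22 = (g_cur @ (A @ g_cur)) / (g_cur @ g_cur)
    H12 = -(g_cur @ (np.linalg.matrix_power(A, u) @ g_cur)) / (
        alpha_prev * np.linalg.norm(Pg) * np.linalg.norm(g_cur))
    return 2.0 / ((H11 + H22) + np.sqrt((H11 - H22) ** 2 + 4 * H12 ** 2))

A = np.diag([1.0, 3.0, 9.0])
g_prev = np.array([1.0, 1.0, 1.0])
alpha_prev = (g_prev @ (A @ g_prev)) / ((A @ g_prev) @ (A @ g_prev))   # MG stepsize, u = 1
g_cur = g_prev - alpha_prev * (A @ g_prev)
print(tilde_alpha(A, g_prev, g_cur, alpha_prev))          # satisfies the bounds in (87)
```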

3.2 Spectral Property of the New Stepsize

The next theorem shows that the stepsize \(\tilde{\alpha }_k\) enjoys a desirable spectral property.

Theorem 5

Suppose that the conditions of Theorem 1 hold. Let \(\{x_k\}\) be the iterates generated by applying any gradient method of the family (11) to problem (3). Then

$$\begin{aligned} \lim _{k\rightarrow \infty }\tilde{\alpha }_k=\frac{1}{\lambda _n}. \end{aligned}$$
(88)

Proof

It follows from (41) and (42) of Theorem 1 that

$$\begin{aligned}&\lim _{k\rightarrow \infty }H_{11}^k =\lim _{k\rightarrow \infty }\frac{g_{k-1} ^T \varPsi ^{2r}(A)Ag_{k-1}}{\Vert g_{k-1}\Vert ^2} \frac{\Vert g_{k-1}\Vert ^2}{\Vert \varPsi ^r(A)g_{k-1}\Vert ^2}\\&\quad =\frac{\lambda _1(c^2\varPsi ^{2r}(\lambda _{1})\varPsi ^2(\lambda _{n})+\kappa \varPsi ^{2r}(\lambda _{n})\varPsi ^2(\lambda _{1}))}{c^2\varPsi ^{2r}(\lambda _{1})\varPsi ^2(\lambda _{n})+\varPsi ^{2r}(\lambda _{n})\varPsi ^2(\lambda _{1})} \end{aligned}$$

and

$$\begin{aligned} \lim _{k\rightarrow \infty }H_{22}^k&=\lim _{k\rightarrow \infty } \frac{g_k ^T \varPsi ^{2(1-r)}(A)Ag_k}{\Vert g_k\Vert ^2} \frac{\Vert g_k\Vert ^2}{\Vert \varPsi ^{1-r}(A)g_k\Vert ^2} \\&=\frac{\lambda _1(\varPsi ^{2(1-r)}(\lambda _{1})+\kappa c^2\varPsi ^{2(1-r)}(\lambda _{n}))}{\varPsi ^{2(1-r)}(\lambda _{1})+ c^2\varPsi ^{2(1-r)}(\lambda _{n})}\\&=\frac{\lambda _1(\varPsi ^2(\lambda _{1})\varPsi ^{2r}(\lambda _{n})+\kappa c^2\varPsi ^2(\lambda _{n})\varPsi ^{2r}(\lambda _{1}))}{\varPsi ^2(\lambda _{1})\varPsi ^{2r}(\lambda _{n})+ c^2\varPsi ^2(\lambda _{n})\varPsi ^{2r}(\lambda _{1})}, \end{aligned}$$

which give

$$\begin{aligned} \lim _{k\rightarrow \infty }(H_{11}^k+H_{22}^k) =\lambda _1(\kappa +1) \end{aligned}$$
(89)

and

$$\begin{aligned} \lim _{k\rightarrow \infty }(H_{11}^k-H_{22}^k) =\frac{\lambda _1(\kappa -1)(\varPsi ^2(\lambda _{1})\varPsi ^{2r}(\lambda _{n})- c^2\varPsi ^2(\lambda _{n})\varPsi ^{2r}(\lambda _{1}))}{\varPsi ^2(\lambda _{1})\varPsi ^{2r}(\lambda _{n})+ c^2\varPsi ^2(\lambda _{n})\varPsi ^{2r}(\lambda _{1})}. \end{aligned}$$
(90)

Then, by the update (2) and the definition of \(\alpha _{k-1}\) in (11), we have

$$\begin{aligned} g_{k} ^T \varPsi (A)g_{k}=-\alpha _{k-1}g_{k-1} ^T \varPsi (A)Ag_{k-1} +\alpha _{k-1}^2g_{k-1} ^T \varPsi (A)A^2g_{k-1}, \end{aligned}$$

which together with (42) in Theorem 1 and (46) in Corollary 1 yields that

$$\begin{aligned} \lim _{k\rightarrow \infty }(H_{12}^k)^2&=\lim _{k\rightarrow \infty }\frac{g_{k} ^T \varPsi (A)g_{k}}{\alpha _{k-1}^2\Vert \varPsi ^r(A)g_{k-1}\Vert ^2} \frac{g_k ^T \varPsi (A) g_k}{\Vert \varPsi ^{1-r}(A)g_k\Vert ^2}\\&= \lim _{k\rightarrow \infty } \left( -\frac{1}{\alpha _{k-1}}\frac{g_{k-1} ^T \varPsi (A)Ag_{k-1}}{\Vert \varPsi ^r(A)g_{k-1}\Vert ^2} +\frac{g_{k-1} ^T \varPsi (A)A^2g_{k-1}}{\Vert \varPsi ^r(A)g_{k-1}\Vert ^2}\right) \frac{g_k ^T \varPsi (A) g_k}{\Vert \varPsi ^{1-r}(A)g_k\Vert ^2}\\&= \Bigg [-\frac{\lambda _{1}(\kappa \varPsi (\lambda _{1})+c^2\varPsi (\lambda _{n}))}{\varPsi (\lambda _{1})+c^2\varPsi (\lambda _{n})} \frac{\lambda _1(c^2\varPsi (\lambda _{1})\varPsi ^2(\lambda _{n})+\kappa \varPsi (\lambda _{n})\varPsi ^2(\lambda _{1}))}{c^2\varPsi ^{2r}(\lambda _{1})\varPsi ^2(\lambda _{n})+\varPsi ^{2r}(\lambda _{n})\varPsi ^2(\lambda _{1})} \\&\quad + \frac{\lambda _1^2(c^2\varPsi (\lambda _{1})\varPsi ^2(\lambda _{n})+\kappa ^2\varPsi (\lambda _{n})\varPsi ^2(\lambda _{1}))}{c^2\varPsi ^{2r}(\lambda _{1})\varPsi ^2(\lambda _{n})+\varPsi ^{2r}(\lambda _{n})\varPsi ^2(\lambda _{1})} \Bigg ] \frac{(\varPsi (\lambda _{1})+ c^2\varPsi (\lambda _{n}))\varPsi ^{2r}(\lambda _{1})\varPsi ^{2r}(\lambda _{n})}{\varPsi ^2(\lambda _{1})\varPsi ^{2r}(\lambda _{n})+ c^2\varPsi ^2(\lambda _{n})\varPsi ^{2r}(\lambda _{1})}\\&= \frac{\lambda _1^2c^2(\kappa -1)^2\varPsi ^{2+2r}(\lambda _{1})\varPsi ^{2+2r}(\lambda _{n})}{(\varPsi ^2(\lambda _{1})\varPsi ^{2r}(\lambda _{n})+ c^2\varPsi ^2(\lambda _{n})\varPsi ^{2r}(\lambda _{1}))^2}. \end{aligned}$$

Then, from the above equality and (90), we obtain that

$$\begin{aligned} \lim _{k\rightarrow \infty }\sqrt{(H_{11}^k-H_{22}^k)^2+4(H_{12}^k)^2} =\lambda _1(\kappa -1). \end{aligned}$$
(91)

Combining (89) and (91), we have that

$$\begin{aligned} \lim _{k\rightarrow \infty }\tilde{\alpha }_k= \frac{2}{\lambda _1(\kappa +1)+\lambda _1(\kappa -1)}=\frac{1}{\lambda _n}. \end{aligned}$$

This completes the proof. \(\square \)

Remark 4

When \(r=1\), we have from (87) that \(\tilde{\alpha }_k \le 1/H_{22}^k = \alpha _k^{SD}\). Hence, using the stepsize \(\tilde{\alpha }_k\) gives a monotone gradient method. Theorem 5 indicates that the general \(\tilde{\alpha }_k\) has the asymptotic spectral property (88), and hence is asymptotically smaller than \(\alpha _k^{SD}\) regardless of r. However, a proper choice of r facilitates the calculation of \(\tilde{\alpha }_k\); this will become clearer in the next section.

Using similar arguments, we can also show that the larger stepsize derived in Sect. 3.1 converges to \(1/\lambda _1\).

Theorem 6

Let

$$\begin{aligned} \bar{\alpha }_k= \frac{2}{(H_{11}^k+H_{22}^k)-\sqrt{(H_{11}^k-H_{22}^k)^2+4(H_{12}^k)^2}}. \end{aligned}$$

Under the conditions of Theorem 5, we have

$$\begin{aligned} \lim _{k\rightarrow \infty }\bar{\alpha }_k=\frac{1}{\lambda _1}. \end{aligned}$$

To present an intuitive illustration of the asymptotic behaviors of \(\tilde{\alpha }_k\) and \(\bar{\alpha }_k\), we applied the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to minimize the quadratic function (3) with

$$\begin{aligned} A=\text {diag}\{a_1,a_2,\ldots ,a_n\} \quad \text{ and } \quad b=0, \end{aligned}$$
(92)

where \(a_1=1\), \(a_n=n\), and \(a_i\) is randomly generated between 1 and n for \(i=2,\ldots ,n-1\). From Fig. 2, we can see that \(\tilde{\alpha }_k\) approximates \(1/\lambda _n\) with satisfactory accuracy within a few iterations. However, \(\bar{\alpha }_{k}\) converges to \(1/\lambda _1\) even more slowly than the gradient norm decreases. This, to some extent, explains why we prefer the short stepsize \(\tilde{\alpha }_k\) to the long stepsize \(\bar{\alpha }_k\).

Fig. 2 Problem (92) with \(n=1,000\): convergence history of the sequences \(\{\tilde{\alpha }_k\}\) and \(\{\bar{\alpha }_k\}\) for the first 5,000 iterations of the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method)
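The behavior in Fig. 2 can be reproduced, up to the random data, with a short script along the following lines. This is a minimal sketch of the experiment, assuming \(A=\text {diag}(a_1,\ldots ,a_n)\) and \(b=0\) as in (92), with the entries of \(H^k\) in (86) evaluated for \(\varPsi (A)=A\) and \(r=1/2\); the seed, printing schedule and variable names are our own illustrative choices, not the authors' code.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = np.concatenate(([1.0], rng.uniform(1.0, n, n - 2), [float(n)]))  # diagonal of A, a_1 = 1, a_n = n
x = rng.uniform(-10.0, 10.0, n)       # b = 0, so g = A x and the minimizer is x* = 0
g = a * x
gnorm0 = np.linalg.norm(g)
alpha_prev, g_prev = None, None
for k in range(5000):
    if np.linalg.norm(g) <= 1e-14 * gnorm0:
        break
    gAg, gA2g = g @ (a * g), g @ (a * a * g)
    alpha = gAg / gA2g                                    # MG stepsize (6)
    if alpha_prev is not None:
        H11, H22 = 1.0 / alpha_prev, 1.0 / alpha          # entries of (86) with Psi(A) = A, r = 1/2
        H12 = -gAg / (alpha_prev * np.sqrt((g_prev @ (a * g_prev)) * gAg))
        disc = np.sqrt((H11 - H22) ** 2 + 4.0 * H12 ** 2)
        alpha_tilde = 2.0 / (H11 + H22 + disc)            # (85): tends to 1/lambda_n = 1/n
        alpha_bar = 2.0 / (H11 + H22 - disc)              # Theorem 6: tends to 1/lambda_1 = 1
        if k % 500 == 0:
            print(k, alpha_tilde, alpha_bar)
    x, g_prev, alpha_prev = x - alpha * g, g, alpha       # gradient step (2)
    g = a * x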

3.3 A Periodic Gradient Method

A method that alternately uses \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\) to minimize a two-dimensional quadratic function monotonically decreases the objective value and terminates within 3 iterations. However, for minimizing a general n-dimensional quadratic function, this alternating scheme may not be efficient at eliminating the component \(\mu _k^{(n)}\). One possible reason is that, as shown in Fig. 2, \(\tilde{\alpha }_k\) needs tens of iterations to approximate \(1/\lambda _n\) with satisfactory accuracy. In what follows, by incorporating the BB method, we develop an efficient periodic gradient method using \(\tilde{\alpha }_k\).

Figure 3 compares the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) against a variant that first takes 20 BB2 steps and then MG steps for solving problem (56). We can see that, by using some BB2 steps, the modified MG method is accelerated and the stepsize \(\tilde{\alpha }_k\) approximates \(1/\lambda _n\) with better accuracy. Thus, our method will run some BB steps first. Now, we investigate the effect of reusing a short stepsize on the performance of the gradient method (11). Suppose that we have a good approximation of \(1/\lambda _n\), say \(\alpha =\frac{1}{\lambda _n+10^{-6}}\). We compare the MG method with two variants that apply (i) \(\alpha _0=\alpha \) or (ii) \(\alpha _0=\ldots =\alpha _9=\alpha \) before switching to the MG stepsize. Figure 4 shows that reusing \(\alpha \) accelerates the MG method. Hence, we prefer to reuse \(\tilde{\alpha }_k\) for some consecutive steps when \(\tilde{\alpha }_k\) is a good approximation of \(1/\lambda _n\). Finally, our new method is summarized in Algorithm 1, which periodically applies the BB stepsize, \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\). The R-linear global convergence of Algorithm 1 for solving (3) can be established by showing that it satisfies the property in [5]; see the proof of Theorem 3 in [7], for example.

Fig. 3 Problem (56) with \(n=10\): convergence history of objective values and stepsizes

Fig. 4 Problem (56) with \(n=10\): the MG method (i.e., \(\varPsi (A)=A\)) with different stepsizes

Algorithm 1 A periodic gradient method that cyclically applies the BB stepsize, \(\alpha _k\) in (11) and the new stepsize \(\tilde{\alpha }_k\)
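To make the periodic rule concrete, the following sketch implements one possible reading of Algorithm 1 in its BB2MG flavor: \(K_b\) BB2 steps, then \(K_m\) MG steps, then one \(\tilde{\alpha }_k\)-step computed via the formula (95) derived below and reused for the remaining steps of the cycle, assuming \(A=\text {diag}(a)\) and \(b=0\). It is an illustration under these assumptions, not the authors' Matlab implementation; the fallback to an MG step at \(k=0\) and the default parameters are our own choices.

import numpy as np

def periodic_gradient_mg(a, x0, Kb=60, Km=60, Ks=40, tol=1e-9, max_iter=20000):
    """Sketch of the periodic rule (93) with Psi(A) = A and the BB2 stepsize.
    A = diag(a), b = 0.  Returns the final iterate and the iteration count."""
    period = Kb + Km + Ks
    x = np.asarray(x0, dtype=float).copy()
    g = a * x
    gnorm0 = np.linalg.norm(g)
    x_prev = g_prev = alpha_prev = None
    alpha = None
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol * gnorm0:
            return x, k
        j = k % period
        gAg, gA2g = g @ (a * g), g @ (a * a * g)
        if j < Kb and g_prev is not None:                # BB2 stepsize: s^T y / y^T y
            s, y = x - x_prev, g - g_prev
            alpha = (s @ y) / (y @ y)
        elif j == Kb + Km and alpha_prev is not None:    # one alpha_tilde step via (95)
            h1, h2 = 1.0 / alpha_prev, gA2g / gAg        # 1/alpha_{k-1}^{MG}, 1/alpha_k^{MG}
            rad = (h1 - h2) ** 2 + 4.0 * gAg / (alpha_prev ** 2 * (g_prev @ (a * g_prev)))
            alpha = 2.0 / (h1 + h2 + np.sqrt(rad))
        elif j <= Kb + Km:                               # MG stepsize, i.e. (11) with Psi(A) = A
            alpha = gAg / gA2g
        # otherwise: keep the previous alpha, i.e. reuse the last alpha_tilde
        x_prev, g_prev, alpha_prev = x, g, alpha
        x = x - alpha * g
        g = a * x
    return x, max_iter

For instance, the comparison reported in Table 1 corresponds to calls such as periodic_gradient_mg(a, x0, Kb=50, Km=60, Ks=10), except that there \(\varPsi (A)=I\) is used instead of \(\varPsi (A)=A\).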

Remark 5

The BB steps in Algorithm 1 can employ either the BB1 or the BB2 stepsize in (7). The idea of using short stepsizes to eliminate the component \(\mu _k^{(n)}\) has been investigated in [12, 13, 20]. However, these methods are based on the SD method, i.e., they occasionally apply short steps during the iterates of the SD method. One exception is given by [21], where a method is developed by employing new stepsizes during the iterates of the AOPT method. In contrast, our method periodically uses three different stepsizes: the BB stepsize, the stepsize (11), and the new stepsize \(\tilde{\alpha }_k\).

The following example shows that both the BB steps and the \(\tilde{\alpha }_k\)-steps are important to the efficiency of our method. In particular, we consider \(\varPsi (A)=I\) for \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\). We applied Algorithm 1 to a 1000-dimensional quadratic problem of the form (92) with \(a_i=11i-10\) (see [11]). The iteration was stopped once the gradient norm was reduced by a factor of \(\epsilon \). Five different values of the parameter triple \((K_b,K_m,K_s)\) were tested. Table 1 presents the number of iterations required by the method to meet a given tolerance, averaged over ten different initial points randomly chosen from \([-10,10]\). We see that, when Algorithm 1 takes no BB steps, i.e., using (0, 60, 10), it performs better than using (10, 0, 0) for a tight tolerance. Notice that Algorithm 1 with (10, 0, 0) reduces to the BB method with \(\alpha _k^{BB1}\). If Algorithm 1 takes no \(\tilde{\alpha }_k\)-steps, i.e., using (50, 60, 0), it is comparable to the case (0, 60, 10). However, when both BB steps and \(\tilde{\alpha }_k\)-steps are taken, i.e., using (50, 60, 10), Algorithm 1 clearly outperforms the other choices, especially for a tight tolerance.

Table 1 Average number of iterations required by Algorithm 1 with different values of \((K_b,K_m,K_s)\)

4 Numerical Experiments

In this section, we present numerical comparisons of Algorithm 1 and the following methods: BB with \(\alpha _k^{BB1}\) [2], DY [11], ABBmin2 [19], SDC [12], and Alg. 1 in [30] (A1 for short).

Notice that the stepsize rule for Algorithm 1 can be written as

$$\begin{aligned} \alpha _{k}={\left\{ \begin{array}{ll} \alpha _{k}^{BB},& \text {if}\, \text {mod}(k,K_b+K_m+K_s)<K_b; \\ \alpha _{k}(\varPsi (A)),& \text {if}\,K_b\le \text {mod}(k,K_b+K_m+K_s)<K_b+K_m; \\ \tilde{\alpha }_{k}(\varPsi (A)),& \text {if}\,\text {mod}(k,K_b+K_m+K_s)=K_b+K_m; \\ \alpha _{k-1},& \text {otherwise}, \end{array}\right. } \end{aligned}$$
(93)

where \(\alpha _k^{BB}\) can be either \(\alpha _k^{BB1}\) or \(\alpha _k^{BB2}\), and \(\alpha _{k}(\varPsi (A))\) and \(\tilde{\alpha }_{k}(\varPsi (A))\) are the stepsizes given by (11) and (85), respectively. We tested the following four variants of Algorithm 1, using combinations of the two BB stepsizes and \(\varPsi (A)=I\) or A:

  • BB1SD: \(\alpha _k^{BB1}\) and \(\varPsi (A)=I\) in (93)

  • BB1MG: \(\alpha _k^{BB1}\) and \(\varPsi (A)=A\) in (93)

  • BB2SD: \(\alpha _k^{BB2}\) and \(\varPsi (A)=I\) in (93)

  • BB2MG: \(\alpha _k^{BB2}\) and \(\varPsi (A)=A\) in (93)

Now we derive a formula for the case \(\varPsi (A)=A\), i.e., \(\alpha _{k}(\varPsi (A))=\alpha _{k}^{MG}\). If we set \(r=0\), by (85), we have

$$\begin{aligned} \tilde{\alpha }_k=\frac{2}{\left( \frac{1}{\alpha _{k-1}^{SD}}+ \frac{g_k ^T A^3g_k}{g_k ^T A^2g_k}\right) + \sqrt{\left( \frac{1}{\alpha _{k-1}^{SD}}- \frac{g_k ^T A^3g_k}{g_k ^T A^2g_k}\right) ^2+ \frac{4(g_k ^T Ag_k)^2}{(\alpha _{k-1}^{MG})^2\Vert g_{k-1}\Vert ^2g_k ^T A^2g_k}}}, \end{aligned}$$
(94)

which is expensive to compute directly. However, if we set \(r=1/2\), we get

$$\begin{aligned} \tilde{\alpha }_{k}= \frac{2}{\frac{1}{\alpha _{k-1}^{MG}}+\frac{1}{\alpha _k^{MG}}+ \sqrt{\left( \frac{1}{\alpha _{k-1}^{MG}}-\frac{1}{\alpha _k^{MG}}\right) ^2 +\frac{4g_k ^T A g_k}{(\alpha _{k-1}^{MG})^2g_{k-1} ^T A g_{k-1}}}}. \end{aligned}$$
(95)

This formula can be computed without additional cost because \(g_{k-1} ^T A g_{k-1}\) and \(g_k ^T A g_k\) have been obtained when computing the stepsizes \(\alpha _{k-1}^{MG}\) and \(\alpha _k^{MG}\).
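For concreteness, (95) can be evaluated from four scalars that the two MG steps already provide; the following lines are a minimal sketch of that computation (the function and variable names are ours).

def alpha_tilde_mg(alpha_prev_mg, alpha_mg, gAg_prev, gAg):
    # (95): alpha_prev_mg = alpha_{k-1}^{MG}, alpha_mg = alpha_k^{MG},
    # gAg_prev = g_{k-1}^T A g_{k-1}, gAg = g_k^T A g_k.
    h1, h2 = 1.0 / alpha_prev_mg, 1.0 / alpha_mg
    rad = (h1 - h2) ** 2 + 4.0 * gAg / (alpha_prev_mg ** 2 * gAg_prev)
    return 2.0 / (h1 + h2 + rad ** 0.5)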

All the methods under consideration were implemented in Matlab (v.9.0-R2016a) and run on a PC with an Intel Core i7 2.9 GHz processor and 8 GB of RAM under Windows 10. We stopped each algorithm when the number of iterations exceeded 20,000 or the gradient norm was reduced by a factor of \(\epsilon \).

We tested five different sets of problems of the form (92) with the spectral distributions described in Table 2; see [21, 30], for example. The rand function in Matlab was used to generate the Hessians of the first three problem sets. To investigate the performance of the compared methods for different values of the Hessian condition number \(\kappa \) and the tolerance factor \(\epsilon \), we set \(\kappa \) to \(10^4, 10^5, 10^6\) and \(\epsilon \) to \(10^{-6}, 10^{-9}, 10^{-12}\), respectively. For each value of \(\kappa \) or \(\epsilon \), average results over 10 different starting points with entries randomly generated in \([-10,10]\) are presented in the following tables.

Table 2 Distributions of \(a_j\)

The parameter triple \((K_b,K_m,K_s)\) of our four methods was set to (60, 60, 40) for the first three problem sets, and (100, 60, 50) for the last two problem sets. As in [19], the parameter \(\tau \) of the ABBmin2 method was set to 0.9. For the SDC and A1 methods, we chose parameters to achieve their best performance. In particular, the parameter pair \((h,s)\) of SDC was set to (8, 6) for the first three problem sets and (30, 2) for the last two problem sets, while m of A1 was set to 6.

Table 3 Average number of iterations required by the compared methods on the five problem sets in Table 2

Table 3 shows the average number of iterations required by the compared methods on the five problem sets listed in Table 2. We can see that, for the first problem set, our proposed methods perform much better than the BB, DY, SDC and A1 methods, while the ABBmin2 method is surprisingly efficient among the compared methods. Similar results can be observed for the fourth problem set, where our proposed methods are comparable to ABBmin2 and faster than the other compared methods. Results on the second problem set differ slightly: A1 becomes the winner, and our proposed methods outperform the BB, DY, ABBmin2 and SDC methods. For the third and last problem sets, the results show the advantage of our proposed methods over the other compared methods, especially when the condition number is large and a tight tolerance is required.

5 Conclusions and Discussions

We present theoretical analyses on the asymptotic behaviors of a family of gradient methods whose stepsize is given by (11), which includes the steepest descent and minimal gradient methods as special cases. It is shown that each method in this family will asymptotically zigzag in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian. In order to accelerate the gradient methods, we exploit the spectral property of a new stepsize to break the zigzagging pattern. This new stepsize is derived by imposing finite termination on minimizing two-dimensional strongly convex quadratics and is proved to converge to the reciprocal of the largest eigenvalue of the Hessian in the general n-dimensional case. Finally, we propose a very efficient periodic gradient method that alternately uses the BB stepsize, \(\alpha _k\) in (11) and our new stepsize. Our numerical results indicate that, by exploiting the asymptotic behavior and spectral properties of stepsizes, gradient methods can be greatly accelerated to outperform the BB method and other recently developed state-of-the-art gradient methods.

One may also break the zigzagging pattern by employing the spectral property in (47). In particular, we could use the following stepsize

$$\begin{aligned} \hat{\alpha }_k=\left( \frac{1}{\alpha _{2k}}+\frac{1}{\alpha _{2k+1}}\right) ^{-1}, \end{aligned}$$
(96)

which, by (47), satisfies that

$$\begin{aligned} \lim _{k\rightarrow \infty }\hat{\alpha }_k = \frac{1}{\lambda _{1}+\lambda _{n}}. \end{aligned}$$

Hence, \(\hat{\alpha }_k\) is also a good approximation of \(1/\lambda _n\) when the condition number \(\kappa =\lambda _n/\lambda _1\) is large; see [13] for the case of the SD method.
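In code, (96) amounts to a harmonic-type combination of two consecutive stepsizes of the method (11); a minimal sketch (the function and argument names are ours):

def alpha_hat(alpha_even, alpha_odd):
    # (96): combines alpha_{2k} and alpha_{2k+1}; by (47) the result
    # tends to 1/(lambda_1 + lambda_n), a short stepsize when kappa is large.
    return 1.0 / (1.0 / alpha_even + 1.0 / alpha_odd)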