Abstract
The quadratic termination property is important to the efficiency of gradient methods. We consider equipping a family of gradient methods, where the stepsize is given by the ratio of two norms, with two dimensional quadratic termination. Such a desired property is achieved by cooperating with a new stepsize which is derived by maximizing the stepsize of the considered family in the next iteration. It is proved that each method in the family will asymptotically alternate in a two dimensional subspace spanned by the eigenvectors corresponding to the largest and smallest eigenvalues. Based on this asymptotic behavior, we show that the new stepsize converges to the reciprocal of the largest eigenvalue of the Hessian. Furthermore, by adaptively taking the long Barzilai–Borwein stepsize and reusing the new stepsize with retard, we propose an efficient gradient method for unconstrained quadratic optimization. We prove that the new method is R-linearly convergent with a rate of \(1-1/\kappa\), where \(\kappa\) is the condition number of Hessian. Numerical experiments show the efficiency of our proposed method.
1 Introduction
The gradient method is well known for minimizing a smooth function \(f(x): \mathbb {R}^{n} \rightarrow \mathbb {R}\); it updates the iterates by
$$\begin{aligned} x_{k+1}=x_{k}-\alpha _{k} g_{k}, \end{aligned}$$(1)
where \(g_{k} = \nabla f\left( x_{k}\right)\) and the stepsize \(\alpha _{k}>0\) depends on the method under consideration. The classic steepest descent (SD) [1] and minimal gradient (MG) [2] methods determine \(\alpha _{k}\) by minimizing \(f\left( x_{k}-\alpha g_{k}\right)\) and \(\left\| g\left( x_{k}-\alpha g_{k}\right) \right\| _2\), respectively. Theoretically, the SD and MG methods will asymptotically zigzag between two directions, which often yields poor performance on many problems [3,4,5,6,7].
In 1988, Barzilai and Borwein [8] proposed the following two novel stepsizes from the view of quasi-Newton methods,
$$\begin{aligned} \alpha _{k}^{BB1}=\frac{s_{k-1}^{T} s_{k-1}}{s_{k-1}^{T} y_{k-1}} \end{aligned}$$(2)
and
$$\begin{aligned} \alpha _{k}^{BB2}=\frac{s_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}, \end{aligned}$$(3)
where \(s_{k-1}=x_{k}-x_{k-1}\) and \(y_{k-1}=g_{k}-g_{k-1}\). Apparently, \(\alpha _{k}^{BB1} \ge \alpha _{k}^{BB2}\) follows from the Cauchy–Schwarz inequality if \(s_{k-1}^{T}y_{k-1} > 0\). When f(x) is a quadratic function
$$\begin{aligned} f(x)=\frac{1}{2} x^{T} A x-b^{T} x, \end{aligned}$$(4)
where \(A\in \mathbb {R}^{n\times n}\) is a real symmetric positive definite matrix and \(b\in \mathbb {R}^{n}\), \(\alpha _{k}^{B B 1}\) and \(\alpha _{k}^{B B 2}\) can be regarded as the SD and MG stepsizes with retard, respectively, i.e.
$$\begin{aligned} \alpha _{k}^{BB1}=\frac{g_{k-1}^{T} g_{k-1}}{g_{k-1}^{T} A g_{k-1}}=\alpha _{k-1}^{SD} \quad \text{ and } \quad \alpha _{k}^{BB2}=\frac{g_{k-1}^{T} A g_{k-1}}{g_{k-1}^{T} A^{2} g_{k-1}}=\alpha _{k-1}^{MG}. \end{aligned}$$
Barzilai and Borwein [8] proved that the BB method is R-superlinearly convergent for two dimensional strictly convex quadratic functions. It has been shown that the BB method is globally convergent [9] with an R-linear rate [10] in the n dimensional case. Although the BB method is nonmonotone, extensive numerical results indicate that it performs much better than the SD method [11,12,13]. See [14,15,16,17,18,19,20,21] for more BB-like methods.
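As a concrete illustration, the following sketch (our own code, not the paper's) applies the BB1 method to a diagonal quadratic; for quadratics the BB1 stepsize equals the SD stepsize evaluated at the previous gradient, which is what the code uses.

```python
# Illustrative sketch (not the paper's code): the BB1 method on a diagonal
# quadratic f(x) = 0.5 x^T A x with A = diag(diag_a); the minimizer is x* = 0.

def bb1_quadratic(diag_a, x0, tol=1e-8, max_iter=1000):
    x = list(x0)
    g = [a * xi for a, xi in zip(diag_a, x)]          # gradient g = A x
    g0_norm = sum(gi * gi for gi in g) ** 0.5
    g_prev = None
    for k in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 <= tol * g0_norm:
            return x, k
        if g_prev is None:
            # first iteration: exact steepest-descent stepsize g^T g / g^T A g
            ref = g
        else:
            # BB1 = s^T s / s^T y, which for quadratics equals the SD stepsize
            # with retard: g_{k-1}^T g_{k-1} / g_{k-1}^T A g_{k-1}
            ref = g_prev
        alpha = sum(gi * gi for gi in ref) / sum(a * gi * gi for a, gi in zip(diag_a, ref))
        g_prev = g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        g = [a * xi for a, xi in zip(diag_a, x)]
    return x, max_iter

x_min, iters = bb1_quadratic([float(i) for i in range(1, 11)], [1.0] * 10)
print(iters)   # converges well before max_iter
```

Despite the nonmonotone behavior of \(\Vert g_k\Vert\), the iteration count is typically far smaller than that of the SD method on the same problem.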
In [22], Yuan derived a new stepsize, which together with the SD method produces the minimizer of a two dimensional strictly convex quadratic function in three iterations. In what follows, if a method can give the exact minimizer of a two dimensional convex quadratic function within finitely many iterations, we say it has the property of two dimensional quadratic termination. Based on the following variant of the Yuan stepsize,
Dai and Yuan [23] suggested the so-called Dai–Yuan (DY) gradient method with
Clearly, \(\alpha _{k}^{DY}\le \min \{\alpha _{k}^{SD},\alpha _{k-1}^{SD}\}\), which implies that the DY method is monotone. Interestingly, the DY method can even outperform the nonmonotone BB method. Recently, Huang et al. [5] derived a new stepsize, say \(\alpha _{k}^{H}\), such that the gradient method
together with \(\alpha _{k}^{H}\) achieves the two dimensional quadratic termination, where \(\psi\) is a real analytic function on \([\lambda _{1},\lambda _{n}]\) and can be expressed by Laurent series
such that \(0<\sum _{k=-\infty }^{\infty } d_{k} z^{k}<+\infty\) for all \(z \in \left[ \lambda _{1}, \lambda _{n}\right]\). Here, \(\lambda _{1}\) and \(\lambda _{n}\) are the smallest and largest eigenvalues of A, respectively. Furthermore, \(\alpha _{k}^{H}\) reduces to \(\alpha _{k}^{DY}\) when \(\psi (A)=I\). The property of two dimensional quadratic termination has shown great potential in improving the performance of gradient methods, see [5, 24, 25] for example.
To our knowledge, there is still a lack of theoretical analysis for the two dimensional quadratic termination of the Dai–Yang method [26], whose stepsize is given by
$$\begin{aligned} \alpha _{k}^{P}=\frac{\left\| g_{k}\right\| _{2}}{\left\| Ag_{k}\right\| _{2}}. \end{aligned}$$
A remarkable property of the Dai–Yang method is that \(\alpha _{k}^{P}\) converges to the optimal stepsize \(\frac{2}{\lambda _{1}+\lambda _{n}}\) (in the sense that it minimizes the modulus \(\Vert I-\alpha A\Vert _{2}\), see [26, 27]). Moreover, the Dai–Yang method is able to find the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\).
In this paper, for a uniform analysis, we consider equipping the family
$$\begin{aligned} \alpha _{k}=\frac{\left\| \psi (A) g_{k}\right\| _{2}}{\left\| A \psi (A) g_{k}\right\| _{2}} \end{aligned}$$(9)
with the two dimensional quadratic termination property, which will be achieved by incorporating the new stepsize
$$\begin{aligned} \tilde{\alpha }_{k}=\arg \max _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}. \end{aligned}$$(10)
The above strategy of maximizing the stepsize value in the next iteration has been employed in [28] for the SD method. However, the analysis in [28] cannot be directly applied to the family (9). Clearly, \(\alpha _{k}^{P}\) corresponds to the case \(\psi (A) = I\) in (9). We prove that each method in the family (9) will asymptotically alternate in a two dimensional subspace associated with the two eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\). In addition, for any given \(\psi\), the stepsize (9) tends to the above optimal stepsize as \(k\rightarrow \infty\), and the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\) can be obtained. Then, we show that \(\lim _{k \rightarrow \infty }\tilde{\alpha }_{k} =1 /\lambda _{n}\) for n dimensional strictly convex quadratics. By adaptively taking \(\alpha _{k}^{BB1}\) and reusing \(\tilde{\alpha }_{k-1}\) for some iterations, we propose a new method for quadratic minimization problems. It is proved that the proposed method is R-linearly convergent at the rate \(1-1/\kappa\), where \(\kappa =\lambda _{n} / \lambda _{1}\) is the condition number of A. Our numerical comparisons with the BB1 [8], DY [23], SL (Alg.1 in [25]), ABBmin2 [28], SDC [29], and MGC [7] methods for solving unconstrained random and non-random quadratic optimization problems demonstrate that the proposed method is very efficient. Further, numerical experiments on quadratic problems whose Hessians are chosen from the SuiteSparse Matrix Collection [30] suggest that the proposed method is very competitive with the above methods.
The paper is organized as follows. In Sect. 2, we derive the new stepsize \(\tilde{\alpha }_{k}\) and analyze its properties. The asymptotic behavior of the family (9) is also analyzed. Our new algorithm for quadratic minimization problems as well as its R-linear convergence are presented in Sect. 3. Section 4 presents some numerical comparisons of the proposed method and other successful gradient methods on solving quadratic problems. Finally, in Sect. 5 we give some concluding remarks.
2 A new stepsize and its properties
In this section, we derive the formula of \(\tilde{\alpha }_{k}\) and analyze its properties.
To obtain \(\tilde{\alpha }_{k}\), we consider maximizing the square of the next stepsize in (9), i.e.
$$\begin{aligned} F(\alpha _{k})=\frac{g_{k+1}^{T} \psi ^{2}(A) g_{k+1}}{g_{k+1}^{T} A^{2} \psi ^{2}(A) g_{k+1}},\quad g_{k+1}=\left( I-\alpha _{k} A\right) g_{k}. \end{aligned}$$(11)
The maximum value of \(F(\alpha _{k})\) is achieved when \(F'(\alpha _{k}) = 0\), which holds for any \({\alpha }_{k}\) satisfying
$$\begin{aligned} \phi _{1} \alpha _{k}^{2}-\phi _{2} \alpha _{k}+\phi _{3}=0, \end{aligned}$$(12)
where \(\phi _{1}=c_{1} c_{4}-c_{2} c_{3}\), \(\phi _{2}=c_{0} c_{4}-c_{2}^{2}\) and \(\phi _{3}=c_{0} c_{3}-c_{1}c_{2}\) with
$$\begin{aligned} c_{j}=g_{k}^{T} \psi ^{2}(A) A^{j} g_{k},\quad j=0,1,\ldots ,4. \end{aligned}$$(13)
The following lemma guarantees that (12) has two roots.
Lemma 1
Assume \(g_{k} \ne 0\) and \(g_{k}\) is not parallel to \(Ag_{k}\). Then, \(\phi _{1}, \phi _{2}, \phi _{3} > 0\) and \(\phi _{2}^2 - 4\phi _{1}\phi _{3} > 0\).
Proof
It follows from the Cauchy–Schwarz inequality and (13) that
$$\begin{aligned} c_{j+1}^{2}=\left( g_{k}^{T} \psi ^{2}(A) A^{j+1} g_{k}\right) ^{2}<\left( g_{k}^{T} \psi ^{2}(A) A^{j} g_{k}\right) \left( g_{k}^{T} \psi ^{2}(A) A^{j+2} g_{k}\right) =c_{j} c_{j+2} \end{aligned}$$
for \(j\ge 0\). That is, \(c_{j} / c_{j+1} > c_{j+1}/ c_{j+2}\), which implies \(\phi _{1}, \phi _{2}, \phi _{3} > 0.\)
By direct calculation, we obtain \(c_{1} \phi _{2} = c_{0} \phi _{1} + c_{2} \phi _{3}\), which implies that
$$\begin{aligned} c_{1}^{2} \phi _{2}^{2}=\left( c_{0} \phi _{1}+c_{2} \phi _{3}\right) ^{2} \ge 4 c_{0} c_{2} \phi _{1} \phi _{3}. \end{aligned}$$
Combining with \(c_{0} c_{2}>c_{1}^{2}\), we have \(\phi _{2}^2 - 4\phi _{1}\phi _{3} > 0\). This completes the proof.\(\square\)
By the quadratic formula, we get the two roots of (12) as
$$\begin{aligned} \tilde{\alpha }_{k}=\frac{\phi _{2}-\sqrt{\phi _{2}^{2}-4 \phi _{1} \phi _{3}}}{2 \phi _{1}} \end{aligned}$$(14)
and
$$\begin{aligned} \hat{\alpha }_{k}=\frac{\phi _{2}+\sqrt{\phi _{2}^{2}-4 \phi _{1} \phi _{3}}}{2 \phi _{1}}. \end{aligned}$$(15)
It is easy to see that \(\tilde{\alpha }_{k} = \arg \max _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}\) and \(\hat{\alpha }_{k} =\arg \min _{\alpha _{k} \in \mathbb {R}}~\alpha _{k+1}\).
The next theorem presents the two dimensional quadratic termination of the gradient method using \(\alpha _{k}\) in (9) and \(\tilde{\alpha }_{k}\).
Theorem 1
(Two dimensional quadratic termination) Consider the gradient method (1) for minimizing the two dimensional quadratic function (4). If the stepsize \(\alpha _{k}\) is given by (9) for all \(k \ne k_{0}\) and \(\alpha _{k_{0}}=\tilde{\alpha }_{k_{0}}\) at the \(k_{0}\)-th iteration where \(k_{0} \ge 1\), it holds that \(g_{k_{0}+i}=0\) for some \(1 \le i \le 3\).
Proof
Without loss of generality, we assume that \(A={\text {diag}}\{1, \lambda \}\) with \(\lambda >0\). Let \(g_{k}^{(1)}\) and \(g_{k}^{(2)}\) be the first and second components of \(g_{k}\), respectively. Notice that
$$\begin{aligned} c_{j}=\psi ^{2}(1)\left( g_{k}^{(1)}\right) ^{2}+\lambda ^{j} \psi ^{2}(\lambda )\left( g_{k}^{(2)}\right) ^{2},\quad j=0,1,\ldots ,4. \end{aligned}$$
After direct calculation and simplification, we get
$$\begin{aligned} \phi _{1}=\lambda (\lambda +1)(\lambda -1)^{2}\omega _{k},\quad \phi _{2}=(\lambda +1)^{2}(\lambda -1)^{2}\omega _{k},\quad \phi _{3}=(\lambda +1)(\lambda -1)^{2}\omega _{k}, \end{aligned}$$
where \(\omega _{k}=\psi ^{2}(1)\psi ^{2}(\lambda )\left( g_{k}^{(1)}\right) ^{2}\left( g_{k}^{(2)}\right) ^{2}\).
Therefore, we obtain
$$\begin{aligned} \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}=(\lambda +1)(\lambda -1)^{3}\omega _{k}. \end{aligned}$$
Thus, from (14) we know that \(\tilde{\alpha }_{k} = 1/\lambda\) for all \(k\ge 1\). The conclusion follows immediately from \(\alpha _{k_{0}}=\tilde{\alpha }_{k_{0}}\) and \(g_{k+1}=\left( I-\alpha _{k}A\right) g_{k}\). We complete the proof.\(\square\)
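The identity \(\tilde{\alpha }_{k}=1/\lambda\) is easy to check numerically. The sketch below (our own code, taking \(\psi (A)=I\), i.e. the Dai–Yang case) computes the moments \(c_{j}\), the \(\phi\)'s, and the smaller root of the resulting quadratic for \(A={\text {diag}}\{1,\lambda \}\); the result equals \(1/\lambda\) regardless of the gradient:

```python
# Check of the 2D identity behind Theorem 1, with psi(A) = I (our own sketch):
# the smaller root of phi1*a^2 - phi2*a + phi3 = 0 equals 1/lam for ANY g.

def alpha_tilde_2d(g1, g2, lam):
    # moments c_j = g^T A^j g for A = diag(1, lam)
    c = [g1 * g1 + lam ** j * g2 * g2 for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] * c[2]
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = phi2 * phi2 - 4.0 * phi1 * phi3    # positive by Lemma 1
    return (phi2 - disc ** 0.5) / (2.0 * phi1)

print(alpha_tilde_2d(2.0, 3.0, 5.0))   # 0.2 = 1/5, independent of the gradient
print(alpha_tilde_2d(1.0, 7.0, 5.0))   # also 0.2
```

Changing the gradient components only rescales \(\omega _{k}\), so the two roots \(1/\lambda\) and 1 are unchanged, exactly as the proof shows.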
In what follows, we shall prove that \(\tilde{\alpha }_{k}\) converges to \(1/\lambda _{n}\) under each method in the family (9). To this aim, we have to analyze the asymptotic behavior of the family (9) first.
For convenience, we assume without loss of generality that the matrix A is diagonal with distinct eigenvalues, i.e.
$$\begin{aligned} A={\text {diag}}\left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\right\} ,\quad 0<\lambda _{1}<\lambda _{2}<\cdots <\lambda _{n}. \end{aligned}$$(16)
Let \(\left\{ \xi _{1}, \xi _{2}, \ldots , \xi _{n}\right\}\) be the set of orthogonal eigenvectors associated with the eigenvalues \(\left\{ \lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\right\}\). Denote by \(\mu _{k}^{(i)}, i=1, \ldots , n\), the components of \(g_{k}\) along \(\xi _{i}\), i.e.
$$\begin{aligned} g_{k}=\sum _{i=1}^{n} \mu _{k}^{(i)} \xi _{i}. \end{aligned}$$(17)
It follows from (1) and (17) that
$$\begin{aligned} \mu _{k+1}^{(i)}=\left( 1-\alpha _{k} \lambda _{i}\right) \mu _{k}^{(i)}, \end{aligned}$$(18)
where
$$\begin{aligned} \alpha _{k}=\left( \frac{\sum _{i=1}^{n} \psi ^{2}\left( \lambda _{i}\right) \left( \mu _{k}^{(i)}\right) ^{2}}{\sum _{i=1}^{n} \lambda _{i}^{2} \psi ^{2}\left( \lambda _{i}\right) \left( \mu _{k}^{(i)}\right) ^{2}}\right) ^{\frac{1}{2}}. \end{aligned}$$(19)
The following lemma is useful in our analysis.
Lemma 2
[26] Let p be a vector in \(\mathbb {R}^{n}\) such that (i) \(p^{(1)}>0\) and \(p^{(n)}>0\); (ii) \(p^{(1)} + p^{(n)} = 1\). Further assume that \(0<\lambda _{1}<\cdots <\lambda _{n}\). Consider a transformation T such that
Then
where
Based on Lemma 2, we are able to show that each method in the family (9) will zigzag in a two dimensional subspace, which generalizes the results in [26], where the case \(\psi (A)=I\) (i.e. the Dai–Yang method) is considered.
Theorem 2
Assume that the starting point \(x_{0}\) is such that
Let \(\{x_{k} \}\) be the sequence generated by applying a method in the family (9). Then
and
where \(h_{1}\) and \(h_{2}\) are defined in (20). Further, the vectors
tend to be the eigenvectors corresponding to \(\lambda _{1}\) and \(\lambda _{n}\) of A, respectively.
Proof
By (17), we get
where \(\eta _{k}^{(i)} = \psi (\lambda _{i}) \mu _{k}^{(i)}\), which together with (19) yields
Define the vector \(p_{k}=\left( p_{k}^{(i)}\right)\) with
and
Then we have from (24), (25) and (26) that
Clearly, according to the definition of \(p_{k}\), we get that \(p_{k}^{(i)}\ge 0\) for all i and
Let \(p=p_{1} \in \mathbb {R}^{n}\). Based on the above analysis and Lemma 2, we know that \(\lim _{k \rightarrow \infty } p_{k}=\left( h_{1}, 0, \ldots , 0, h_{2}\right) ^{T}\), where \(h_{1}\) and \(h_{2}\) are given in (20). It follows from (24) and \(\lambda _{n}^{-1}<\alpha _{k}<\lambda _{1}^{-1}\) that
and
Thus, by Lemma 2, (27) and (28), we know that (21) and (22) hold.
Furthermore, combining (23), (27) and (28), we find that
and
This completes our proof.\(\square\)
From Theorem 2, we have the following asymptotic result of the stepsize (9).
Corollary 1
Under the conditions of Theorem 2, for \(\alpha _{k}\) in (9) it holds that
$$\begin{aligned} \lim _{k \rightarrow \infty } \alpha _{k}=\frac{2}{\lambda _{1}+\lambda _{n}}. \end{aligned}$$(29)
Next theorem shows that \(\tilde{\alpha }_{k}\) converges to \(1/\lambda _{n}\) under each method in the family (9).
Theorem 3
Under the conditions of Theorem 2, let \(\{g_{k}\}\) be the sequence generated by applying a method in the family (9) to minimize the n-dimensional quadratic function (4). Then \(\lim _{k \rightarrow \infty }\tilde{\alpha }_{k}=1/\lambda _{n}\).
Proof
When k is odd, by the definition of \(\phi _{1}\), (22) and (29), we get
Similarly, we have
and
Combining (30), (31) and (32), we obtain
$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\phi _{1}}{\phi _{3}}=\lambda _{1} \lambda _{n}\quad \text{ and } \quad \lim _{k \rightarrow \infty } \frac{\phi _{2}}{\phi _{3}}=\lambda _{1}+\lambda _{n}. \end{aligned}$$(33)
It follows from (33) and (14) that \(\lim _{k \rightarrow \infty } \tilde{\alpha }_{k}=1/\lambda _{n}\). When k is even, we get the desired result in the same manner as above. This completes our proof. \(\square\)
From (33), we see that the limits of \(\phi _{1}/\phi _{3}\) and \(\phi _{2}/\phi _{3}\) are independent of \(\psi (A)\). The following example shows that \(\tilde{\alpha }_{k}\) converges to \(1 / \lambda _{n}\) under both the MG and Dai–Yang methods. In particular, we applied the MG and Dai–Yang methods to the quadratic function (4) with
$$\begin{aligned} A={\text {diag}}\left\{ a_{1}, a_{2}, \ldots , a_{n}\right\} , \end{aligned}$$(34)
where \(a_{1}=1,~a_{n}=n\) and \(a_{i}\) was randomly generated in (1, n), \(i=2, \ldots , n-1\). The starting point was set to the vector of all ones. From Fig. 1, we see that \(\tilde{\alpha }_{k}\) approximates \(1 / \lambda _{n}\) with satisfactory accuracy after a small number of iterations under both the MG and Dai–Yang methods.
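This behavior is easy to reproduce. The sketch below (our own small-scale reconstruction, with \(\psi (A)=I\) and a fixed diagonal \(A={\text {diag}}\{1,\ldots ,5\}\) rather than the paper's random spectrum) runs the MG method and evaluates \(\tilde{\alpha }_{k}\) at the final gradient:

```python
# Our own small-scale version of the experiment: MG iterations on
# f(x) = 0.5 x^T A x with A = diag(1,...,5); alpha~ should approach 1/5.

def alpha_tilde(diag_a, g):
    c = [sum(a ** j * gi * gi for a, gi in zip(diag_a, g)) for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] * c[2]
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = max(phi2 * phi2 - 4.0 * phi1 * phi3, 0.0)   # guard rounding
    return (phi2 - disc ** 0.5) / (2.0 * phi1)

diag_a = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0] * 5                          # starting point of all ones
for k in range(200):
    g = [a * xi for a, xi in zip(diag_a, x)]
    # MG stepsize g^T A g / g^T A^2 g
    alpha = sum(a * gi * gi for a, gi in zip(diag_a, g)) / \
            sum(a * a * gi * gi for a, gi in zip(diag_a, g))
    x = [xi - alpha * gi for xi, gi in zip(x, g)]

g = [a * xi for a, xi in zip(diag_a, x)]
print(alpha_tilde(diag_a, g))          # close to 1/lambda_n = 0.2
```

As the MG iterates align with the subspace spanned by the extreme eigenvectors, \(\tilde{\alpha }_{k}\) settles near \(1/\lambda _{n}\), mirroring the behavior reported in Fig. 1.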
3 A new gradient method
In this section, we propose a new algorithm for unconstrained quadratic optimization and present its R-linear convergence result.
Extensive studies indicate that adaptively choosing a short stepsize or \(\alpha _{k}^{BB1}\) at each iteration is numerically better than the original BB method; see for example [28, 31,32,33,34,35]. Now we show that the new stepsize \(\tilde{\alpha }_{k}\) is a short one.
Theorem 4
Under the conditions of Lemma 1, it holds that \(\tilde{\alpha }_{k} < c_{2}/c_{3}\).
Proof
According to (13) and (14), we get
$$\begin{aligned} \frac{c_{2}}{c_{3}}-\tilde{\alpha }_{k}=\frac{c_{2}\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}+\left( c_{2}\phi _{2}-2c_{3}\phi _{3}\right) }{c_{3}\left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) }. \end{aligned}$$(35)
If \(c_{2}\phi _{2} - 2 c_{3} \phi _{3} > 0\), we have \(c_{2}/c_{3} - \tilde{\alpha }_{k} > 0\); otherwise, it follows from (35) that
$$\begin{aligned} \frac{c_{2}}{c_{3}}-\tilde{\alpha }_{k} =\frac{c_{2}^{2}\left( \phi _{2}^{2}-4\phi _{1}\phi _{3}\right) -\left( c_{2}\phi _{2}-2c_{3}\phi _{3}\right) ^{2}}{c_{3}\left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) \Delta } =\frac{4\phi _{3}^{2}\left( c_{2}c_{4}-c_{3}^{2}\right) }{c_{3}\left( \phi _{2}+\sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}}\right) \Delta }, \end{aligned}$$(36)
where the last equality is due to \(c_{3}\phi _{2} = c_{2}\phi _{1} +c_{4} \phi _{3}\) and \(\Delta = c_{2} \sqrt{\phi _{2}^{2}-4\phi _{1}\phi _{3}} -\left( c_{2}\phi _{2} - 2 c_{3}\phi _{3}\right) >0\). From \(\phi _{1}, \phi _{3}>0\) and \(c_{2}c_{4}-c_{3}^2 > 0\), we know that \(\tilde{\alpha }_{k}<c_{2}/c_{3}\) holds. This completes our proof.\(\square\)
Based on the above analysis, we can develop a gradient method using \(\tilde{\alpha }_{k}\) and \(\alpha _{k}^{BB1}\) in an adaptive way. Notice that reusing a retarded short stepsize for some iterations can reduce the computational cost and yield better performance, see [29, 36, 25, 13, 7]. So, we suggest combining the adaptive and cyclic schemes with \(\alpha _{k}^{BB1}\) and \(\tilde{\alpha }_{k-1}\). In particular, our method reuses \(\tilde{\alpha }_{k-1}\) for r iterations when \(\alpha _{k}^{\mathrm {BB} 2} / \alpha _{k}^{\mathrm {BB} 1}<\tau\) for some \(\tau \in (0,1)\); otherwise, we set \(\alpha _{k} = \alpha _{k}^{BB1}\). We use t as the index to keep track of the number of short stepsizes chosen during the iterative process. The stepsize for our algorithm is summarized as
We mention that the parameter \(\tau\) can be chosen dynamically as in [24]. However, in our tests the dynamic scheme does not show much advantage over the above fixed one. Our method is formally presented in Algorithm 1.
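To make the adaptive cyclic scheme concrete, here is a simplified sketch in our own notation (with \(\psi (A)=I\) and a diagonal A; the counter t and other bookkeeping of the actual Algorithm 1 are reduced to a plain reuse counter, so details may differ from the paper's pseudocode):

```python
# Simplified sketch of Algorithm 1 (our own reading, psi(A) = I, diagonal A):
# take alpha_BB1 by default; when alpha_BB2/alpha_BB1 < tau, compute the short
# stepsize alpha~_{k-1} from the previous gradient and reuse it for r steps.

def algorithm1(diag_a, x0, tau=0.3, r=5, tol=1e-9, max_iter=5000):
    x = list(x0)
    g = [a * xi for a, xi in zip(diag_a, x)]
    g0_norm = sum(gi * gi for gi in g) ** 0.5
    g_prev, reuse_left, alpha_short = None, 0, None
    for k in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 <= tol * g0_norm:
            return x, k
        if g_prev is None:
            # first step: exact steepest-descent stepsize
            alpha = sum(gi * gi for gi in g) / sum(a * gi * gi for a, gi in zip(diag_a, g))
        elif reuse_left > 0:
            alpha, reuse_left = alpha_short, reuse_left - 1
        else:
            # moments of the previous gradient; BB1 = c0/c1, BB2 = c1/c2
            c = [sum(a ** j * gi * gi for a, gi in zip(diag_a, g_prev)) for j in range(5)]
            bb1, bb2 = c[0] / c[1], c[1] / c[2]
            phi1 = c[1] * c[4] - c[2] * c[3]
            if bb2 / bb1 < tau and phi1 > 0.0:
                phi2 = c[0] * c[4] - c[2] * c[2]
                phi3 = c[0] * c[3] - c[1] * c[2]
                disc = max(phi2 * phi2 - 4.0 * phi1 * phi3, 0.0)
                alpha_short = (phi2 - disc ** 0.5) / (2.0 * phi1)
                alpha, reuse_left = alpha_short, r - 1
            else:
                alpha = bb1
        g_prev = g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        g = [a * xi for a, xi in zip(diag_a, x)]
    return x, max_iter

sol, iters = algorithm1([float(i) for i in range(1, 21)], [1.0] * 20)
print(iters)
```

The guard on \(\phi _{1}\) simply falls back to the BB1 stepsize in the degenerate case where the gradient is (numerically) an eigenvector.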
To establish R-linear convergence of Algorithm 1, we first show the boundedness of \(\tilde{\alpha }_{k}\).
Let us denote
$$\begin{aligned} f_{1}(\alpha )=\frac{c_{0}-2 c_{1} \alpha +c_{2} \alpha ^{2}}{c_{1}-2 c_{2} \alpha +c_{3} \alpha ^{2}} \end{aligned}$$
and
$$\begin{aligned} f_{2}(\alpha )=\frac{c_{1}-2 c_{2} \alpha +c_{3} \alpha ^{2}}{c_{2}-2 c_{3} \alpha +c_{4} \alpha ^{2}}. \end{aligned}$$
Using the same arguments as for (12) and Lemma 1, we know that both \(f_{1}'(\alpha )=0\) and \(f_{2}'(\alpha )=0\) have two roots. In addition, the roots of \(f_{1}'(\alpha )=0\), say \(\tilde{\beta }_{k}\) and \(\hat{\beta }_{k}\) with \(\tilde{\beta }_{k}\le \hat{\beta }_{k}\), are the solutions of
$$\begin{aligned} \phi _{4} \alpha ^{2}-\phi _{3} \alpha +\phi _{5}=0, \end{aligned}$$(37)
and the roots of \(f_{2}'(\alpha )=0\), say \(\tilde{\gamma }_{k}\) and \(\hat{\gamma }_{k}\) with \(\tilde{\gamma }_{k}\le \hat{\gamma }_{k}\), satisfy
$$\begin{aligned} \phi _{6} \alpha ^{2}-\phi _{1} \alpha +\phi _{4}=0, \end{aligned}$$(38)
where \(\phi _{1}\), \(\phi _{3}\) are defined in the previous section, and \(\phi _{4}=c_{1} c_{3}-c_{2}^{2}\), \(\phi _{5}=c_{0} c_{2}-c_{1}^{2}\) and \(\phi _{6}=c_{2} c_{4}-c_{3}^{2}\). Recall that \(F(\alpha )=f_{1}(\alpha ) f_{2}(\alpha )\). Since \(f_{1}(\alpha )>0\) and \(f_{2}(\alpha )>0\), any root of \(F'(\alpha ) = f_{1}'(\alpha )f_{2}(\alpha ) + f_{1}(\alpha )f_{2}'(\alpha )=0\) yields
$$\begin{aligned} \frac{f_{1}'(\alpha )}{f_{1}(\alpha )}=-\frac{f_{2}'(\alpha )}{f_{2}(\alpha )}. \end{aligned}$$(40)
If the root is such that \(f_{1}'(\alpha )\le 0\) and \(f_{2}'(\alpha )\ge 0\), we must have
$$\begin{aligned} \tilde{\beta }_{k} \le \alpha \le \hat{\beta }_{k}. \end{aligned}$$(41)
Similarly, for the case \(f_{1}'(\alpha )\ge 0\) and \(f_{2}'(\alpha )\le 0\), we get
$$\begin{aligned} \tilde{\gamma }_{k} \le \alpha \le \hat{\gamma }_{k}. \end{aligned}$$(42)
Now we are ready to prove that \(\tilde{\alpha }_{k}\) is bounded between \(1/\lambda _{n}\) and \(1/\lambda _{1}\).
Theorem 5
Under the conditions of Lemma 1, it holds that \(1/\lambda _{n}\le \tilde{\alpha }_{k} \le 1/\lambda _{1}\).
Proof
Since \(F'(\tilde{\alpha }_{k}) =0\), when \(f_{1}'(\tilde{\alpha }_{k})\le 0\), it follows from (41) that
$$\begin{aligned} \tilde{\beta }_{k} \le \tilde{\alpha }_{k} \le \hat{\beta }_{k}. \end{aligned}$$
For the case \(f_{1}'(\tilde{\alpha }_{k})\ge 0\), by (42), we get
$$\begin{aligned} \tilde{\gamma }_{k} \le \tilde{\alpha }_{k} \le \hat{\gamma }_{k}. \end{aligned}$$
Thus, we only need to prove that \(1/\lambda _{n}\le \hat{\beta }_{k},~ \tilde{\beta }_{k},~\hat{\gamma }_{k},~ \tilde{\gamma }_{k} \le 1/\lambda _{1}\).
To proceed, we first show that
$$\begin{aligned} \hat{\beta }_{k}=f_{1}(\tilde{\beta }_{k}). \end{aligned}$$(43)
Since \(f_{1}'(\tilde{\beta }_{k}) = 0\), we have
which implies that
where the last equality comes from the fact that \(\tilde{\beta }_{k}\hat{\beta }_{k}=\phi _{5}/\phi _{4}\).
In order to prove (43), it suffices to show that
that is,
which holds due to \(\tilde{\beta }_{k}+\hat{\beta }_{k}=\phi _{3}/\phi _{4}\) and the definitions of \(c_{1}, c_{2}, c_{3}, \phi _{3}, \phi _{4}, \phi _{5}\).
Thus, (43) is true. It follows from the definition of \(f_1\) and the Rayleigh quotient property that \(1/\lambda _{n} \le \hat{\beta }_{k}=f_{1}(\tilde{\beta }_{k})\le 1/\lambda _{1}\).
Using the same arguments as above, we get \(1/\lambda _{n} \le \tilde{\beta }_{k},~\tilde{\gamma }_{k},~\hat{\gamma }_{k} \le 1/\lambda _{1}\). This completes the proof. \(\square\)
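Theorems 4 and 5 can be illustrated numerically. The sketch below (our own, with \(\psi (A)=I\) and a diagonal A, without loss of generality) checks that \(\tilde{\alpha }_{k}\) lies in \([1/\lambda _{n}, 1/\lambda _{1}]\) and below \(c_{2}/c_{3}\) for several gradients:

```python
# Numerical illustration of Theorems 4 and 5 with psi(A) = I (our own sketch).

def moments_and_alpha(diag_a, g):
    c = [sum(a ** j * gi * gi for a, gi in zip(diag_a, g)) for j in range(5)]
    phi1 = c[1] * c[4] - c[2] * c[3]
    phi2 = c[0] * c[4] - c[2] * c[2]
    phi3 = c[0] * c[3] - c[1] * c[2]
    disc = phi2 * phi2 - 4.0 * phi1 * phi3
    return c, (phi2 - disc ** 0.5) / (2.0 * phi1)

diag_a = [1.0, 3.0, 7.0, 10.0]          # lambda_1 = 1, lambda_n = 10
for g in ([1.0, 2.0, -1.0, 0.5], [4.0, -3.0, 2.0, 1.0], [0.1, 1.0, 1.0, 0.1]):
    c, at = moments_and_alpha(diag_a, g)
    assert 0.1 - 1e-12 <= at <= 1.0 + 1e-12   # Theorem 5
    assert at < c[2] / c[3]                    # Theorem 4
    print(round(at, 4))
```

The gradients here are arbitrary test vectors; any gradient not parallel to \(Ag\) satisfies the hypotheses of Lemma 1, so the two bounds hold throughout.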
In [37], the authors prove that any gradient method with stepsizes satisfying the following Property B converges R-linearly with rate \(1-\lambda _{1}/M_1\), which implies a \(1-1/\kappa\) rate when \(M_1\le \lambda _n\). Similar results for gradient methods satisfying Property A in [38] can be found in [39]. However, a stepsize satisfying Property B may not meet the conditions of Property A.
Property B: We say that the stepsize \(\alpha _{k}\) has Property B if there exist an integer m and a positive constant \(M_{1} \ge \lambda _{1}\) such that
- (i) \(\lambda _{1} \le \alpha _{k}^{-1} \le M_{1}\);
- (ii) for some real analytic function \(\psi\) defined as in (7) and \(v(k) \in \{k,k-1,\ldots ,\max \{k-m+1,0\}\}\),
$$\begin{aligned} \alpha _{k} \le \frac{g_{v(k)}^{T} \psi ^{2}(A) g_{v(k)}}{g_{v(k)}^{T} A\psi ^{2}(A) g_{v(k)}}. \end{aligned}$$(44)
Clearly, \(\alpha _k^{BB1}\) satisfies Property B with \(M_1=\lambda _n\), \(m=2\) and \(\psi ^{2}(A)=I\), and the new stepsize \(\tilde{\alpha }_{k}\) satisfies Property B with \(M_1=\lambda _n\). So, we are able to show R-linear convergence of Algorithm 1 with the \(1-1/\kappa\) rate, following Theorem 2 in [37]. For completeness, we include the proof here.
Theorem 6
Suppose that the sequence \(\left\{ g_{k}\right\}\) is generated by Algorithm 1. Then either \(g_{k} = 0\) for some finite k or the sequence \(\left\{ \left\| g_{k}\right\| _{2}\right\}\) converges to zero R-linearly in the sense that
where \({\theta }=1-1/\kappa\) and
with \(\sigma _{i}=\max \left\{ \frac{\lambda _{i}}{\lambda _{1}}-1,1-\frac{\lambda _{i}}{\lambda _{n}}\right\}\).
Proof
It follows from \(g_{k+1}^{(i)}=\left( 1-\alpha _{k} \lambda _{i}\right) g_{k}^{(i)}\) and (i) of Property B that
$$\begin{aligned} \left| g_{k+1}^{(i)}\right| \le \sigma _{i}\left| g_{k}^{(i)}\right| . \end{aligned}$$(46)
Clearly, (45) holds for \(i=1\).
In what follows, we prove (45) by induction on i. Assume that (45) holds for all \(1 \le i \le L-1\) with \(L \in \{2, \ldots , n\}\). When \(i=L\), it follows from the definition of \(C_{L}\) that (45) holds for \(k = 0, 1,\ldots , r-1\). Assume by contradiction that (45) does not hold for \(k\ge r\) when \(i=L\). Let \(\hat{k}\ge r\) be the minimal index such that \(\left| g_{\hat{k}}^{(L)}\right| > C_{L} \theta ^{\hat{k}}\). Then, if \(\alpha _{\hat{k}-1} \lambda _{L} \le 1\), we get
which contradicts our assumption. Thus we must have \(\alpha _{\hat{k}-1} \lambda _{L} > 1\), which combines with (46) and \(\theta <1\) gives that
where \(G(k, l)=\sum _{i=1}^{l}\left( \psi \left( \lambda _{i}\right) g_{k}^{(i)}\right) ^{2}\). This together with Theorem 4 yields that
Using similar arguments, we can show \(\left| 1-\alpha _{\hat{k}}^{BB1}\lambda _{L}\right| \le \theta\). Thus, \(\left| g_{\hat{k}}^{(L)}\right| \le \theta \left| g_{\hat{k}-1}^{(L)}\right| \le C_{L} \theta ^{\hat{k}}\) which contradicts our assumption. Hence (45) holds for all i. We complete the proof. \(\square\)
4 Numerical experiments
In this section, we provide numerical experiments of Algorithm 1 on solving quadratic problems. All our codes are written in MATLAB R2016b and carried out on a PC with an AMD Ryzen 5 PRO 2500U, 2.00 GHz processor and 8 GB of RAM running Windows 10 system.
For Algorithm 1, we use \(\tilde{\alpha }_{k}\) with \(\psi (A) =I\). If we compute the vector \(z = Ag_{k}\) at each iteration and keep \(w = Ag_{k-1}\) in memory, then, since \(g_{k}=g_{k-1}-\alpha _{k-1}w\) implies \(Aw=(w-z)/\alpha _{k-1}\), the quantities needed by \(\tilde{\alpha }_{k-1}\) can be expressed as
$$\begin{aligned} c_{0}=g_{k-1}^{T} g_{k-1},\quad c_{1}=g_{k-1}^{T} w,\quad c_{2}=w^{T} w,\quad c_{3}=\frac{w^{T}(w-z)}{\alpha _{k-1}},\quad c_{4}=\frac{\Vert w-z\Vert _{2}^{2}}{\alpha _{k-1}^{2}}. \end{aligned}$$
Hence only one matrix-vector product per iteration is required in Algorithm 1.
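This saving can be verified directly. In the sketch below (our own code, with a small dense SPD matrix and hypothetical values), the identity \(Aw=(w-z)/\alpha _{k-1}\) recovers \(Aw\), and hence the moments \(c_{0},\ldots ,c_{4}\) of \(g_{k-1}\), without a second product with A:

```python
# Our own check of the single matrix-vector-product remark: with z = A g_k and
# w = A g_{k-1}, the update g_k = g_{k-1} - alpha_{k-1} w gives
# A w = (w - z)/alpha_{k-1}, so the moments c_j of g_{k-1} need no extra product.

def matvec(A, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]              # symmetric positive definite (example)
g_prev = [1.0, -2.0, 0.5]          # g_{k-1} (hypothetical values)
alpha_prev = 0.3                   # alpha_{k-1} (any nonzero value works)

w = matvec(A, g_prev)              # A g_{k-1}, kept in memory
g = [gi - alpha_prev * wi for gi, wi in zip(g_prev, w)]
z = matvec(A, g)                   # A g_k, computed anyway at iteration k

Aw = [(wi - zi) / alpha_prev for wi, zi in zip(w, z)]   # no extra product
c = [dot(g_prev, g_prev), dot(g_prev, w), dot(w, w), dot(w, Aw), dot(Aw, Aw)]
print(c)

# direct computation agrees
assert all(abs(a - b) < 1e-10 for a, b in zip(Aw, matvec(A, w)))
```

Here \(c_{3}=g_{k-1}^{T}A^{3}g_{k-1}=w^{T}(Aw)\) and \(c_{4}=(Aw)^{T}(Aw)\), so all five moments come from inner products of vectors already at hand.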
Firstly, we compare Algorithm 1 with the BB1 [8], DY [23], ABBmin2 [28], SDC [29] and MGC [7] methods and the Alg.1 method in [25] (we denote this method by SL) on solving the following quadratic problem [22]:
where \(V = \mathrm {diag}\left\{ v_{1}, \ldots , v_{n}\right\}\) is a diagonal matrix and \(x^{*}\) is randomly generated with components in \([-10,10]\). Five different distributions of the diagonal elements \(v_{j}, j=1,2, \ldots , n\), are generated, see Table 1.
The parameter m in SL uses the value \(m=5\) for the second problem set and \(m = 6\) for other sets to get good performance. For ABBmin2, \(\tau\) is set to 0.9 as suggested in [28]. For SDC and MGC, the parameter pairs (h, s) and \((d_{1}, d_{2})\) are all set to (8, 6), which are more efficient than other choices for this test. The parameters in Algorithm 1 are set to \(\tau = 0.3\) and \(r = 5\).
For all comparison methods, the stopping condition is \(\left\| g_{k}\right\| _{2} \le \epsilon \left\| g_{0}\right\| _{2}\), where \(\epsilon >0\) is a given tolerance. The problem dimension is set to \(n = 1000\). For each problem set, three different tolerance parameters \(\epsilon = 10^{-6}, 10^{-9}, 10^{-12}\) and condition numbers \(\kappa = 10^{4}\), \(10^{5}, 10^{6}\) are tested. For each value of \(\kappa\) or \(\epsilon\), 10 starting points are randomly generated in \([-10, 10]\) and the average results are presented in Table 2.
We see that Algorithm 1 clearly outperforms the BB1, DY, SDC and MGC methods. As compared with the SL method, Algorithm 1 is usually faster when a high accuracy is required. Moreover, Algorithm 1 often performs better than ABBmin2 on the second to fourth problem sets and is very competitive with ABBmin2 on the first and last problem sets.
Next, we compare the above methods on problem (4), where \(b = Ax^{*}\) and \(x^{*}\) is a random vector as before. We test 14 sparse matrices from the SuiteSparse Matrix Collection [30] listed in Table 3. The iteration stops when \(\left\| g_{k}\right\| _{2} \le 10^{-6}\left\| g_{0}\right\| _{2}\). In our test, we choose the parameters so that each method achieves the best performance. Specifically, we set \(m = 4\) for SL, \((h, s) = (3, 4)\) for SDC, \((d_{1},d_{2}) = (4,4)\) for MGC, \(\tau = 0.3\) for ABBmin2, and \(\tau = 0.1\) and \(r = 5\) for Algorithm 1.
For each matrix, ten initial points with entries between \(-10\) and 10 are randomly generated. The average numbers of iterations of the compared methods are listed in Table 4. We see that for most matrices Algorithm 1 performs better than BB1, DY, SL, MGC and SDC, and is competitive with ABBmin2 in terms of the number of iterations. Table 5 lists the average CPU time in seconds for those methods. We observe that Algorithm 1 takes less CPU time than BB1, DY, MGC and ABBmin2, and is as fast as SL and SDC. One important reason for this is that our method reuses the stepsize \(\tilde{\alpha }_{k-1}\) for r iterations, which reduces the computational cost.
5 Conclusions and discussions
For a uniform analysis, we considered a family of gradient methods whose stepsize is provided by \(\alpha _{k}\) in (9), which includes the Dai–Yang method (5) as a special case. It is proved that each method in the family zigzags between two directions in a subspace spanned by the two eigenvectors corresponding to the smallest and largest eigenvalues of the Hessian. In order to achieve the two dimensional quadratic termination of the family, we derived a short stepsize \(\tilde{\alpha }_{k}\) in (14) that converges to \(1/\lambda _{n}\). By using the long BB stepsize and \(\tilde{\alpha }_{k-1}\) in a new adaptive cyclic manner, we designed Algorithm 1 for unconstrained quadratic optimization. We proved that Algorithm 1 converges R-linearly at a rate of \(1-1/\kappa\). Our numerical results on minimizing quadratic functions indicate the efficiency of Algorithm 1 over other recent successful gradient methods.
By using the same arguments as those in the proof of Theorem 1, we find that the stepsizes \(\tilde{\beta }_{k}\) and \(\tilde{\gamma }_{k}\) determined by (37) and (38) achieve the two dimensional quadratic termination, respectively. For the n dimensional quadratic problem, by Theorem 1 in [5], we obtain
$$\begin{aligned} \lim _{k \rightarrow \infty } \tilde{\beta }_{k}=1/\lambda _{n} \end{aligned}$$
and
$$\begin{aligned} \lim _{k \rightarrow \infty } \tilde{\gamma }_{k}=1/\lambda _{n}. \end{aligned}$$
In addition, we have \(\lim _{k \rightarrow \infty } \hat{\alpha }_{k}=1/\lambda _{1}\). However, \(\hat{\alpha }_{k}\), \(\hat{\beta }_{k}\) and \(\hat{\gamma }_{k}\) would not be good approximations of \(1/\lambda _{1}\); see [36] for more details. It is worth noting that the stepsize proposed in [28] is a special case of \(\tilde{\beta }_{k}\) with \(\psi (A)=I\). Moreover, it is not difficult to prove that \(\tilde{\beta }_{k}\) and \(\tilde{\gamma }_{k}\) are short stepsizes in the sense that \(\tilde{\beta }_{k}<c_{2}/c_{3}\) and \(\tilde{\gamma }_{k}<c_{1}/c_{2}\). Hence, we can replace \(\tilde{\alpha }_{k-1}\) in Algorithm 1 by \(\tilde{\beta }_{k-1}\) or \(\tilde{\gamma }_{k-1}\), which leads to two variants of Algorithm 1. Preliminary experimental results show that the two variants are competitive with Algorithm 1. Furthermore, we can obtain the same convergence results as in Theorem 6.
The results of this paper show that the two dimensional quadratic termination property is useful for designing efficient gradient methods. It would be interesting to develop new algorithms for solving general unconstrained optimization problems based on such a property. We leave this as our future work.
6 Declarations
All data generated or analyzed during this study are included in this published article and are also available from the corresponding author on reasonable request.
References
Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
Forsythe, G.E.: On the asymptotic directions of the \(s\)-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 7 (2022)
Nocedal, J., Sartenaer, A., Zhu, C.: On the behavior of the gradient norm in the steepest descent method. Comput. Optim. Appl. 22(1), 5–35 (2002)
Zou, Q., Magoulès, F.: Fast gradient methods with alignment for symmetric linear systems without using Cauchy step. J. Comput. Appl. Math. 381, 113033 (2021)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Raydan, M.: On the Barzilai-Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Dai, Y.H., Liao, L.Z.: \(R\)-linear convergence of the Barzilai-Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
Fletcher, R.: On the Barzilai–Borwein method. In: Optimization and control with applications, pp. 235–256. Springer, New York (2005)
Raydan, M.: The Barzilai-Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Yuan, Y.X.: Step-sizes for the gradient method. AMS/IP Stud. Adv. Math. 42(2), 785–796 (2008)
Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
Birgin, E.G., Martínez, J.M., Raydan, M., et al.: Spectral projected gradient methods: review and perspectives. J. Stat. Softw. 60(3), 539–559 (2014)
Dai, Y.H., Huang, Y.K., Liu, X.W.: A family of spectral gradient methods for optimization. Comput. Optim. Appl. 74(1), 43–65 (2019)
Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton’s method. SIAM J. Numer. Anal. 23(4), 707–716 (1986)
Huang, Y.K., Liu, H.: Smoothing projected Barzilai-Borwein method for constrained non-lipschitz optimization. Comput. Optim. Appl. 65(3), 671–698 (2016)
Huang, Y.K., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai-Borwein method for nonnegative matrix factorization. Data Min. Knowl. Discov. 29(6), 1665–1684 (2015)
Jiang, B., Dai, Y.H.: Feasible Barzilai-Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Method Softw. 28(4), 756–784 (2013)
Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
Huang, Y.K., Dai, Y.H., Liu, X.W.: Equipping the Barzilai-Borwein method with the two dimensional quadratic termination property. SIAM J. Optim. 31(4), 3068–3096 (2021)
Sun, C., Liu, J.P.: New stepsizes for the gradient method. Optim. Lett. 14(7), 1943–1955 (2020)
Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comput. Optim. Appl. 33(1), 73–88 (2006)
Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6), 1645–1661 (1994)
Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299–312 (2008)
De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comput. Optim. Appl. 59(3), 541–563 (2014)
Davis, T.A., Hu, Y.: The university of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1–25 (2011)
Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai-Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30(2), 1300–1326 (2020)
Dai, Y.H., Fletcher, R.: Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)
Dai, Y.H., Hager, W.W., Schittkowski, K., Zhang, H.: The cyclic Barzilai-Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)
Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: On the acceleration of the Barzilai-Borwein method. Comput. Optim. Appl. 81(3), 717–740 (2022)
Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comput. Optim. Appl. 35(1), 69–86 (2006)
Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. Optim. Method Softw. 35(4), 681–705 (2020)
Li, X., Huang, Y.K.: A note on the \(R\)-linear convergence of nonmonotone gradient methods. arXiv preprint, arXiv:2207.05912 (2022)
Dai, Y.H.: Alternate step gradient method. Optimization 52(4–5), 395–415 (2003)
Huang, N.: On \(R\)-linear convergence analysis for a class of gradient methods. Comput. Optim. Appl. 81(1), 161–177 (2022)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 11701137) and Natural Science Foundation of Hebei Province (Grant No. A2021202010).
Cite this article
Li, X., Huang, Y. A gradient method exploiting the two dimensional quadratic termination property. Optim Lett 17, 1413–1434 (2023). https://doi.org/10.1007/s11590-022-01936-z