Abstract
We consider the asymptotic behavior of a family of gradient methods, which includes the steepest descent and minimal gradient methods as special instances. It is proved that each method in the family will asymptotically zigzag between two directions. Asymptotic convergence results for the objective value, gradient norm, and stepsize are presented as well. To accelerate the family of gradient methods, we further exploit spectral properties of stepsizes to break the zigzagging pattern. In particular, a new stepsize is derived by imposing finite termination on minimizing a two-dimensional strictly convex quadratic function. It is shown that, for a general quadratic function, the proposed stepsize asymptotically converges to the reciprocal of the largest eigenvalue of the Hessian. Furthermore, based on this spectral property, we propose a periodic gradient method by incorporating the Barzilai-Borwein method. Numerical comparisons with some recent successful gradient methods show that our new method is very promising.
1 Introduction
The gradient method is well known for solving the following unconstrained optimization problem, especially when the dimension n is large:
where \(f: \mathbb {R}^n \rightarrow \mathbb {R}\) is continuously differentiable. In particular, at the k-th iteration, gradient methods update the iterate by
where \(g_k=\nabla f(x_k)\) and \(\alpha _k>0\) is the stepsize determined by the method.
One of the simplest nontrivial nonlinear instances of (1) is the quadratic optimization problem
where \(b\in \mathbb {R}^n\) and \(A\in \mathbb {R}^{n\times n}\) is symmetric and positive definite. Solving (3) efficiently is usually a prerequisite for a method to be generalized to broader classes of problems. In addition, by Taylor's expansion, a general smooth function can be approximated by a quadratic function near a minimizer, so the local convergence behavior of gradient methods is often revealed by their behavior on (3). Hence, in this paper, we study the convergence behavior of gradient methods and propose efficient methods for solving (3).
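For concreteness, the following sketch (in Python, with illustrative names) implements the quadratic model and the generic gradient iteration (2), assuming the standard form \(f(x)=\frac{1}{2}x^TAx-b^Tx\) with gradient \(g(x)=Ax-b\):

```python
import numpy as np

def quadratic(A, b, x):
    """Objective f(x) = 0.5 * x^T A x - b^T x of the model problem (3)."""
    return 0.5 * x @ (A @ x) - b @ x

def gradient(A, b, x):
    """Gradient g(x) = A x - b."""
    return A @ x - b

def gradient_method(A, b, x0, stepsize, tol=1e-10, max_iter=10000):
    """Generic update (2): x_{k+1} = x_k - alpha_k * g_k, where the stepsize
    rule alpha_k = stepsize(A, g_k) is supplied by the particular method."""
    x = x0.copy()
    for _ in range(max_iter):
        g = gradient(A, b, x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - stepsize(A, g) * g
    return x
```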
The classic steepest descent (SD) method proposed by Cauchy [4] solves (3) by using the exact stepsize
Although \(\alpha _k^{SD}\) minimizes f along the steepest descent direction, the SD method often performs poorly in practice and has a linear convergence rate [1, 18] as
where \(f^*\) is the optimal function value of (3) and \(\kappa =\lambda _n/\lambda _1\) is the condition number of A, with \(\lambda _1\) and \(\lambda _n\) being the smallest and largest eigenvalues of A, respectively. Thus, if \(\kappa \) is large, the SD method may converge very slowly. In addition, Akaike [1] proved that the gradients asymptotically alternate between two directions in the subspace spanned by the two eigenvectors corresponding to \(\lambda _1\) and \(\lambda _n\). Consequently, the SD method often exhibits a zigzagging phenomenon near the solution. Forsythe [18] generalized Akaike's results to the so-called optimum s-gradient method, and Pronzato et al. [27] further generalized them to the so-called P-gradient method in Hilbert space. Recently, by employing Akaike's results, Nocedal et al. [26] presented some insights into the asymptotic behavior of the SD method in terms of function values, stepsizes, and gradient norms.
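For reference, a classical statement of this linear rate under the exact line search is the bound
\[
f(x_{k+1})-f^* \le \left(\frac{\kappa -1}{\kappa +1}\right)^2\bigl(f(x_k)-f^*\bigr),
\]
which follows from the Kantorovich inequality.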
Contrary to the SD method, the minimal gradient (MG) method (see Dai and Yuan [10]) computes its stepsize by minimizing the gradient norm,
where \(\Vert \cdot \Vert \) is the 2-norm. It is widely accepted that the MG method can also perform poorly and has asymptotic behavior similar to that of the SD method, i.e., it will asymptotically zigzag in a two-dimensional subspace. Zhou et al. [33] provided some interesting analyses of \(\alpha _k^{MG}\) for minimizing two-dimensional quadratics. However, rigorous asymptotic convergence results for the MG method on general quadratic functions are very limited in the literature.
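A minimal sketch of these two exact stepsizes, using the standard closed forms \(\alpha _k^{SD}=\frac{g_k^Tg_k}{g_k^TAg_k}\) and \(\alpha _k^{MG}=\frac{g_k^TAg_k}{g_k^TA^2g_k}\), is given below; either function can be passed as the `stepsize` argument of the `gradient_method` sketch above.

```python
def sd_stepsize(A, g):
    """Cauchy's exact stepsize: minimizes f along -g."""
    Ag = A @ g
    return (g @ g) / (g @ Ag)

def mg_stepsize(A, g):
    """Minimal gradient stepsize: minimizes ||g_{k+1}|| along -g."""
    Ag = A @ g
    return (g @ Ag) / (Ag @ Ag)
```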
To avoid the zigzagging pattern, it is useful to determine the stepsize without performing an exact line search, since the exact stepsize yields a new gradient perpendicular to the current one. Barzilai and Borwein [2] proposed the following two novel stepsizes:
where \(s_{k-1}=x_k-x_{k-1}\) and \(y_{k-1}=g_k-g_{k-1}\). The Barzilai-Borwein (BB) method (7) performs quite well in practice, even though it generates a nonmonotone sequence of objective values. Due to its simplicity and efficiency, the BB method has been widely studied [6,7,8, 17, 28] and extended to general problems and various applications [3, 22,23,24,25, 29]. Another line of research for breaking the zigzagging pattern and accelerating convergence is to occasionally apply short stepsizes that approximate \(1/\lambda _n\) in order to eliminate the corresponding component of the gradient. One seminal work is due to Yuan [31, 32], who derived the following stepsize:
Dai and Yuan [11] further suggested a new gradient method with
The Dai-Yuan (DY) method (9) is a monotone method and appears very competitive with the nonmonotone BB method. Recently, by employing the results in [1] and [26], De Asmundis et al. [12] showed that the stepsize \(\alpha _k^{Y}\) converges to \(1/\lambda _n\) when the SD method is applied to problem (3). This spectral property is the key to breaking the zigzagging pattern.
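Returning to the BB stepsizes in (7), they can be computed directly from the differences \(s_{k-1}\) and \(y_{k-1}\) defined above; a brief sketch:

```python
def bb_stepsizes(s, y):
    """BB stepsizes from (7), with s = x_k - x_{k-1} and y = g_k - g_{k-1}:
    BB1 = s^T s / (s^T y) and BB2 = s^T y / (y^T y)."""
    return (s @ s) / (s @ y), (s @ y) / (y @ y)
```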
In [9], Dai and Yang developed the asymptotic optimal gradient (AOPT) method whose stepsize is given by
Unlike the DY method, the AOPT method uses only one stepsize. In addition, it is shown in [9] that \(\alpha _k^{AOPT}\) asymptotically converges to \(\frac{2}{\lambda _1+\lambda _n}\), which is in some sense an optimal stepsize since it minimizes \(\Vert I-\alpha A\Vert \) over \(\alpha \) [9, 16]. However, the AOPT method also asymptotically alternates between two directions. To accelerate the AOPT method, Huang et al. [21] derived a new stepsize that converges to \(1/\lambda _n\) along the AOPT iterates and further suggested a gradient method that exploits spectral properties of the stepsizes. For recent developments on exploiting spectral properties to accelerate gradient methods, see [12,13,14, 20, 21] and references therein.
In this paper, we analyze the asymptotic behavior of gradient methods and develop techniques for breaking the zigzagging pattern. For a unified analysis, we consider the following stepsize
where \(\varPsi \) is a real analytic function on \([\lambda _1,\lambda _n]\) that can be expressed by the Laurent series
such that \(0<\sum _{k=-\infty }^\infty c_kz^k<+\infty \) for all \(z\in [\lambda _1,\lambda _n]\). Clearly, (11) defines a family of stepsizes and hence a family of gradient methods. When \(\varPsi (A)=A^u\) for some nonnegative integer u, we obtain the following stepsize
The stepsizes \(\alpha _k^{SD}\) and \(\alpha _k^{MG}\) correspond to the cases \(u=0\) and \(u=1\), respectively.
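Assuming the form \(\alpha _k=\frac{g_k^T\varPsi (A)g_k}{g_k^T\varPsi (A)Ag_k}\) for (11), which is consistent with the SD and MG special cases above, a sketch of the family and of its monomial sub-family \(\varPsi (A)=A^u\) is:

```python
import numpy as np

def family_stepsize(A, g, psi):
    """Stepsize of the family: g^T Psi(A) g / (g^T Psi(A) A g).
    psi maps the matrix A to Psi(A); Psi(A) = I gives SD, Psi(A) = A gives MG."""
    Pg = psi(A) @ g                      # Psi(A) g_k
    return (g @ Pg) / (Pg @ (A @ g))

def monomial_stepsize(A, g, u):
    """Monomial sub-family Psi(A) = A^u: g^T A^u g / (g^T A^{u+1} g)."""
    Aug = np.linalg.matrix_power(A, u) @ g
    return (g @ Aug) / (Aug @ (A @ g))
```

For instance, `monomial_stepsize(A, g, 0)` reproduces `sd_stepsize` and `monomial_stepsize(A, g, 1)` reproduces `mg_stepsize` from the earlier sketch.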
We present a theoretical analysis of the asymptotic convergence of the family of gradient methods whose stepsize can be written in the form (11), which justifies the zigzagging behavior of all these methods, including the SD and MG methods. In particular, we show that each method in the family (11) will asymptotically alternate between two directions associated with the two eigenvectors corresponding to \(\lambda _1\) and \(\lambda _n\). Moreover, we analyze the asymptotic behaviors of the objective value, gradient norm, and stepsize. It is shown that, when \(\varPsi (A)\ne I\), the two sequences \(\Big \{\frac{\varDelta _{2k+1}}{\varDelta _{2k}} \Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}} \Big \}\) may converge at different speeds, while the odd and even subsequences \(\Big \{\frac{\varDelta _{2k+3}}{\varDelta _{2k+1}}\Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k}}\Big \}\) converge at the same rate, where \(\varDelta _k = f(x_k) - f^* \). A similar property holds for the gradient norm sequence. In addition, each method in the family (11) has the same worst asymptotic convergence rate.
To accelerate the gradient methods (11), we investigate techniques for breaking the zigzagging pattern. We derive a new stepsize \(\tilde{\alpha }_k\) based on finite termination for minimizing two-dimensional strictly convex quadratic functions. For the n-dimensional case, we prove that \(\tilde{\alpha }_k\) converges to \(1/\lambda _n\) when the gradient methods (11) are applied to problem (3). Furthermore, based on this spectral property, we propose a periodic gradient method, which, in a periodic mode, alternately uses the BB stepsize, the stepsize (11), and our new stepsize \(\tilde{\alpha }_k\). Numerical comparisons of the proposed method with the BB [2], DY [11], ABBmin2 [19], and SDC [12] methods and Alg. 1 in [30] show that the new gradient method is very promising. Our theoretical results also significantly improve and generalize those in [1] and [26], where only the SD method (i.e., \(\varPsi (A)=I\)) is considered. We point out that [27] does not analyze the asymptotic behaviors of the objective value, gradient norm, and stepsize, although (11) is similar to their P-gradient method. Moreover, we develop techniques for accelerating these zigzagging methods with a simpler analysis. Notice that \(\alpha _k^{AOPT}\) cannot be written in the form (11); thus our results are not applicable to the AOPT method. On the other hand, the analysis of the AOPT method presented in [9] cannot be applied directly to the family of methods (11).
The paper is organized as follows. In Sect. 2, we analyze the asymptotic behaviors of the family of gradient methods (11). In Sect. 3, we accelerate the gradient methods (11) by developing techniques to break the zigzagging pattern and propose a new periodic gradient method. Numerical experiments are presented in Sect. 4. Finally, some conclusions and discussions are made in Sect. 5.
2 Asymptotic Behavior of the Family (11)
In this section, we present a unified analysis of the asymptotic behavior of the family of gradient methods (11) for general n-dimensional strictly convex quadratics.
Let \(\{\lambda _1,\lambda _2,\cdots ,\lambda _n\}\) be the eigenvalues of A, and \(\{\xi _1,\xi _2,\ldots ,\xi _n\}\) the associated orthonormal eigenvectors. Note that the gradient method is invariant under translations and rotations when applied to a quadratic function. As pointed out by Fletcher [17], gradient components corresponding to a multiple eigenvalue can be combined. Thus, for the theoretical analysis, we assume without loss of generality that
Denote the components of \(g_k\) along the eigenvectors \(\xi _i\) by \(\mu _k^{(i)}\), \(i=1,\ldots ,n\), i.e.,
The above decomposition of gradient \(g_k\) together with the update rule (2) gives that
where
Defining the vector \(q_k=\left( q_k^{(i)}\right) \) with
and
we have from (16), (17) and (18) that
In addition, by the definition of \(q_k\), we know that \(q_k^{(i)}\ge 0\) for all i and
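To make the preceding derivation concrete, recall the standard relations behind it: writing \(g_k=\sum _{i=1}^n\mu _k^{(i)}\xi _i\) and combining the update rule (2) with \(g_{k+1}=g_k-\alpha _kAg_k\), each gradient component obeys
\[
\mu _{k+1}^{(i)}=(1-\alpha _k\lambda _i)\,\mu _k^{(i)},\qquad i=1,\ldots ,n,
\]
so a component vanishes permanently once some stepsize equals \(1/\lambda _i\); this observation underlies the acceleration strategy of Sect. 3.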
Before establishing the asymptotic convergence of the family of gradient methods (11), we first give some lemmas on the properties of the sequence \(\{q_k\}\).
Lemma 1
Suppose \(p\in \mathbb {R}^n\) satisfies (i) \(p^{(i)}\ge 0\) for all \(i=1,2,\ldots ,n\); (ii) there exist at least two indices i with \(p^{(i)}>0\); and (iii) \(\sum _{i=1}^np^{(i)}=1\). Define \(T: \mathbb {R}^n\rightarrow \mathbb {R}^n\) to be the following transformation:
where
Then we have
where
In addition, (22) holds with equality if and only if there are two indices, say \(i_1\) and \(i_2\), such that \(p^{(i)}=0\) for all \(i\notin \{i_1,i_2\}\) and
Proof
It follows from the definition of Tp that
Let us define two vectors \(w = (w_i) \in \mathbb {R}^n\) and \(z = (z_i) \in \mathbb {R}^n\) by
and
Then, we have from the Cauchy-Schwarz inequality that
Using the definition of \(\gamma (p)\), we can obtain that
which together with (26) gives
Then, the inequality (22) follows immediately.
The equality in (26) holds if and only if
for some nonzero scalar C. Clearly, (29) holds when \(p^{(i)}=0\). Suppose that there exist two indices \(i_1\) and \(i_2\) such that \(p^{(i_1)},p^{(i_2)}>0\). It follows from (29) that
So, by the assumption (13), we have
which, again by assumption (13), implies that (29) holds if and only if p has only two nonzero components and (24) holds. \(\square \)
Lemma 2
Let \(p_*\in \mathbb {R}^n\) satisfy the conditions of Lemma 1 and T be the transformation (20). If \(p_*\) has only two nonzero components \(p_*^{(i_1)}\) and \(p_*^{(i_2)}\), we have
and
where the function \(\gamma \) is defined in (21). Moreover, \(p_*=Tp_*\) if and only if
Proof
By the definition of \(\gamma (p)\), we have
which indicates that
and
Then, it follows from the definition of transformation T that
This gives (30). Equation (31) can be proved similarly. By (30) and (31), we have
The result of \((T^2p_*)^{(i_2)}\) follows similarly. This proves (32).
Again by (30), (31) and the definition of function \(\gamma \) in (21), we have
Then, the equality (33) follows from (35) and (36). For (34), let
Rearranging terms and using \(p_*^{(i_1)}+p_*^{(i_2)}=1\), we have
which implies that
This together with the fact \(p_*^{(i_1)}+p_*^{(i_2)}=1\) yields (34). \(\square \)
Lemma 3
Let \(p\in \mathbb {R}^n\) satisfy the conditions of Lemma 1 and T be the transformation (20). Then, there exists a \(p_*\) satisfying
where \(p_*\) and \(Tp_*\) have only two nonzero components satisfying
for some \(i_1,i_2\in \{1,\ldots ,n\}\). Hence, (30), (31), (32) and (33) hold.
Proof
Let \(p_0=T^0p=p\) and \(p_{k}=T{p_{k-1}}=T^{k}{p_0}\). Obviously, for all \(k\ge 0\), \(p_{k}\) satisfies (i) and (iii) of Lemma 1. Let \(i_{\min } = \min \{i \in \mathcal {N}: p_0^{(i)}>0\}\) and \(i_{\max } = \max \{i \in \mathcal {N}: p_0^{(i)}>0\}\), where \(\mathcal {N} = \{1, \ldots , n\}\). From the definition of \(\gamma \), we know \(\lambda _{i_{\min }}< \gamma (p)<\lambda _{i_{\max }}\). Thus, by the definition of T, we have \(p_1^{({i_{\min }})}>0\) and \(p_1^{({i_{\max }})}>0\). Then, by induction, for all \(k\ge 0\), \(p_{k}\) satisfies (ii) of Lemma 1. So, by Lemma 1, \(\{\varTheta (p_k)\}\) is a monotonically increasing sequence. Since \(\lambda _1\le \gamma (p)\le \lambda _n\), we have \((\lambda _i-\gamma (p))^2\le (\lambda _n-\lambda _1)^2\). Hence, we have from the definition of \(\varTheta \) that \(\varTheta (p_k)\le (\lambda _n-\lambda _1)^2 \). Thus, \(\{\varTheta (p_k)\}\) is convergent. Let \(\varTheta _*=\lim _{k\rightarrow \infty }\varTheta (p_k) >0\).
Denote the set of all limit points of \(\{p_k\}\) by \(P_*\) with cardinality \(|P_*|\). Since \(\{p_k\}\) is bounded, \(|P_*|\ge 1\). For any subsequence \(\{p_{k_j}\}\) converging to some \(p_*\in P_*\), we have
by the continuity of \(\varTheta \) and T. Noticing that \(p_{k_j+1}=Tp_{k_j}\), we have \( \varTheta _*=\varTheta (p_*)=\varTheta (Tp_*)\).
Since \(p_{k}\) satisfies (i)–(iii) of Lemma 1 for all \(k\ge 0\), \(p_*\) must satisfy (i) and (iii). If \(p_*\) has only one positive component, we have \(\varTheta (p_*)=0\) which contradicts \(\varTheta (p_*) = \varTheta _* > 0 \). Hence, by Lemmas 1, 2 and \(\varTheta (p_*)=\varTheta (Tp_*)\), \(p_*\) has only two nonzero components, say \(p_*^{(i_1)}\) and \(p_*^{(i_2)}\), and their values are uniquely determined by the indices \(i_1\), \(i_2\) and the eigenvalues \(\lambda _{i_1}\) and \(\lambda _{i_2}\). This implies \(|P_*| < \infty \). Furthermore, by Lemma 2, for any \(p_*\in P_*\), \(Tp_*\) is given by (30) and (31), and \(Tp_*\in P_*\).
We now show that \(|P_*|\le 2\) by way of contradiction. Suppose \(|P_*| \ge 3 \). For any \(p_*\in P_*\) and \(Tp_*\in P_*\), let \(\delta _1\) and \(\delta _2\) be the distances from \(p_*\) to \(P_*\setminus \{p_*\}\) and from \(Tp_*\) to \(P_*\setminus \{Tp_*\}\), respectively. Since \( 3 \le |P_*| < \infty \), we have \(\delta _1>0\), \(\delta _2>0\), and there exists an infinite subsequence \(\{p_{k_j}\}\) such that
but \(p_{k_j+2}\notin \mathcal {B}\left( p_*,\frac{1}{2}\delta \right) \cup \mathcal {B}\left( Tp_*,\frac{1}{2}\delta \right) \), where \(\delta =\min \{\delta _1,\delta _2\}\) and \(\mathcal {B}(p_*,r)=\{p: \Vert p-p_*\Vert \le r\}\). However, by (32) we have \(T^2p_*=p_*\). Hence, by continuity of T,
which contradicts the choice of \(p_{k_j+2}\notin \mathcal {B}\left( p_*,\frac{1}{2}\delta \right) \). Thus, \(\{p_k\}\) has at most two limit points \(p_*\) and \(Tp_*\), and both have only two nonzero components.
Now, we assume that \(p_*\) is a limit point of \(\{p_{2k}\}\). Since \(T^2p_*=p_*\), all subsequences of \(\{p_{2k}\}\) have the same limit point, i.e., \(p_{2k}=T^{2k}p\rightarrow p_*\). Similarly, we have \(T^{2k+1}p\rightarrow Tp_*\). Then, (38) and (39) follow directly from the analysis. \(\square \)
Based on the above analysis, we can show that each gradient method in (11) will asymptotically reduce its search to the two-dimensional subspace spanned by the two eigenvectors \(\xi _1\) and \(\xi _n\).
Theorem 1
Assume that the starting point \(x_0\) has the property that
Let \(\{x_k\}\) be the iterates generated by applying a method in (11) to solve problem (3). Then
and
where c is a nonzero constant that can be determined by the limit in (48).
Proof
By the assumption (40), we know that \(q_0\) satisfies (i)–(iii) of Lemma 1. Notice that \(q_k = T^k q_0\). Then, by Lemma 3, there exists a \(p_*\) such that the sequences \(\{q_{2k}\}\) and \(\{q_{2k+1}\}\) converge to \(p_*\) and \(Tp_*\), respectively, which have only two nonzero components satisfying (38), (39) for some \(i_1,i_2\in \{1,\ldots ,n\}\), and (32) holds. Hence, if \(1< i_1<i_2<n\), we have
and
In addition, since \(q_0^{(1)} > 0 \) and \(q_0^{(n)} > 0\) by (40), we can see from the proof of Lemma 3 that \(q_k^{(1)}>0\), \(q_k^{(n)}>0\) for all \(k\ge 0\). Thus, we have
where the inequality is due to \(\lambda _{i_1}< \gamma (p_*),\gamma (Tp_*) < \lambda _{i_2}\) and \(\lambda _n > \lambda _{i_2}\). So, \(q_{2k}^{(n)}\rightarrow +\infty \), which contradicts (43). Then, we must have \(i_2=n\). In a similar way, we can show that \(i_1=1\). Finally, the equalities in (41) and (42) follow directly from Lemma 2. \(\square \)
In the following, we refer to c as the constant defined in Theorem 1. By Theorem 1 we can directly obtain the asymptotic behavior of the stepsize.
Corollary 1
Under the conditions of Theorem 1, we have
and
where \(\alpha _k\) is defined in (11) and \(\kappa = \lambda _n/\lambda _1\) is the condition number of A. Moreover,
The next corollary interprets the constant c. A special result for the case \(\varPsi (A)=I\) (i.e., the SD method) can be found in Lemma 3.4 of [26].
Corollary 2
Under the conditions of Theorem 1, we have
Proof
It follows from Theorem 1 that
Note that \(1/\lambda _n<\alpha _k<1/\lambda _1\) by assumption (40). Moreover, we have by (16) that
and
Thus, the sequence \(\Big \{\frac{\mu _{2k}^{(n)}}{\mu _{2k}^{(1)}}\Big \}\), and similarly \(\Big \{\frac{\mu _{2k+1}^{(1)}}{\mu _{2k+1}^{(n)}}\Big \}\), does not change sign. Hence, without loss of generality, we can assume by (49) that
Then, by (16), (45) and (50), we have
which gives (48). \(\square \)
We have the following results on the asymptotic convergence of the function value.
Theorem 2
Under the conditions of Theorem 1, we have
where
In addition, if \(\varPsi (\lambda _{n})=\varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\), then \(R_f^1=R_f^2\).
Proof
Let \(\epsilon _k=x_k-x^*\). Since \(g_k=A\epsilon _k\), by (14), we have
By Theorem 1, we only need to consider the case \(\mu _k^{(i)}=0\), \(i=2,\ldots ,n-1\), that is,
Thus,
Since
by the definition of \(\epsilon _k\) and the update rule (2), we further have that
Hence, we obtain
Combining (54) with (55) yields that
which gives (51) by substituting the limits of \((\mu _k^{(1)})^2\) and \(\left( \mu _k^{(n)}\right) ^2\) in Theorem 1.
Notice \(\kappa >1\) by our assumption. So, \(R_f^1=R_f^2\) is equivalent to
which by rearranging terms gives
Hence, \(R_f^1=R_f^2\) holds if \(\varPsi (\lambda _{n})=\varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\). \(\square \)
Remark 1
Theorem 2 indicates that, when \(\varPsi (A)=I\) (i.e., the SD method), the two sequences \(\Big \{\frac{\varDelta _{2k+1}}{\varDelta _{2k}} \Big \}\) and \(\Big \{ \frac{\varDelta _{2k+2}}{\varDelta _{2k+1}} \Big \}\) converge at the same speed, where \(\varDelta _k = f(x_k) - f^*\). Otherwise, the two sequences may converge at different rates.
To illustrate the results in Theorem 2, we applied the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to an instance of (3), where the initial point is the vector of all ones and the matrix A is diagonal with
and \(b=0\). Figure 1 clearly shows the difference between \(R_f^1\) and \(R_f^2\).
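A sketch of this experiment is given below; the eigenvalue spectrum used here is only an assumed stand-in for the diagonal in (56), so the numbers differ from Fig. 1, but the gap between the odd and even ratios shows up in the same way.

```python
import numpy as np

n = 100
lam = np.linspace(1.0, 1e3, n)        # assumed eigenvalues; the true diagonal is (56)
A, b = np.diag(lam), np.zeros(n)
x = np.ones(n)                         # initial point: vector of all ones
f = lambda x: 0.5 * x @ (A @ x)        # f^* = 0 since b = 0

ratios, f_prev = [], f(x)
for k in range(200):
    g = A @ x - b
    Ag = A @ g
    x = x - (g @ Ag) / (Ag @ Ag) * g   # MG stepsize, i.e. Psi(A) = A in (11)
    f_new = f(x)
    ratios.append(f_new / f_prev)      # Delta_{k+1} / Delta_k
    f_prev = f_new

# Asymptotically, ratios[2k] and ratios[2k+1] approach the two limits
# R_f^1 and R_f^2 of Theorem 2.
print(ratios[-2], ratios[-1])
```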
The next theorem shows the asymptotic convergence of the gradient norm.
Theorem 3
Under the conditions of Theorem 1, the following limits hold,
where
In addition, if \(\varPsi (\lambda _{n})=\kappa \varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\), then \(R_g^1=R_g^2\).
Proof
Using the same arguments as in Theorem 2, we have
and
which give that
Thus, (57) follows by substituting the limits of \((\mu _k^{(1)})^2\) and \((\mu _k^{(n)})^2\) in Theorem 1.
Notice \(\kappa >1\) by our assumption. So, \(R_g^1=R_g^2\) is equivalent to
which by rearranging terms gives
Hence, \(R_g^1=R_g^2\) holds if \(\varPsi (\lambda _{n})=\kappa \varPsi (\lambda _{1})\) or \(c^2=\varPsi (\lambda _{1})/\varPsi (\lambda _{n})\). \(\square \)
Remark 2
Theorem 3 indicates that the two sequences \(\Big \{\frac{\Vert g_{2k+1}\Vert ^2}{\Vert g_{2k}\Vert ^2}\Big \}\) and \(\Big \{\frac{\Vert g_{2k+2}\Vert ^2}{\Vert g_{2k+1}\Vert ^2}\Big \}\) generated by the method (11) with \(\varPsi (A)=A\) (i.e., the MG method) converge at the same rate. Otherwise, the two sequences may converge at different rates.
By Theorems 2 and 3, we can obtain the following corollary.
Corollary 3
Under the conditions of Theorem 1, we have
In addition,
Remark 3
Corollary 3 shows that the odd and even subsequences of objective values and gradient norms converge at the same rate. Moreover, we have
where \(t= c^2 \varPsi (\lambda _{n})/\varPsi (\lambda _{1})\). Notice that the right-hand side of (63) depends only on \(\kappa \), which implies that the odd and even subsequences generated by all the gradient methods (11) have the same worst asymptotic rate, independent of \(\varPsi \).
Now, as in [26], we define the minimum deviation
where
Clearly, \(\sigma \in (0,1)\). We now close this section by deriving a bound on the constant c defined in Theorem 1. The following theorem generalizes the results in [1] and [26], where only the case \(\varPsi (A)=I\) (i.e., the SD method) is considered.
Theorem 4
Under the conditions of Theorem 1, and assuming that \(\mathcal {I}\) is nonempty, we have
where
Proof
Let \(p=q_0\). By the definition of T, we have that
It follows from Theorem 1 and Lemma 3 that
By the continuity of T and (37) in Lemma 3, we always have that
which together with (67) and (68) implies that
where \(p_*\) is the same vector as in Lemma 3. Clearly, (69) also holds for \(i=1\). As for \(i=n\), it follows from (33) in Lemma 2 and Theorem 1 that
which yields that
Thus, (69) holds for \(i=1,\ldots ,n\). Hence, we have
where \(\delta =\frac{\lambda _1+\lambda _n}{2}\). By (70) and (71), we obtain
which implies that
By Lemma 2 and Theorem 1, we have that
Substituting \(\gamma (p_*)\) into (72), we obtain
which gives
where \(\sigma _i=\frac{2 \lambda _i-(\lambda _1+\lambda _n)}{\lambda _n-\lambda _1}\). Note that (73) holds for all \(i\in \mathcal {I}\). Thus, we have
which implies (65). This completes the proof. \(\square \)
3 Techniques for Breaking the Zigzagging Pattern
As shown in the previous section, all the gradient methods (11) asymptotically confine their searches to the two-dimensional subspace spanned by \(\xi _1\) and \(\xi _n\). By (16), if either \(\mu _k^{(1)}\) or \(\mu _k^{(n)}\) equals zero, the corresponding component vanishes at all subsequent iterations. Hence, in order to break the undesired zigzagging pattern, a good strategy is to employ a stepsize approximating \(1/\lambda _1\) or \(1/\lambda _n\). In this section, we derive a new stepsize converging to \(1/\lambda _n\) and propose a periodic gradient method that uses this new stepsize.
3.1 A New Stepsize
Our new stepsize is derived by imposing finite termination on the minimization of a two-dimensional strictly convex quadratic function; see [31] for the case \(\varPsi (A)=I\) (i.e., the SD method).
To proceed, consider the quadratic optimization (3) with \(n=2\). Suppose \(x_3\) is the minimizer after the following 3 iterations:
where \(g_i \ne 0\), \(i=0,1,2\), \(\alpha _0\) and \(\alpha _2\) are stepsizes given by (11), and \(\alpha _1\) is a stepsize to be derived.
By the stepsize definition (11), we have
Hence, for any given \(r \in \mathbb {R}\), \(\Big \{\frac{\varPsi ^r(A)g_0}{\Vert \varPsi ^r(A)g_0\Vert }, \frac{\varPsi ^{1-r}(A)g_1}{\Vert \varPsi ^{1-r}(A)g_1\Vert }\Big \}\) forms an orthonormal basis of \(\mathbb {R}^2\). Then, for all \(x\in \mathbb {R}^2\), \(x-x_1\) can be expressed by this basis. Suppose there exist \(t,l\in \mathbb {R}\) such that
We expand f(x) at \(x_1\) and obtain
where
and
where the last equality is due to \(g_0 ^T \varPsi (A)Ag_1=-\frac{g_1 ^T \varPsi (A)g_1}{\alpha _0}\), which follows from \(g_1=g_0-\alpha _0Ag_0\) and (75). Note that \(B ^T B = B B ^T = I\). Then H is positive definite and has the same eigenvalues as A. Consequently, the minimizer of \(\varphi \) is given by \(\begin{pmatrix} t^* \\ l^* \\ \end{pmatrix} =-H^{-1}\vartheta \).
When \(x_3\) is the solution, we must have
Due to the fact that \(x_3 = x_2 - \alpha _2 g_2\), we know \(x_3-x_2\) is parallel to \(g_2\). It follows from (79) and \(x_2 = x_1 - \alpha _1 g_1\) that
which, by noting \(g_2 = g_1 - \alpha _1 Ag_1\) and expressing the two vectors in (80) by the above basis, is equivalent to
Denote the components of \(\vartheta \) by \(\vartheta _i\) and the components of H by \(H_{ij}\), \(i,j=1,2\). By (81), the two vectors
are parallel, where \(\zeta =\text {det}(H) = \text {det}(A) >0\). It follows that
which gives
where
On the other hand, if (82) holds, we have (80) holds, which by (77), \(H^{-1} = B A^{-1} B ^T\) and \(B ^T B=I\) implies that
is parallel to \(g_2\). Hence, \(g_2\) is an eigenvector of A, i.e. \(Ag_2 = \lambda g_2\) for some \(\lambda > 0\), since \(g_2 \ne 0\). So, by (11), we get \(\alpha _2 = \varPsi (\lambda ) g_2 ^T g_2 /(\lambda \varPsi (\lambda ) g_2 ^T g_2 ) = 1/\lambda \). Therefore, \(g_3 = g_2 - \alpha _2 A g_2 = g_2 - \alpha _2 \lambda g_2 = 0\), which implies \(x_3\) is the solution. Thus, (82) guarantees \(x_3\) is the minimizer.
Based on the above analysis, to ensure \(x_3\) is the minimizer, we only need to choose \(\alpha _1\) satisfying
whose two positive roots are
These two roots are \(1/\lambda _1\) and \(1/\lambda _2\), where \(0< \lambda _1 < \lambda _2\) are the two eigenvalues of A. For numerical reasons (see the next subsection), we choose \(\alpha _1\) to be the smaller root \(1/\lambda _2\), which can be calculated as
For the general case, since \(g_{k-1} ^T \varPsi (A)g_{k}=0\), we propose our new stepsize at the k-th iteration as
where \(H_{ij}^k\) denotes the (i, j)-th component of \(H^k\):
and \(\alpha _{k-1}\) is given by (11). Clearly, \(\alpha _k^Y\) in (8) can be obtained by setting \(\varPsi (A)=I\) in (86). In addition, by (85) we have that
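For the special case \(\varPsi (A)=I\), \(\tilde{\alpha }_k\) is exactly Yuan's steplength \(\alpha _k^Y\), which can be assembled from two consecutive SD stepsizes and gradient norms. The sketch below uses the closed form of \(\alpha _k^Y\) as stated in [31] (this explicit form is quoted from that reference, not from the displays above):

```python
import numpy as np

def yuan_stepsize(a_prev, a_curr, gnorm_prev, gnorm_curr):
    """Yuan's steplength [31]: the new stepsize with Psi(A) = I, built from
    two consecutive exact SD stepsizes a_prev, a_curr and the corresponding
    gradient norms gnorm_prev, gnorm_curr."""
    p, c = 1.0 / a_prev, 1.0 / a_curr
    root = np.sqrt((p - c) ** 2 + 4.0 * gnorm_curr ** 2 / (a_prev * gnorm_prev) ** 2)
    return 2.0 / (root + p + c)
```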
3.2 Spectral Property of the New Stepsize
The next theorem shows that the stepsize \(\tilde{\alpha }_k\) enjoys a desirable spectral property.
Theorem 5
Suppose that the conditions of Theorem 1 hold. Let \(\{x_k\}\) be the iterates generated by any gradient method in (11) applied to problem (3). Then
Proof
It follows from (41) and (42) of Theorem 1 that
and
which give
and
Then, by the definition of \(\alpha _k\), we have
which together with (42) in Theorem 1 and (46) in Corollary 1 yields that
Then, from the above equality and (90), we obtain that
Combining (89) and (91), we have that
This completes the proof. \(\square \)
Remark 4
When \(r=1\), we have from (87) that \(\tilde{\alpha }_k \le 1/H_{22}^k = \alpha _k^{SD}\). Hence, using the stepsize \(\tilde{\alpha }_k\) gives a monotone gradient method. Theorem 5 indicates that the general \(\tilde{\alpha }_k\) has the asymptotic spectral property (88) and hence is asymptotically smaller than \(\alpha _k^{SD}\), independent of r. However, a proper choice of r facilitates the calculation of \(\tilde{\alpha }_k\); this will become clearer in the next section.
Using similar arguments, we can also show that the larger stepsize derived in Sect. 3.1 converges to \(1/\lambda _1\).
Theorem 6
Let
Under the conditions of Theorem 5, we have
To present an intuitive illustration of the asymptotic behaviors of \(\tilde{\alpha }_k\) and \(\bar{\alpha }_k\), we applied the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to minimize the quadratic function (3) with
where \(a_1=1\), \(a_n=n\), and \(a_i\) is randomly generated between 1 and n for \(i=2,\ldots ,n-1\). From Fig. 2, we can see that \(\tilde{\alpha }_k\) approximates \(1/\lambda _n\) with satisfactory accuracy within a few iterations. However, \(\bar{\alpha }_{k}\) converges to \(1/\lambda _1\) even more slowly than the gradient norm decreases. This, to some extent, explains why we prefer \(\tilde{\alpha }_k\) to be the short stepsize rather than the long one \(\bar{\alpha }_{k}\).
3.3 A Periodic Gradient Method
A method that alternately uses \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\) to minimize a two-dimensional quadratic function will monotonically decrease the objective value and terminate in 3 iterations. However, for minimizing a general n-dimensional quadratic function, this alternating scheme may not be efficient for the purpose of eliminating the component \(\mu _k^{(n)}\). One possible reason is that, as shown in Fig. 2, it takes tens of iterations for \(\tilde{\alpha }_k\) to approximate \(1/\lambda _n\) with satisfactory accuracy. In what follows, by incorporating the BB method, we develop an efficient periodic gradient method using \(\tilde{\alpha }_k\).
Figure 3 compares the gradient method (11) with \(\varPsi (A)=A\) (i.e., the MG method) to a method that first takes 20 BB2 steps and then switches to MG steps on problem (56). We can see that, by taking some BB2 steps, the modified MG method is accelerated and the stepsize \(\tilde{\alpha }_k\) approximates \(1/\lambda _n\) with better accuracy. Thus, our method runs some BB steps first. Next, we investigate the effect of reusing a short stepsize on the performance of the gradient method (11). Suppose that we have a good approximation of \(1/\lambda _n\), say \(\alpha =\frac{1}{\lambda _n+10^{-6}}\). We compare the MG method with two variants obtained by applying (i) \(\alpha _0=\alpha \) or (ii) \(\alpha _0=\ldots =\alpha _9=\alpha \) before using the MG stepsize. Figure 4 shows that reusing \(\alpha \) accelerates the MG method. Hence, we prefer to reuse \(\tilde{\alpha }_k\) for several consecutive steps once \(\tilde{\alpha }_k\) is a good approximation of \(1/\lambda _n\). Finally, our new method is summarized in Algorithm 1, which periodically applies the BB stepsize, \(\alpha _k\) in (11), and \(\tilde{\alpha }_k\). The R-linear global convergence of Algorithm 1 for solving (3) can be established by showing that it satisfies the property in [5]; see the proof of Theorem 3 in [7], for example.
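The sketch below illustrates one plausible realization of this periodic scheme for the \(\varPsi (A)=I\) variant with the BB1 stepsize: a cycle of \(K_b\) BB1 steps, \(K_m\) SD steps, and \(K_s\) reuses of the most recent Yuan-type short stepsize. The exact switching rule and safeguards of Algorithm 1 are not reproduced here; the short stepsize uses the closed form of \(\alpha _k^Y\) quoted above in the role of \(\tilde{\alpha }_k\).

```python
import numpy as np

def periodic_gradient(A, b, x0, Kb=60, Km=60, Ks=40, tol=1e-9, max_iter=20000):
    """Sketch of one periodic cycle: Kb BB1 steps, Km SD steps (stepsize (11)
    with Psi(A) = I), then Ks reuses of the most recent Yuan-type short step."""
    x = np.asarray(x0, dtype=float).copy()
    g = A @ x - b
    x_old = g_old = None
    a_sd_prev = gn_prev = short = None
    phase, count = 'bb', 0
    for _ in range(max_iter):
        gn = np.linalg.norm(g)
        if gn <= tol:
            break
        a_sd = (g @ g) / (g @ (A @ g))          # exact SD stepsize at x_k
        if phase == 'bb' and g_old is not None:
            s, y = x - x_old, g - g_old
            alpha = (s @ s) / (s @ y)           # BB1 stepsize
        elif phase == 'short' and short is not None:
            alpha = short                       # reuse the short stepsize
        else:
            alpha = a_sd                        # SD step, i.e. stepsize (11) with Psi = I
            if phase == 'sd' and count >= 1:    # previous step was also an SD step
                p, c = 1.0 / a_sd_prev, 1.0 / a_sd
                short = 2.0 / (np.sqrt((p - c) ** 2 + 4.0 * gn ** 2 /
                                       (a_sd_prev * gn_prev) ** 2) + p + c)
        x_old, g_old = x, g
        x = x - alpha * g
        g = A @ x - b
        a_sd_prev, gn_prev = a_sd, gn
        count += 1
        if count >= {'bb': Kb, 'sd': Km, 'short': Ks}[phase]:
            phase = {'bb': 'sd', 'sd': 'short', 'short': 'bb'}[phase]
            count = 0
    return x
```

As a usage note, `periodic_gradient(A, b, x0, Kb=50, Km=60, Ks=10)` corresponds to the parameter triple (50, 60, 10) discussed in the example below.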
Remark 5
The BB steps in Algorithm 1 can employ either the BB1 or the BB2 stepsize in (7). The idea of using short stepsizes to eliminate the component \(\mu _k^{(n)}\) has been investigated in [12, 13, 20]. However, these methods are based on the SD method, i.e., they occasionally apply short steps during the iterates of the SD method. One exception is [21], where a method is developed by employing new stepsizes during the iterates of the AOPT method. In contrast, our method periodically uses three different stepsizes: the BB stepsize, the stepsize (11), and the new stepsize \(\tilde{\alpha }_k\).
The following example shows that both the BB steps and the \(\tilde{\alpha }_k\)-steps are important to the efficiency of our method. In particular, we consider \(\varPsi (A)=I\) for \(\alpha _k\) in (11) and \(\tilde{\alpha }_k\). We applied Algorithm 1 to a 1000-dimensional quadratic problem of the form (92) with \(a_i=11i-10\) (see [11]). The iteration was stopped once the gradient norm was reduced by a factor of \(\epsilon \). Five different values of the parameter triple \((K_b,K_m,K_s)\) were tested. Table 1 presents the average number of iterations, over ten different initial points randomly chosen from \([-10,10]\), required by the method to meet a given tolerance. We see that, when Algorithm 1 takes no BB steps, i.e., with (0, 60, 10), it performs better than with (10, 0, 0) for a tight tolerance. Notice that Algorithm 1 with (10, 0, 0) reduces to the BB method with \(\alpha _k^{BB1}\). If Algorithm 1 takes no \(\tilde{\alpha }_k\)-steps, i.e., with (50, 60, 0), it is comparable to the case (0, 60, 10). However, when both BB steps and \(\tilde{\alpha }_k\)-steps are taken, i.e., with (50, 60, 10), Algorithm 1 clearly outperforms the other choices, especially for a tight tolerance.
4 Numerical Experiments
In this section, we present numerical comparisons of Algorithm 1 and the following methods: BB with \(\alpha _k^{BB1}\) [2], DY [11], ABBmin2 [19], SDC [12], and Alg. 1 in [30] (A1 for short).
Notice that the stepsize rule for Algorithm 1 can be written as
where \(\alpha _k^{BB}\) can be either \(\alpha _k^{BB1}\) or \(\alpha _k^{BB2}\), and \(\alpha _{k}(\varPsi (A))\) and \(\tilde{\alpha }_{k}(\varPsi (A))\) are the stepsizes given by (11) and (85), respectively. We tested the following four variants of Algorithm 1, obtained by combining the two BB stepsizes with \(\varPsi (A)=I\) or \(\varPsi (A)=A\):
- BB1SD: \(\alpha _k^{BB1}\) and \(\varPsi (A)=I\) in (93)
- BB1MG: \(\alpha _k^{BB1}\) and \(\varPsi (A)=A\) in (93)
- BB2SD: \(\alpha _k^{BB2}\) and \(\varPsi (A)=I\) in (93)
- BB2MG: \(\alpha _k^{BB2}\) and \(\varPsi (A)=A\) in (93)
Now we derive a formula for the case \(\varPsi (A)=A\), i.e., \(\alpha _{k}(\varPsi (A))=\alpha _{k}^{MG}\). If we set \(r=0\), by (85), we have
which is expensive to compute directly. However, if we set \(r=1/2\), we get
This formula can be computed without additional cost because \(g_{k-1} ^T A g_{k-1}\) and \(g_k ^T A g_k\) have been obtained when computing the stepsizes \(\alpha _{k-1}^{MG}\) and \(\alpha _k^{MG}\).
All the methods under consideration were implemented in Matlab (v.9.0-R2016a) and run on a PC with an Intel Core i7 2.9 GHz processor and 8 GB of RAM under Windows 10. We stopped an algorithm when the number of iterations exceeded 20,000 or the gradient norm was reduced by a factor of \(\epsilon \).
We tested five different sets of problems of the form (92) with the spectrum distributions described in Table 2; see [21, 30], for example. The rand function in Matlab was used to generate the Hessians of the first three problem sets. To investigate the performance of the compared methods for different values of the Hessian condition number \(\kappa \) and the tolerance factor \(\epsilon \), we set \(\kappa \) to \(10^4, 10^5, 10^6\) and \(\epsilon \) to \(10^{-6}, 10^{-9}, 10^{-12}\), respectively. For each value of \(\kappa \) or \(\epsilon \), average results over 10 different starting points with entries randomly generated in \([-10,10]\) are presented in the following tables.
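As an illustration, one diagonal test problem in the spirit of (92) could be generated as follows (the endpoint values and the scaling to a prescribed \(\kappa \) are assumptions made here for the sketch; the actual spectrum distributions are those listed in Table 2):

```python
import numpy as np

def make_problem(n=1000, kappa=1e4, seed=0):
    """Diagonal test problem: lambda_1 = 1, lambda_n = kappa, interior
    eigenvalues drawn uniformly at random, and a starting point in [-10, 10]^n."""
    rng = np.random.default_rng(seed)
    lam = np.empty(n)
    lam[0], lam[-1] = 1.0, kappa
    lam[1:-1] = rng.uniform(1.0, kappa, n - 2)
    A = np.diag(lam)
    b = np.zeros(n)
    x0 = rng.uniform(-10.0, 10.0, n)
    return A, b, x0
```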
The parameter triple \((K_b,K_m,K_s)\) of our four methods was set to (60, 60, 40) for the first three problem sets and to (100, 60, 50) for the last two. As in [19], the parameter \(\tau \) of the ABBmin2 method was set to 0.9. For the SDC and A1 methods, we chose the parameters to achieve their best performance. In particular, the parameter pair (h, s) of SDC was set to (8, 6) for the first three problem sets and to (30, 2) for the last two, while m of A1 was set to 6.
Table 3 reports the average number of iterations of the compared methods for the five problem sets listed in Table 2. We can see that, for the first problem set, our proposed methods perform much better than the BB, DY, SDC, and A1 methods, while the ABBmin2 method is surprisingly efficient among the compared methods. Similar results can be observed for the fourth problem set: our proposed methods are comparable to ABBmin2 and faster than the other compared methods. Results on the second problem set are slightly different: A1 becomes the winner, and our proposed methods outperform the BB, DY, ABBmin2, and SDC methods. For the third and fifth problem sets, the results show a clear advantage of our proposed methods over the other compared methods, especially when the condition number is large and a tight tolerance is required.
5 Conclusions and Discussions
We have presented a theoretical analysis of the asymptotic behavior of a family of gradient methods whose stepsize is given by (11), which includes the steepest descent and minimal gradient methods as special cases. It is shown that each method in this family asymptotically zigzags in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian. To accelerate the gradient methods, we exploit the spectral property of a new stepsize to break the zigzagging pattern. This new stepsize is derived by imposing finite termination on minimizing two-dimensional strictly convex quadratics and is proved to converge to the reciprocal of the largest eigenvalue of the Hessian in the general n-dimensional case. Finally, we propose a very efficient periodic gradient method that alternately uses the BB stepsize, \(\alpha _k\) in (11), and our new stepsize. Our numerical results indicate that, by exploiting the asymptotic behavior and spectral properties of stepsizes, gradient methods can be greatly accelerated so as to outperform the BB method and other recently developed state-of-the-art gradient methods.
One may also break the zigzagging pattern by employing the spectral property in (47). In particular, one could use the following stepsize
which, by (47), satisfies that
Hence, \( \hat{\alpha }_k\) is also a good approximation of \(1/\lambda _n\) when the condition number \(\kappa =\lambda _n/\lambda _1\) is large. See [13] for the case of the SD method.
References
Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
Dai, Y.H.: Alternate step gradient method. Optimization 52(4–5), 395–415 (2003)
Dai, Y.H., Fletcher, R.: On the asymptotic behaviour of some new gradient methods. Math. Program. 103(3), 541–559 (2005)
Dai, Y.H., Huang, Y., Liu, X.W.: A family of spectral gradient methods for optimization. Comput. Optim. Appl. 74(1), 43–65 (2019)
Dai, Y.H., Liao, L.Z.: \(R\)-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comput. Optim. Appl. 33(1), 73–88 (2006)
Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comput. Optim. Appl. 59(3), 541–563 (2014)
De Asmundis, R., di Serafino, D., Riccio, F., Toraldo, G.: On spectral properties of steepest descent methods. IMA J. Numer. Anal. 33(4), 1416–1435 (2013)
Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6), 1645–1661 (1994)
Fletcher, R.: On the Barzilai-Borwein method. In: Optimization and Control with Applications, pp. 235–256. Springer, Boston (2005)
Forsythe, G.E.: On the asymptotic directions of the s-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299 (2008)
Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comput. Optim. Appl. 63(2), 523–542 (2016)
Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. Optim. Method Softw. 35(4), 681–705 (2020)
Huang, Y., Liu, H.: Smoothing projected Barzilai-Borwein method for constrained non-Lipschitz optimization. Comput. Optim. Appl. 65(3), 671–698 (2016)
Huang, Y., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai-Borwein method for nonnegative matrix factorization. Data Min. Knowl. Disc. 29(6), 1665–1684 (2015)
Jiang, B., Dai, Y.H.: Feasible Barzilai-Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Method Softw. 28(4), 756–784 (2013)
Liu, Y.F., Dai, Y.H., Luo, Z.Q.: Coordinated beamforming for MISO interference channel: complexity analysis and efficient algorithms. IEEE Trans. Signal Process. 59(3), 1142–1157 (2011)
Nocedal, J., Sartenaer, A., Zhu, C.: On the behavior of the gradient norm in the steepest descent method. Comput. Optim. Appl. 22(1), 5–35 (2002)
Pronzato, L., Wynn, H.P., Zhigljavsky, A.A.: Asymptotic behaviour of a family of gradient algorithms in \(R^d\) and Hilbert spaces. Math. Program. 107(3), 409–438 (2006)
Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Sun, C., Liu, J.P.: New stepsizes for the gradient method. Optim. Lett. 14, 1943–1955 (2020)
Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
Yuan, Y.X.: Step-sizes for the gradient method. AMS IP Stud. Adv. Math. 42(2), 785–796 (2008)
Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comput. Optim. Appl. 35(1), 69–86 (2006)
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 11701137, 11631013, 12071108, 11671116, 11991021, 12021001), the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA27000000), Beijing Academy of Artificial Intelligence (BAAI), Natural Science Foundation of Hebei Province (Grant No. A2021202010), China Scholarship Council (Grant No. 201806705007), and USA National Science Foundation (Grant Nos. DMS-1819161, DMS-2110722).