Abstract
Using Taylor’s series, we propose a modified secant relation to obtain a more accurate approximation of the second curvature of the objective function. Then, using this relation and an approach introduced by Dai and Liao, we present a conjugate gradient algorithm for solving unconstrained optimization problems. The proposed method makes use of both gradient and function values, and utilizes information from the two most recent steps, while the usual secant relation uses only the latest step information. Under appropriate conditions, we show that the proposed method is globally convergent without needing a convexity assumption on the objective function. Comparative results show the computational efficiency of the proposed method in the sense of the Dolan–Moré performance profiles.
1 Introduction
Consider the unconstrained nonlinear optimization problem:

\(\min _{x\in \mathbb {R}^n} f(x),\)

where \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) is twice continuously differentiable.
The conjugate gradient (CG) methods are popular for solving large-scale unconstrained minimization problems, since they avoid storing matrices. One step of these iterative methods is defined as follows:

\(x_{k+1}=x_{k}+\alpha _k d_{k}, \quad k=0,1,\ldots ,\)    (1)

where \(\alpha _k>0\) is a step length and \(d_k\) is the search direction defined by:

\(d_{0}=-g_{0}, \qquad d_{k+1}=-g_{k+1}+\beta _k d_{k},\)    (2)

where \(g_{k} = \nabla f (x_{k})\) and \(\beta _k \) is a scalar called the CG parameter. Different choices for the CG parameter in (2) lead to different CG methods.
The step length \(\alpha _k\) is determined by an exact or inexact line search technique (Nocedal and Wright 2006). Throughout this work, we determine the step length by the strong Wolfe conditions (Nocedal and Wright 2006):

\(f(x_{k}+\alpha _k d_{k}) \le f(x_{k})+\sigma _1 \alpha _k g_{k}^{T}d_{k},\)

\(|g(x_{k}+\alpha _k d_{k})^{T}d_{k}| \le \sigma _2 |g_{k}^{T}d_{k}|,\)

with \(0<\sigma _1<\sigma _2<1.\)
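As an illustration, the strong Wolfe conditions are straightforward to check numerically. The sketch below is ours (the helper name and the quadratic test function are not from the paper); it verifies both inequalities for a trial step length:

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, d, alpha, sigma1=0.001, sigma2=0.1):
    """Check the strong Wolfe conditions for a trial step length alpha:
       sufficient decrease: f(x + a d) <= f(x) + sigma1 * a * g(x)^T d
       curvature:           |g(x + a d)^T d| <= sigma2 * |g(x)^T d|"""
    gd = grad(x) @ d
    x_new = x + alpha * d
    sufficient_decrease = f(x_new) <= f(x) + sigma1 * alpha * gd
    curvature = abs(grad(x_new) @ d) <= sigma2 * abs(gd)
    return bool(sufficient_decrease and curvature)

# Quadratic test function f(x) = 0.5 x^T x with the steepest-descent
# direction; the exact minimizer along d is reached at alpha = 1.
f = lambda x: 0.5 * x @ x
grad = lambda x: x.copy()
x0 = np.array([1.0, -2.0])
d = -grad(x0)
print(satisfies_strong_wolfe(f, grad, x0, d, alpha=1.0))   # True
print(satisfies_strong_wolfe(f, grad, x0, d, alpha=10.0))  # False
```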
Well-known CG methods include those proposed by Hestenes and Stiefel (HS) (1952), Fletcher and Reeves (FR) (1964), and Polak and Ribière (PR) (Polak and Ribiére 1969; Polyak 1969).
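The classical parameters of these methods are, respectively, \(\beta ^{FR}_k = \Vert g_{k+1}\Vert ^2/\Vert g_k\Vert ^2\), \(\beta ^{PR}_k = g_{k+1}^{T}y_k/\Vert g_k\Vert ^2\), and \(\beta ^{HS}_k = g_{k+1}^{T}y_k/(d_k^{T}y_k)\). A minimal sketch (helper names are ours):

```python
import numpy as np

def beta_fr(g_new, g_old):
    # Fletcher-Reeves: ||g_{k+1}||^2 / ||g_k||^2
    return (g_new @ g_new) / (g_old @ g_old)

def beta_pr(g_new, g_old):
    # Polak-Ribiere: g_{k+1}^T y_k / ||g_k||^2, with y_k = g_{k+1} - g_k
    y = g_new - g_old
    return (g_new @ y) / (g_old @ g_old)

def beta_hs(g_new, g_old, d_old):
    # Hestenes-Stiefel: g_{k+1}^T y_k / (d_k^T y_k)
    y = g_new - g_old
    return (g_new @ y) / (d_old @ y)

g_old = np.array([2.0, 0.0])
g_new = np.array([1.0, 1.0])
d_old = np.array([-2.0, 0.0])
print(beta_fr(g_new, g_old))  # 0.5
print(beta_pr(g_new, g_old))  # 0.0
```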
Zoutendijk (1970) established the global convergence of the FR method with the exact line search, and Al-Baali (1985) extended this result to inexact line searches. However, Powell (1977) showed that the FR method can perform poorly in practice. Conversely, although the PR and HS methods work very well in practice, their global convergence is not guaranteed. To overcome this problem, Powell (1984) suggested a modified conjugate parameter.
Recently, Dai and Liao (2001) proposed formulae for \(\beta _k\) by exploiting a conjugacy condition based on the standard secant relation:

\(B_{k+1}s_{k}=y_{k},\)    (3)

where \( s_{k} = x_{k+1} -x_{k}\) and \(y_k=g_{k+1}-g_{k}\).
The usual secant relation employs only the gradients; the available function values are ignored. To approximate the Hessian of the objective function with higher order accuracy, several researchers have modified the usual secant relation (3) to make full use of both gradient and function values (see Wei et al. 2006; Yuan and Wei 2010; Yabe and Takano 2004; Zhang and Xu 2001). Wei et al. (2006), using Taylor’s series, modified (3) as follows:

\(B_{k+1}s_{k}=y_{k}+\frac{\vartheta _{k}}{\Vert s_k\Vert ^{2}} s_k,\)    (4)

where \(\vartheta _{k}=2 (f_{k}-f_{k+1})+ (g_{k}+g_{k+1})^T s_{k}.\) Recently, Yuan and Wei (2010) extended the modified secant relation (4) as follows:

\(B_{k+1}s_{k}={\tilde{y}}_k,\)
where \({\tilde{y}}_k =y_k + \frac{\max (\vartheta _k , 0)}{\Vert s_k\Vert ^{2}} s_k\).
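Since \(\vartheta _k\) measures the deviation of f from a quadratic along \(s_k\) (it vanishes identically when f is quadratic), \({\tilde{y}}_k\) reduces to \(y_k\) in the quadratic case. A small sketch of this correction (the function name is ours):

```python
import numpy as np

def modified_y(f_old, f_new, g_old, g_new, s):
    """Yuan-Wei modified difference: y_tilde = y + max(theta, 0)/||s||^2 * s,
    with theta = 2 (f_k - f_{k+1}) + (g_k + g_{k+1})^T s_k."""
    y = g_new - g_old
    theta = 2.0 * (f_old - f_new) + (g_old + g_new) @ s
    return y + (max(theta, 0.0) / (s @ s)) * s

# For a quadratic such as f(x) = 0.5 x^T x (gradient g(x) = x),
# theta = 0 and y_tilde coincides with y = g_{k+1} - g_k.
x_old, x_new = np.array([1.0, 2.0]), np.array([0.5, 1.0])
y_tilde = modified_y(0.5 * x_old @ x_old, 0.5 * x_new @ x_new,
                     x_old, x_new, x_new - x_old)
print(y_tilde)  # equals g_new - g_old = [-0.5, -1.0]
```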
Such modified secant relations use the available gradient and function values only at the latest point. Ford and Moghrabi (1993, 1994, 1997) employed interpolating polynomials and introduced a different secant relation utilizing information from the three most recent points.
Ford and Moghrabi (1993), by constructing interpolating quadratic curves for x(t), modified the secant relation (3) as follows:
where
Notice that \(t_0, t_1\) and \(t_2\) are chosen such that:
We observe that the modified secant relation (6) utilizes information from the three most recent iterates, but does not utilize function values. Here, we use Taylor’s series to construct alternative estimates of the secant relation, using both the available gradient and function values. Moreover, we utilize information from the three most recent points instead of merely the last two. We then combine the new secant relation with the ideas of Dai and Liao (2001) and Moghrabi (2019) to obtain a new formula for \(\beta _k\) (see Eq. (48)).
The rest of our work is organized as follows. In Sect. 2, we briefly discuss the conjugacy condition proposed by Dai and Liao. In Sect. 3, we use Taylor’s series to construct alternative estimates of the secant relation and by considering this relation we introduce a new formula for CG parameter. In Sect. 4, we investigate the convergence of our proposed method. In Sect. 5, we report numerical results. We conclude in Sect. 6.
2 Dai–Liao’s nonlinear CG methods
Here, we briefly review the Dai–Liao conjugacy condition. These methods are based on the standard secant equation:
where \(B_k\) is an approximation of the Hessian of f at \(x_k\).
The search direction \(d_k\) is a quasi-Newton direction obtained by solving:

\(B_{k}d_{k}=-g_{k}.\)
It is well known that the search directions generated by the linear CG methods satisfy the following conjugacy conditions:

\(d_{i}^{T}Qd_{j}=0, \quad i\ne j,\)

where Q is the positive definite Hessian matrix of the quadratic objective function.
For a general nonlinear function f, the mean value theorem yields a \(\tau \in (0, 1)\), such that:

\(d_{k+1}^{T}y_{k}=\alpha _k d_{k+1}^{T}\nabla ^2 f(x_{k}+\tau \alpha _k d_{k})d_{k}.\)

Hence, it is reasonable to replace (11) with:

\(d_{k+1}^{T}y_{k}=0.\)    (12)
Dai and Liao (2001) modified the conjugacy condition (12) as follows:

\(d_{k+1}^{T}y_{k}=-t g_{k+1}^{T}s_{k},\)    (13)

where t is a scalar. Then, using (2) and (13), they obtained the following new formula for \(\beta _{k}\):

\(\beta _{k}^{DL}=\frac{g_{k+1}^{T}y_{k}-t g_{k+1}^{T}s_{k}}{d_{k}^{T}y_{k}}.\)
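The Dai–Liao parameter can be written compactly as \(\beta _k = g_{k+1}^{T}(y_k - t s_k)/(d_k^{T}y_k)\); for \(t=0\) it reduces to the HS parameter. A sketch (the function name is ours):

```python
import numpy as np

def beta_dl(g_new, y, s, d, t):
    """Dai-Liao parameter: beta = g_{k+1}^T (y_k - t * s_k) / (d_k^T y_k).
    With t = 0 this reduces to the Hestenes-Stiefel parameter."""
    return (g_new @ (y - t * s)) / (d @ y)

g_new = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
s = d = np.array([-2.0, 0.0])
print(beta_dl(g_new, y, s, d, t=0.0))  # HS value here: 0.0
print(beta_dl(g_new, y, s, d, t=1.0))  # 1.0
```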
Recently, several CG methods based on other secant relations have been investigated (see Babaie-Kafaki et al. 2010; Li et al. 2007; Yabe and Takano 2004). In addition, using the multi-step secant relation (6), Ford et al. (2008) derived the following formula for \(\beta _{k}\):
3 Two-step conjugate gradient method
Here, we use Taylor’s series to construct alternative estimates of the secant relation to approximate the second curvature of the objective function with a high accuracy. Then, we propose a CG parameter by considering a new secant relation and adapting the idea of Dai and Liao (2001).
3.1 A modified secant relation
Here, we intend to make use of the three iterates \(x_{k-1}\), \(x_{k}\), and \(x_{k+1}\) generated by some quasi-Newton algorithm. Let
where x(t) is a differentiable curve in \(\mathbb {R}^n.\) Since f is sufficiently smooth, so is \(\varphi (t)\).
Applying the chain rule to \(\varphi (t)\), we have:
and
Now, suppose that x(t) is the interpolating quadratic polynomial, so that:
Of course, there are various choices for x(t). Using Lagrange interpolation, we have:
where \(L_j(t), \quad j=0, 1, 2,\) are the basic Lagrange polynomials:
Taking the derivative of (17), we obtain:
On the other hand, we know:
Using (19), we can write (18) as:
Moreover, by taking the derivative of (20), we have:
Clearly, we can represent (20) and (21) as follows:
and
where \(\psi \) and \(\delta \) are given by (7).
To determine \(\delta \) and \(\psi \) in (7), the values \(t_j,\ j=0, 1, 2,\) are needed. For this purpose, we consider one of the successful approaches introduced by Ford and Moghrabi (1993). Consider a norm of the general form:

\(\Vert z\Vert _{M}=(z^{T}Mz)^{1/2},\)

where M is some symmetric positive definite matrix. By choosing \(t_2=0\) and taking M to be the identity matrix, \(t_1\) and \(t_0\) can be computed as follows:
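Since Eq. (24) is not reproduced above, the sketch below assumes the accumulated-chord choice common in the multi-step literature for M = I, namely \(t_2=0\), \(t_1=-\Vert s_k\Vert \), and \(t_0=t_1-\Vert s_{k-1}\Vert \); treat these values as an assumption, not the paper's exact formula:

```python
import numpy as np

def interpolation_parameters(s_k, s_km1):
    """Euclidean-metric (M = I) abscissae with t_2 = 0. The values
    t_1 = -||s_k|| and t_0 = t_1 - ||s_{k-1}|| are an assumed
    accumulated-chord choice; note t_0 < t_1 < t_2, consistent with
    delta > 0 and eta_1 = (t_0 + t_1)/t_0 > 0 as used in Lemma 1."""
    t2 = 0.0
    t1 = -np.linalg.norm(s_k)
    t0 = t1 - np.linalg.norm(s_km1)
    return t0, t1, t2

t0, t1, t2 = interpolation_parameters(np.array([3.0, 4.0]), np.array([0.0, 1.0]))
print(t0, t1, t2)  # -6.0 -5.0 0.0
```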
Using Taylor’s expansion of the function \(\varphi \),
that is:
Substituting \(t =t_1\) in (26), and using (15), (16), and the facts that \(\varphi (t_2)=f_{k+1}\) and \(\varphi (t_1)=f_{k}\), we obtain:
that is:
By considering \(\{t_j\}_{j=0}^{2},\) as (24), the relations (22) and (23) can be written as:
and
where \(\psi \) and \(\delta \) are defined by (7), and \(\eta _1\) is given by:
This implies that:
where
Now, using (28) and (29), the fact that \(y_{k}=g_{k+1}-g_{k}\), and considering \(B_{k+1}\) as a new approximation of the Hessian \(G_{k+1}=\nabla ^2 f(x_{k+1})\), the relation (27) implies that:
with
and
Equation (30) provides a new modified secant relation as follows:
where \({\overline{s}}_k\) is given by (31) and \({\overline{y}}_k\) defined by:
We note that this new secant relation contains information from the three most recent points, whereas the usual secant relation uses information merely from the two latest points. Moreover, both the available gradient and function values are utilized, while the modified secant relation (6) of Ford and Moghrabi uses only gradient information from the three most recent points and ignores the available function values.
A merit of the new modified secant method can be seen from the following theorem.
Theorem 1
Suppose that the function f is sufficiently smooth. If \(\Vert s_k \Vert \) is small enough, then we have:
and
where
Proof
The result follows straightforwardly from (25). \(\square \)
Theorem 1 implies that the quantity \({\overline{s}}_{k}^{T}{\overline{y}}_{k}\) generated by the modified secant method approximates the second curvature \( {\overline{s}}_{k}^{T}G_{k+1} {\overline{s}}_{k}\) with a higher accuracy than the quantity \(r_{k}^{T}w_{k}\) does.
Remark 1
It is easy to see that:
This relation implies that:
Moreover:
Here, we consider some common assumptions.
Assumption A
The level set \(D = \{x \mid f(x) \le f(x_0)\}\) is bounded.
Assumption B
The function f is continuously differentiable on D, and there is a constant \(L \ge 0\), such that, for all \( x, y \in D\), \(\Vert g(x) - g(y) \Vert \le L \Vert x - y \Vert \).
It is clear that assumptions A and B imply that there is a positive constant \(\gamma \), such that:
Lemma 1
Suppose that Assumption A holds and \({\overline{s}}_k\) is defined by (31). Then, we have:
Proof
By the definitions of \(t_0, t_1\) and \(t_2\), we can get:
Since \(\delta >0\) and \(\eta _{1} =\frac{t_0+t_1}{t_0}>0,\) we can deduce the following inequalities:
Now, using (31), (38), and (41), we have:
Moreover:
Hence, (42) and (43) give the desired result (40). \(\square \)
3.2 A new formula for \(\beta _k\)
Here, considering the approach proposed by Dai and Liao, and using our modified secant relation given by (32), we propose a new conjugacy relation:
Using the conjugacy relation (44), we present a new formula for \(\beta _k\):
For a general function f, the denominator \(d_{k-1}^{T} {{\overline{y}}}_{k-1}\) in (45) may become zero. To overcome this difficulty, we modify (32) as follows:
with
where C is a positive constant.
Now, using the secant relation (46), we obtain another CG parameter as follows:
From (38) and (47), it is not difficult to see that:
We can now give a new CG algorithm using our proposed CG parameter given by (48) as follows.
Algorithm 1: New conjugate gradient method
Step 1: Choose \( \varepsilon >0 \) as a tolerance for the stopping criterion, constants \( C>0 \), \( \lambda > 0 \), \(\sigma _1 \in (0,1)\), and \(\sigma _2\in (\sigma _1,1)\), and a starting point \(x_0 \in \mathbb {R}^n.\) Set \(\alpha _{initial}=1\), \(d_0=-g_0,\) and \(k = 0\).
Step 2: If \(\Vert g_k \Vert <\varepsilon \) then stop.
Step 3: Compute the step length \(\alpha _k\) to satisfy the strong Wolfe conditions with the initial step length set to \(\alpha _{initial}\).
Step 4: Set \(x_{k+1} = x_k + \alpha _k d_k.\)
Step 5: Compute \({\overline{s}}_k\) and \(y_{k}^{*}\) by (31) and (47), respectively. Then, calculate \(\beta _{k} \) by:
Step 6: Compute \(d_{k+1}\) by \(d_{k+1}=-g_{k+1}+\beta _kd_k\).
Step 7: Compute the initial step length by:
Step 8: Set \(k = k + 1\) and go to Step 2.
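Since the paper's Eq. (48) is not reproduced above, the following sketch of the overall CG loop substitutes a Hestenes–Stiefel parameter and an exact line search on a quadratic model problem; it illustrates the iteration (1)–(2) and the structure of Algorithm 1, not the exact method:

```python
import numpy as np

def cg_minimize_quadratic(A, b, x0, tol=1e-8, max_iter=200):
    """Sketch of the CG iteration x_{k+1} = x_k + alpha_k d_k,
    d_{k+1} = -g_{k+1} + beta_k d_k on f(x) = 0.5 x^T A x - b^T x
    (A symmetric positive definite). The exact quadratic step
    alpha = -g^T d / (d^T A d) stands in for the strong Wolfe search,
    and a Hestenes-Stiefel beta stands in for the paper's Eq. (48)."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                          # gradient of the quadratic
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:        # Step 2: stopping criterion
            break
        alpha = -(g @ d) / (d @ (A @ d))   # Step 3: line search (exact here)
        x = x + alpha * d                  # Step 4: update the iterate
        g_new = A @ x - b
        y = g_new - g
        beta = (g_new @ y) / (d @ y)       # Step 5: CG parameter (HS stand-in)
        d = -g_new + beta * d              # Step 6: new search direction
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = cg_minimize_quadratic(A, b, np.zeros(2))
print(np.allclose(x_star, np.linalg.solve(A, b)))  # True
```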
In the next section, we proceed to establish the global convergence of Algorithm 1.
4 Convergence analysis
First, we consider the following definition.
Definition 1
For the search directions \(\{d_k\}\), we say that the descent condition holds if and only if:

\(g_{k}^{T}d_{k}<0, \quad \forall k.\)    (49)
To establish convergence of Algorithm 1, we first provide some lemmas.
Lemma 2
Let f satisfy Assumptions A and B. For \({\overline{y}}_{k}\) defined by (33), we have:
where \(M_1 > 0\) is a constant.
Proof
Using Assumptions A and B and Remark 1, we obtain:
where \(\theta \in [0, 1]\). Thus, we can deduce that there exists a constant \(a_1>0\), such that:
Also, from the definition of \({\overline{y}}_k\) and relation (40), we have:
Therefore, the inequality (50) holds with \(M_1=\frac{3}{2}L+6a_1.\)\(\square \)
Corollary 1
Let f satisfy Assumptions A and B, and let \(y^{*}_{k}\) be defined by (47). Then, there exists a constant \(M_2>0\), such that:
Proof
Considering Lemma 2 and Assumptions A and B, we have:
which implies (51) with \(M_2=(2M_1 +Cm_2\gamma ) \Vert {s}_{k}\Vert .\)\(\square \)
For any CG method using the strong Wolfe line search conditions, the following general result was established by Dai et al. (1999).
Lemma 3
(Dai et al. 1999) Let Assumptions A and B hold. Consider any CG method of the form (1)–(2), where \(d_k\) is a descent direction and \(\alpha _{k}\) is computed using the strong Wolfe line search conditions. If

\(\sum _{k\ge 0}\frac{1}{\Vert d_{k}\Vert ^{2}}=\infty ,\)    (52)

then we have:

\(\liminf _{k\rightarrow \infty }\Vert g_{k}\Vert =0.\)
Next, we establish the global convergence of Algorithm 1.
Theorem 2
Let f satisfy Assumptions A and B, and let \(\{x_k\}\) be generated by Algorithm 1. Then, we have:

\(\liminf _{k\rightarrow \infty }\Vert g_{k}\Vert =0.\)
Proof
In view of Lemma 3, it suffices to show that (52) holds.
From (49), we have:
On the other hand, using (40), we have:
Consequently, from (2), (54) and (56), we have:
Now, summing up over k, we get:
Thus, by Lemma 3, the proof is complete. \(\square \)
5 Numerical results
Numerical results presented by Ford et al. (2008) show that the CG method with \({\beta }^{FNY}_k\) as given by (14) performs better than the DL method of Dai and Liao (2001), the YT method of Yabe and Takano (2004), and the HS method of Hestenes and Stiefel (1952). Thus, we compare the performance of our CG method, denoted M1, with the aforementioned CG method of Ford et al. (2008), denoted M2. Our results are reported for the following two algorithms:
M1: Our proposed method (Algorithm 1).
M2: The CG method of Ford et al. (2008) (Algorithm 1 with \( \beta ^{FNY}_k\) as given by (14)).
Since CG methods are mainly designed for large-scale problems, our experiments were run on a set of 75 test problems from the CUTEst library (Gould et al. 2015) with minimum dimension 1000. A summary of these problems is given in Table 1.
All the codes were written in Matlab 2011 and run on a PC with an Intel(R) Core(TM) i5-4200 3.6 GHz CPU, 4 GB of RAM, and the CentOS 6.2 server Linux operating system. The algorithms stop when
or the total number of iterations exceeds 3000; in the latter case, the algorithm was terminated and considered to have failed on that problem.
For both algorithms, the step length \(\alpha _k\) is computed using the strong Wolfe conditions with \(\sigma _1 =0.001\) and \(\sigma _2 =0.1\), using the line search procedure of Moré and Thuente (1994). Moreover, we set \(C=10^{-6}\) in the relation (47).
It is worth noting that the numerical performance of the DL method depends strongly on the choice of the parameter t, and recent work has proposed suitable selections of t. Here, we choose the parameter t as follows (Babaie-Kafaki and Ghanbari 2014):
We used the performance profiles of Dolan and Moré (2002) to evaluate the performance of these two algorithms with respect to CPU time, the number of iterations, and the total number of function and gradient evaluations computed as:
where \(N_{\text {f}}\) and \(N_{\text {g}}\), respectively, denote the number of function and gradient evaluations. Figures 1, 2, and 3 demonstrate the results of the comparisons. From these figures, it is clearly observed that Algorithm 1 (M1) is more efficient as compared to the method of Ford et al. (2008).
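For completeness, the Dolan–Moré profile \(\rho _s(\tau )\) is the fraction of problems on which solver s's cost is within a factor \(\tau \) of the best cost over all solvers. A sketch (the function name is ours; np.inf marks failures):

```python
import numpy as np

def performance_profile(costs, taus):
    """Dolan-More performance profile. costs is an (n_problems, n_solvers)
    array of performance measures (e.g. CPU time), with np.inf for failures.
    Returns rho[t, s]: the fraction of problems on which solver s is within
    a factor taus[t] of the best solver on that problem."""
    best = costs.min(axis=1, keepdims=True)        # best cost per problem
    ratios = costs / best                          # performance ratios
    return np.array([[np.mean(ratios[:, s] <= tau)
                      for s in range(costs.shape[1])] for tau in taus])

# Two solvers on three problems; solver 1 fails on the third problem.
costs = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [1.0, np.inf]])
prof = performance_profile(costs, taus=[1.0, 2.0])
print(prof)
```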
6 Conclusion
We presented a secant relation for unconstrained optimization problems that approximates the second curvature of the objective function with high accuracy. Then, based on the proposed secant relation and adapting the approach of Dai and Liao, we presented a conjugate gradient method for solving unconstrained optimization problems. An interesting feature of the proposed method is that it takes both gradient and function values into account. Another important property is the utilization of information from the three most recent iterates instead of the last two. Under suitable assumptions, we established the global convergence of the proposed method without a convexity assumption on the objective function. Numerical results on the CUTEst collection show that the proposed method is efficient compared with the conjugate gradient method proposed by Ford et al. (2008).
References
Al-Baali M (1985) Descent property and global convergence of the Fletcher–Reeves method with inexact line search. IMA J. Numer. Anal. 5:121–124
Babaie-Kafaki S, Ghanbari R (2014) The Dai-Liao nonlinear conjugate gradient method with optimal parameter choices. Eur. J. Oper. Res. 234:625–630
Babaie-Kafaki S, Ghanbari R, Mahdavi-Amiri N (2010) Two new conjugate gradient methods based on modified secant relations. J. Comput. Appl. Math. 234:1374–1386
Dai YH, Han JY, Liu GH, Sun DF, Yin HX, Yuan YX (1999) Convergence properties of nonlinear conjugate gradient methods. SIAM J. Optim. 10:348–358
Dai YH, Liao LZ (2001) New conjugacy conditions and related nonlinear conjugate gradient methods. Appl. Math. Optim. 43:87–101
Dolan ED, Moré JJ (2002) Benchmarking optimization software with performance profiles. Math. Program. 91:201–213
Fletcher R, Reeves CM (1964) Function minimization by conjugate gradients. Comput. J. 7:149–154
Ford JA, Moghrabi IA (1997) Alternating multi-step quasi-Newton methods for unconstrained optimization. J. Comput. Appl. Math. 82:105–116
Ford JA, Moghrabi IA (1993) Alternative parameter choices for multi-step quasi-Newton methods. Optim. Methods Softw. 2:357–370
Ford JA, Moghrabi IA (1994) Multi-step quasi-Newton methods for optimization. J. Comput. Appl. Math. 50:305–323
Ford JA, Narushima Y, Yabe H (2008) Multi-step nonlinear conjugate gradient methods for unconstrained minimization. Comput. Opt. Appl. 40:191–216
Gould NI, Orban D, Toint PhL (2015) CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Opt. Appl. 60:545–557
Hestenes MR, Stiefel EL (1952) Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stand. 49:409–436
Li G, Tang C, Wei Z (2007) New conjugacy condition and related new conjugate gradient methods for unconstrained optimization. J. Comput. Appl. Math. 202:523–539
Moré JJ, Thuente DJ (1994) Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw. 20:286–307
Moghrabi IA (2019) A new preconditioned conjugate gradient method for optimization. IAENG Int. J. Appl. Math. 49(1):1–8
Nocedal J, Wright SJ (2006) Numerical Optimization. Springer, New York
Polak E, Ribiére G (1969) Note sur la convergence de méthodes de directions conjuguées. Rev. Française Informat. Recherche Opérationnelle 16:35–43
Polyak BT (1969) The conjugate gradient method in extreme problems. USSR Comput. Math. Math. Phys. 9:94–112
Powell MJD (1977) Restart procedures for the conjugate gradient method. Math. Program. 12:241–254
Powell MJD (1984) Nonconvex minimization calculations and the conjugate gradient method, Numerical Analysis (Dundee, 1983) Lecture Notes in Mathematics, vol 1066. Springer, Berlin, pp 122–141
Wei Z, Li G, Qi L (2006) New quasi-Newton methods for unconstrained optimization problems. Appl. Math. Comput. 175:1156–1188
Yuan G, Wei Z (2010) Convergence analysis of a modified BFGS method on convex minimizations. Comput. Opt. Appl. 47:237–255
Yabe H, Takano M (2004) Global convergence properties of nonlinear conjugate gradient methods with modified secant relation. Comput. Optim. Appl. 28:203–225
Zhang JZ, Xu CX (2001) Properties and numerical performance of quasi-Newton methods with modified quasi-Newton equation. J. Comput. Appl. Math. 137:269–278
Zoutendijk G (1970) Nonlinear programming, computational methods. In: Abadie J (ed) Integer and nonlinear programming. North-holland, Amsterdam, pp 37–86
Communicated by Andreas Fischer.
Dehghani, R., Bidabadi, N. Two-step conjugate gradient method for unconstrained optimization. Comp. Appl. Math. 39, 241 (2020). https://doi.org/10.1007/s40314-020-01297-2