1 Introduction

Consider the following unconstrained optimization problem:

$$\begin{aligned} \mathop {\min }\nolimits _{x\in R^{n}} f(x), \end{aligned}$$
(1)

in which \(f:R^{n}\rightarrow R\) is a twice continuously differentiable function. Trust region and line search methods are two popular approaches in the literature for solving such optimization problems. Line search methods compute a steplength \(\alpha _k \) along a search direction \(d_k \) at the current point \(x_k \) and generate a new point as \(x_{k+1} =x_k +\alpha _k d_k \). It is well known that, in these methods, one may require the Hessian approximation matrix to be positive definite in order to ensure that the search direction is a descent direction, see e.g. [17, 18].
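For illustration only, a generic backtracking (Armijo) line search of this type might be sketched as follows; the sufficient-decrease constant `c` and the shrinking factor `rho` are illustrative choices and are not the rules used later in this paper.

```python
import numpy as np

def backtracking_line_search(f, grad, x, d, c=1e-4, rho=0.5, alpha0=1.0):
    """Generic Armijo backtracking: shrink alpha until a sufficient decrease holds."""
    alpha = alpha0
    fx, gx = f(x), grad(x)
    # d is assumed to be a descent direction, i.e. grad(x)^T d < 0
    while f(x + alpha * d) > fx + c * alpha * gx.dot(d):
        alpha *= rho
    return x + alpha * d, alpha

# Tiny example on f(x) = x^T x with the steepest-descent direction d = -g
f = lambda x: x.dot(x)
grad = lambda x: 2.0 * x
x0 = np.array([3.0, -4.0])
x1, alpha = backtracking_line_search(f, grad, x0, -grad(x0))
print(alpha, x1)
```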

Trust region (TR) methods are another class of iterative methods for solving unconstrained optimization problems [5]. These methods generate a sequence of points that converges to a point in which the first and second-order necessary conditions hold under some mild assumptions. Moreover, they can be applied to ill-conditioned problems, have strong global convergence properties and do not require the Hessian approximation to be positive definite. Due to their strong convergence property and robustness, they have been intensively studied in the literature [5, 17]. In the classical TR methods, at the current point \(x_k\), a trial step \(d_k\) is computed by approximately solving the following TR subproblem:

$$\begin{aligned}&{\min m_k \left( d \right) =g_k^T d+\,\frac{1}{2}d^{T}B_k d} \nonumber \\&s.t. \quad \parallel d\parallel \le \Delta _k , \end{aligned}$$
(2)

where \(\Vert \cdot \Vert \) can be an arbitrary vector norm, usually the Euclidean norm, \(g_k=\nabla f({x_k })\), \(B_k\) is the exact Hessian, i.e. \(\nabla ^{2}f({x_k })\), or a symmetric approximation of it, and \(\Delta _k \) is the TR radius. Then, the agreement between the actual and the predicted reductions is measured by the so-called TR ratio \(r_k \), which is defined by:

$$\begin{aligned} r_k =\frac{Ared_k }{Pred_k }\!, \end{aligned}$$

where the actual reduction \(Ared_k \) and the predicted reduction \(Pred_k \) are given by:

$$\begin{aligned} Ared_k&:= f({x_k })-f\left( {x_k +d_k } \right) \!, \end{aligned}$$
(3)
$$\begin{aligned} Pred_k&:= m_k \left( 0 \right) -m_k (d_k ). \end{aligned}$$
(4)

Based on the magnitude of \(r_k \), the classical TR methods decide whether the trial step is accepted or rejected. More precisely, for a given \(\mu \in \left( {0,\,1} \right) ,\) if \(r_k \ge \mu \), then the trial step is accepted and the new point is introduced by \(x_{k+1} =x_k +d_k \). In this case, the TR radius is updated appropriately. On the other hand, if the actual reduction is poor compared with the predicted reduction, i.e. \(r_k <\mu \), then the trial step is rejected and the current point remains unchanged for the next iteration. In this case, the TR radius is shrunk in an appropriate manner. This process is repeated until the convergence criteria hold, see e.g. [5, 12, 18, 20].
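As a rough illustration of this accept/reject mechanism, one iteration of a classical TR method could be organized as in the sketch below; the threshold 0.75 and the enlargement/shrinking factors 2 and 0.5 are illustrative choices, and `solve_subproblem` stands for any routine that approximately solves (2).

```python
import numpy as np

def classical_tr_step(f, x, g, B, delta, solve_subproblem, mu=0.1):
    """One classical trust-region iteration: trial step, TR ratio, accept/reject."""
    d = solve_subproblem(g, B, delta)               # approximate minimizer of (2)
    pred = -(g.dot(d) + 0.5 * d.dot(B).dot(d))      # Pred_k = m_k(0) - m_k(d_k), cf. (4)
    ared = f(x) - f(x + d)                          # Ared_k, cf. (3)
    r = ared / pred                                 # TR ratio r_k
    if r >= mu:                                     # trial step accepted
        x_new = x + d
        delta_new = 2.0 * delta if r > 0.75 else delta   # enlarge on good agreement
    else:                                           # trial step rejected
        x_new = x
        delta_new = 0.5 * delta                     # shrink the radius
    return x_new, delta_new, r
```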

Appropriately choosing the initial radius and the rule for updating the TR radii are crucial for the performance of standard TR methods, see e.g. [1, 15, 22–24, 30, 31], and can meaningfully decrease the number of subproblems that must be solved. Sartenaer [22] proposed a strategy for automatically determining an initial radius by letting it be \(\parallel g_0 \parallel \). Later, from a practical point of view, Lin and Moré [15] introduced a better choice for the initial radius by letting \(\Delta _0 =\kappa \parallel g_0 \parallel \), where \(\kappa \) is a positive constant. It is worth mentioning that when the sequence \(\{x_k \}\), generated by the classical TR algorithm, converges to a minimizer \(x^{*}\), the ratio \(\{r_k \}\) may converge to 1. Thus, for sufficiently large \(k\), the TR radius might be larger than a positive constant, and the trust region constraint then plays no role in the final iterations. Due to this fact, Fan and Yuan [10] proposed a TR method with a radius \(\Delta _k \) converging to zero. In their approach, the radius is given by \(\Delta _k =v_k \parallel g_k \parallel \), where \(v_k\) is updated according to the magnitude of the ratio \(r_k \). Zhang et al. [30] suggested another scheme for adaptively determining the radius. They employed the adaptive formula \(\Delta _k =c^{p}\parallel g_k \parallel \parallel \hat{{B}}_k^{-1} \parallel \) in their scheme, where \(c\in (0,1)\) is a constant, \(p\) is a nonnegative integer and \(\hat{{B}}_k =B_k +iI\) is a positive definite matrix, for some \(i\in N\). Recently, some variants of adaptive trust region methods based on the following updating formula have been proposed in [6, 21]:

$$\begin{aligned} \Delta _k =v_k \parallel g_k \parallel \parallel \hat{{B}}_k^{-1} \parallel , \end{aligned}$$
(5)

where \(v_k \) is updated according to the magnitude of \(r_k \).

Although Zhang's method uses the current gradient and Hessian information, computing \(\Delta _k \) in (5) requires an estimate of \(\parallel \hat{B}_k^{-1} \parallel \) at each iteration, which imposes extra computational cost. In order to overcome this drawback, a simple adaptive trust region method was proposed by Shi and Wang [24], in which the radius is updated by \(\Delta _k =c^{p}\parallel g_k \parallel ^{3}/g_k^T \hat{{B}}_k g_k \), where \(c\in (0,1)\) is a constant, \(\hat{B}_k \) is a positive definite matrix and \(p\) is a nonnegative integer. A practically efficient and globally convergent adaptive TR method has been suggested by Shi and Guo in [23]. In their method, the radius is updated by \(\Delta _k =\alpha _k \parallel q_k \parallel \), where \(q_k \) is chosen so that, for given \(\tau \in (0,1]\),

$$\begin{aligned} -\frac{g_k^T q_k }{\parallel g_k \parallel .\parallel q_k \parallel }\ge \tau . \end{aligned}$$
(6)

Moreover, for given \(\rho \in (0,1)\), \(\alpha _k \) is the largest possible \(\alpha \in \left\{ {s_k ,\,\rho s_k ,\,\rho ^{2}s_k ,\ldots } \right\} \), so that \(r_k \ge \mu \), where \(s_k \) is determined by

$$\begin{aligned} s_k =-\frac{g_k^T q_k }{q_k^T \hat{{B}}_k q_k }, \end{aligned}$$
(7)

and \(\hat{{B}}_k \,\)is defined by

$$\begin{aligned} \hat{B}_k =B_k +iI, \end{aligned}$$
(8)

in which \(I\) is the identity matrix and \(i\) is the smallest nonnegative integer so that \(\hat{B}_k \) is a positive definite matrix.
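For illustration, the quantities appearing in (6)–(8) might be computed as in the sketch below; testing positive definiteness by a Cholesky factorization and the value \(\tau =0.1\) are implementation choices, not part of the original scheme.

```python
import numpy as np

def make_positive_definite(B):
    """B_hat = B + i*I as in (8): the smallest nonnegative integer i making
    the shifted matrix positive definite (checked via Cholesky)."""
    n = B.shape[0]
    i = 0
    while True:
        try:
            np.linalg.cholesky(B + i * np.eye(n))
            return B + i * np.eye(n), i
        except np.linalg.LinAlgError:
            i += 1

def shi_guo_quantities(g, B, q, tau=0.1):
    """Return B_hat, the steplength s_k of (7), and whether q satisfies the angle condition (6)."""
    B_hat, _ = make_positive_definite(B)
    s = -g.dot(q) / q.dot(B_hat).dot(q)                                     # (7)
    angle_ok = -g.dot(q) / (np.linalg.norm(g) * np.linalg.norm(q)) >= tau   # (6)
    return B_hat, s, angle_ok
```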

In monotone trust region methods, the objective function value is required to decrease at each iteration. It has been shown that this requirement may lead to a slow convergence rate on some problems [13, 14]. One way to overcome this difficulty is to use nonmonotone techniques in the TR framework. The first nonmonotone line search technique appears to be the so-called watchdog technique, proposed by Chamberlain et al. [4]. Later, Grippo et al. [13] provided a nonmonotone line search technique for Newton's method and extended it to solving unconstrained optimization problems [14]. In their approach, for given \(\gamma \in (0,\,1)\), the steplength \(\alpha _k \) is chosen so that the following condition holds:

$$\begin{aligned} f\left( {x_k \,+\,\alpha _k d_k } \right) \le f_{l\left( k \right) } +\gamma \alpha _k \nabla f(x_k )^{\mathrm{T}}d_k , \end{aligned}$$

where the nonmonotone term \(f_{l(k)} \) is defined by

$$\begin{aligned} f_{l(k)} =\mathop {\max }\nolimits _{0\le j\le m(k)} \left\{ {f\left( {x_{k-j} } \right) } \right\} \!, \end{aligned}$$
(9)

in which, for given nonnegative integer \(M\), \(m\left( 0 \right) =0\) and \(0\le m\left( k \right) \le \min \,\{m(k-1)+1,M\}\), for all \(k\ge 1\).
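The nonmonotone term (9) can be evaluated by keeping a window of recent function values, as in the following sketch; taking \(m(k)=\min \{k,M\}\) is one admissible choice within the stated recursion.

```python
from collections import deque

class GrippoTerm:
    """Maintains f_{l(k)} = max of the last m(k)+1 function values, with m(k) = min(k, M)."""
    def __init__(self, M):
        self.window = deque(maxlen=M + 1)   # stores f(x_{k-m(k)}), ..., f(x_k)

    def update(self, f_xk):
        self.window.append(f_xk)
        return max(self.window)             # the term f_{l(k)} of (9)

# Example with M = 3: the term tolerates temporary increases of f
term = GrippoTerm(M=3)
for fk in [10.0, 7.0, 9.0, 6.0, 8.0]:
    print(term.update(fk))                  # prints 10, 10, 10, 10, 9
```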

Owing to the practical behavior of nonmonotone techniques, several authors have been motivated to employ nonmonotone strategies in optimization methods, especially in TR methods. The first variant of this kind goes back to the work of Deng et al. [8], in which the ratio \(r_k \) is modified according to the nonmonotone strategy proposed in [13]. Later on, several works have been developed based on various nonmonotone techniques. Among them are the works proposed by Zhou and Xiao [28, 32], Xiao and Chu [27], Toint [25, 26], Dai [7] and Panier and Tits [19]. However, Grippo's nonmonotone technique has some disadvantages, see e.g. [2, 29]. In order to overcome these drawbacks, Zhang and Hager [29] suggested another nonmonotone line search technique in which the maximum of the function values in (9) is replaced by an average of the function values. More precisely, in their approach, for given \(\gamma \in (0,\,1)\), the steplength \(\alpha _k \) is chosen so that

$$\begin{aligned} f\left( {x_k \,+\,\alpha _k d_k } \right) \le C_k +\gamma \alpha _k \nabla f(x_k )^{\hbox {T}}d_k , \end{aligned}$$

where the nonmonotone term \(C_k \) is defined by

$$\begin{aligned} C_k =\left\{ {{\begin{array}{lll} f({x_k }) &{}\quad \hbox { if }&{} \,k=0, \\ \frac{\left( {\vartheta _{k-1} Q_{k-1} C_{k-1} +f({x_k })} \right) }{Q_k } &{}\quad \hbox { if }&{} \,k\ge 1, \\ \end{array} }} \right. \end{aligned}$$
(10)

and

$$\begin{aligned} Q_k =\left\{ {{\begin{array}{lll} 1&{}\quad \hbox { if }&{} \,k=0, \\ \vartheta _{k-1} Q_{k-1} +1 &{}\quad \hbox { if }&{} \,k\ge 1, \\ \end{array} }} \right. \end{aligned}$$

where \(\vartheta _{k-1} \in \left[ {\vartheta _{\mathrm{min}} , \vartheta _{\mathrm{max}} } \right] \) and \(0\le \vartheta _{\mathrm{min}} <\vartheta _{\mathrm{max}} \le 1\).
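The recursion (10) needs only two scalars per iteration, as the sketch below illustrates; a fixed \(\vartheta =0.85\) (the value \(\vartheta _0\) used in Sect. 5) is assumed here for simplicity.

```python
def zhang_hager_update(C_prev, Q_prev, f_xk, theta=0.85):
    """One step of the Zhang-Hager recursion (10): returns (C_k, Q_k)."""
    Q_k = theta * Q_prev + 1.0
    C_k = (theta * Q_prev * C_prev + f_xk) / Q_k
    return C_k, Q_k

# Example: C_k is a weighted average of all function values seen so far
C, Q = 5.0, 1.0            # C_0 = f(x_0), Q_0 = 1
for fk in [4.0, 6.0, 3.0]:
    C, Q = zhang_hager_update(C, Q, fk)
    print(C, Q)
```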

Recently, nonmonotone techniques and adaptive strategies have been employed simultaneously within the framework of classical TR methods in order to obtain algorithms that are efficient both in terms of convergence theory and of practical performance. The first work in this area goes back to 2003, when Zhang et al. [31] combined Grippo's nonmonotone technique with an adaptive trust region method. Fu and Sun [11] used Zhang's adaptive method with a new structured nonmonotone technique in which the predicted reduction is computed differently than in the classical TR methods. In a recent work, Cui and Wu [6] provided a nonmonotone adaptive approach based on a combination of the nonmonotone term (10) with the adaptive strategy provided in [30]. Ahookhosh and Amini [1] combined a variant of Shi and Guo's adaptive scheme with Grippo's nonmonotone technique, and showed that their method is practically efficient while possessing the global convergence property under some standard assumptions. Since Grippo's nonmonotone technique suffers from some drawbacks [2, 29], Ahookhosh and Amini [2] proposed the new nonmonotone term

$$\begin{aligned} R_k =\varepsilon _k f_{l\left( k \right) } +\left( {1-\varepsilon _k } \right) f({x_k }), \end{aligned}$$
(11)

where \(\varepsilon _k \in \left[ {\varepsilon _{\mathrm{min}} ,\varepsilon _{\mathrm{max}} } \right] \), \(\varepsilon _{\mathrm{min}} \in \left[ {0,1} \right) \), \(\varepsilon _{\mathrm{max}} \in (\varepsilon _{\mathrm{min}} ,1]\) and \(f_{l(k)} \) is defined by (9). This nonmonotone term is motivated by the fact that the best convergence results are obtained by a stronger nonmonotone strategy whenever the iterates are far from the optimum and by a weaker nonmonotone strategy whenever the iterates are close enough to the optimum, see e.g. [29]. By this definition, a stronger nonmonotone strategy is obtained whenever \(\varepsilon _k \) is close to 1 and a weaker strategy is followed whenever \(\varepsilon _k \) is close to 0. Although the algorithm proposed in [2] has some appealing properties, especially in practical performance, it essentially uses \(R_k =f(x_k )\) in the first iterations, until \(f_{l(k)}\) is able to play an active role in (11). This may limit the performance of the algorithm of [2] in the early iterations, since for some problems the iterates reach the bottom of steep, curved valleys precisely in those iterations. In order to overcome this difficulty, in this paper we propose a new relaxed nonmonotone technique based on (11).

In this paper, we propose a new adaptive TR algorithm which incorporates Shi and Guo's adaptive trust region method with a variant of the nonmonotone technique provided in [2]. Our approach aims to relax the acceptance of the trial step \(d_k \) by allowing the new nonmonotone term to be larger than \(R_k \), especially in the first iterations, while preserving the properties of \(R_k \). We establish the global convergence property as well as the local superlinear convergence rate of the new algorithm under some suitable and standard assumptions. The proposed method is implemented in the MATLAB environment and applied to some test problems. Numerical results confirm that the proposed algorithm is practically efficient, too.

The rest of the paper is organized as follows: In Sect. 2, we describe the structure of the new proposed nonmonotone adaptive trust region algorithm. Section 3 is devoted to establishing the global convergence properties of the algorithm under some standard assumptions. The superlinear convergence rate is provided in Sect. 4. Numerical results of running the new algorithm on some test problems, taken from the literature, are given in Sect. 5. We end the paper with some concluding remarks in Sect. 6.

2 The new algorithm

In this section, we propose a new adaptive nonmonotone trust region algorithm based on a combination of the nonmonotone technique proposed in [2] and Shi and Guo's adaptive scheme provided in [23]. In our algorithm, the new nonmonotone term is defined by \(\left( {1+\varphi _k } \right) R_k \), where \(R_k \) is given by (11) and \(\varphi _k \) is determined by

$$\begin{aligned} \varphi _k =\left\{ {{\begin{array}{lll} \eta _k &{}\,\quad \hbox {if}&{}R_k >0, \\ 0 &{}\quad \, \hbox {if}&{}R_k \le 0, \\ \end{array} }} \right. \end{aligned}$$
(12)

in which \(\left\{ {\eta _k } \right\} \) is a positive sequence satisfying the following condition:

$$\begin{aligned} \sum \limits _{k=0}^\infty \eta _k \le \eta <\infty . \end{aligned}$$
(13)

It is worth mentioning that, as \(k\rightarrow \infty \), we have \(\eta _k \rightarrow 0\), and therefore the proposed nonmonotone term converges to that proposed in [2]. In contrast to the nonmonotone term (11), the proposed relaxed term retains the properties of a strong nonmonotone strategy, especially in the first iterations.
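As a small illustration, the relaxed reference value \(\left( {1+\varphi _k } \right) R_k \) can be evaluated as in the sketch below; the particular summable sequence \(\eta _k \) is the one reported in Sect. 5 (\(\eta _0 =10^{4}\) and \(\eta _k =10^{3}/k^{2}\)), and the snippet is an illustration rather than part of the formal algorithm specification.

```python
def eta(k):
    """Summable sequence satisfying (13), taken from Sect. 5."""
    return 1.0e4 if k == 0 else 1.0e3 / k**2

def relaxed_reference(f_lk, f_xk, eps_k, k):
    """Relaxed nonmonotone reference value (1 + phi_k) * R_k, with R_k from (11)
    and phi_k from (12)."""
    R_k = eps_k * f_lk + (1.0 - eps_k) * f_xk    # (11)
    phi_k = eta(k) if R_k > 0 else 0.0           # (12)
    return (1.0 + phi_k) * R_k
```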

The trust region radius is updated by \(\Delta _k =\min \left\{ {v_k s_k \parallel q_k \parallel ,\,\Delta _{\mathrm{max}} } \right\} \), where \(q_k \) and \(s_k \) are the parameters of Shi and Guo's adaptive scheme, defined by (6) and (7), respectively, and \(v_k \) is appropriately adjusted at each iteration according to the magnitude of the following nonmonotone ratio:

$$\begin{aligned} r_k =\frac{\left( {1+\varphi _k } \right) R_k -f\left( {x_k +d_k } \right) }{Pred_k }. \end{aligned}$$
(14)

The whole procedure of the new nonmonotone adaptive TR algorithm for solving (1) is outlined in Algorithm 2.1.

[Algorithm 2.1 (the relaxed nonmonotone adaptive trust region algorithm) is presented as a figure in the original article.]
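Since Algorithm 2.1 is displayed only as a figure, the sketch below reconstructs one plausible reading of its main loop from the ingredients described in this section; it is an illustration under explicit assumptions rather than the authors' implementation. In particular, we fix \(q_k =-g_k \) and a constant \(\varepsilon _k =0.5\), replace the Steihaug–Toint subproblem solver by a simple Cauchy-type step, and guess the update of \(v_k \) in Step 4 (enlargement by \(\sigma _1 \) on very successful iterations, shrinking by \(\sigma _0 \) on rejections); the parameter values are those reported in Sect. 5.

```python
import numpy as np

def rnatr_sketch(f, grad, hess, x0, tol=1e-8, max_iter=1000,
                 mu1=0.9, mu2=0.1, sigma0=1/8, sigma1=6.0,
                 v0=0.1, v_max=1e5, delta_max=100.0, eps_k=0.5, M=10):
    """Illustrative main loop of a relaxed nonmonotone adaptive TR method in the
    spirit of Algorithm 2.1.  The v_k update, the Cauchy-type trial step and the
    simplified nonmonotone memory are assumptions made for this sketch."""
    x, v = np.asarray(x0, dtype=float), v0
    f_hist = [f(x)]                                   # values at successful iterates (Remark 2.1)
    for k in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        # B_hat = B + i*I positive definite, cf. (8)
        n, i = len(x), 0
        while np.any(np.linalg.eigvalsh(B + i * np.eye(n)) <= 0.0):
            i += 1
        B_hat = B + i * np.eye(n)
        q = -g                                        # angle condition (6) holds trivially
        s = -g.dot(q) / q.dot(B_hat).dot(q)           # (7)
        delta = min(v * s * np.linalg.norm(q), delta_max)   # adaptive radius
        # Cauchy-type trial step inside the trust region (stand-in for Steihaug-Toint)
        t = min(delta / np.linalg.norm(g), g.dot(g) / g.dot(B_hat).dot(g))
        d = -t * g
        pred = -(g.dot(d) + 0.5 * d.dot(B).dot(d))    # Pred_k, cf. (4)
        # relaxed nonmonotone reference value
        f_l = max(f_hist[-(M + 1):])                  # (9) over recent successful iterates
        R = eps_k * f_l + (1.0 - eps_k) * f(x)        # (11)
        eta_k = 1.0e4 if k == 0 else 1.0e3 / k**2     # summable sequence, cf. (13)
        phi = eta_k if R > 0 else 0.0                 # (12)
        r = ((1.0 + phi) * R - f(x + d)) / pred       # nonmonotone ratio (14)
        if r >= mu2:                                  # successful iteration
            x = x + d
            f_hist.append(f(x))
            if r >= mu1:
                v = min(sigma1 * v, v_max)            # assumed enlargement rule
        else:                                         # unsuccessful iteration
            v = sigma0 * v                            # shrink, cf. Step 4
    return x
```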

Throughout the paper, we use the following two index sets in our analysis:

$$\begin{aligned} I=\left\{ {k\,:\,r_k \ge \,\mu _2 } \right\} , J=\left\{ {k\,:\,r_k <\mu _2 } \right\} \!. \end{aligned}$$

We refer to the \(k\)-th iteration as a successful iteration whenever \(x_{k+1} =x_k +d_k \), i.e. \(k\in I\), and as an unsuccessful iteration whenever \(x_{k+1} =x_k \), i.e. \(k\in J\).

Remark 2.1

Let the index set \(I\) be denoted by \(\left\{ {k_0 ,k_1 ,k_2 ,\ldots } \right\} .\) For a given nonnegative integer \(M\), set \(m\left( 0 \right) =0\) and \(0\le m\left( {k_i } \right) \le \min \,\{m(k_{i-1} )+1,\,M\}\). At the \(k\)th iteration, let \(k_i \in I\) be the largest index so that \(k_i \le k\). In our setting, we define \(f_{l(k)} =\mathop {\max }\nolimits _{0\le j\le m(k_i )} f_{k_{i-j} } \). Indeed, \(f_{l(k)} \) is considered as the maximum value of \(f(x)\) over the last \(m(k_i )+1\) successful iteration points.

3 Convergence analysis

In this section, our aim is to establish convergence properties of Algorithm 2.1 under some suitable assumptions. To do so, the following assumptions are considered on the problem:

A1:

The level set \(\Omega =\{x\in R^{n} \mid f\left( x \right) \le e^{\eta }\left| {f\left( {x_0 } \right) } \right| \}\) is a closed and bounded set, where \(\eta \) is defined by (13) and \(f\) is a twice continuously differentiable function over \(\Omega \).

A2:

Matrices \(B_k \) are uniformly bounded, i.e., there exists a positive constant \(m_1 \) so that, for all \(k\in N\cup \{0\}\), we have \(\parallel B_k \parallel \le m_1\).

Remark 3.1

Assumption A1 implies that there exists a positive constant \(m_2 \), so that \(\parallel \nabla ^{2}f(x_k )\parallel \le m_2 \), for all \(x_k \in \Omega \).

Remark 3.2

Assumption A2 implies that the matrices \(\hat{B}_k \) are also uniformly bounded. This can be proved by considering the procedure for generating \(\hat{B}_k \) in Shi and Guo's approach. Indeed, Assumption A2 implies that \(\parallel \hat{B}_k \parallel =\parallel B_k +iI\parallel \le 2m_1 +1\), where the inequality follows from the fact that the smallest nonnegative integer \(i\) satisfying (8) does not exceed \(m_1 +1\), since \(B_k +iI\) is positive definite whenever \(i>m_1 \).

Remark 3.3

In order to analyze the convergence properties of Algorithm 2.1, we assume that the trial step \(d_k \) is approximately computed by Algorithm 2.6 in [18]. Therefore, as has been shown in [18], there exists a constant \(\theta \in (0,1)\) so that \(d_k \) satisfies the following inequality:

$$\begin{aligned} Pred_k =m_k \left( 0 \right) -m_k (d_k )\ge \theta \parallel g_k \parallel \min \left\{ {\Delta _k ,\frac{\parallel g_k \parallel }{\parallel B_k \parallel }} \right\} . \end{aligned}$$

This inequality is known as a sufficient reduction condition in the literature and implies that \(d_k \ne 0\) whenever \(g_k \ne 0\).

Note that, throughout this section, we use the definition of \(f_{l(k)} \) as given in Remark 2.1.

Lemma 3.1

For all \(k\), we have

$$\begin{aligned} \left| {f\left( {x_k } \right) -f\left( {x_k +d_k } \right) -Pred_k } \right| \le O\left( {\parallel d_k \parallel ^{2}} \right) , \end{aligned}$$

where \(Pred_k \) is defined by (4).

Proof

The proof is easily obtained by using Taylor's expansion, Assumption A2 and Remark 3.1. \(\square \)

Lemma 3.2

Let Assumptions A1 and A2 hold and the sequence \(\left\{ {x_k } \right\} \) be generated by Algorithm 2.1. Moreover, suppose that there exists a constant \(\delta \in (0,\,1)\), so that \(\parallel g_k \parallel >\delta \), for all \(k\). Then, for each \(k\), there exists a nonnegative integer \(p\), so that \(x_{k+p+1} \) is a successful iteration point.

Proof

To derive a contradiction, suppose that there exists an iteration \(k\) so that, for every nonnegative integer \(p\), \(x_{k+p+1} \) is an unsuccessful iteration point, i.e.

$$\begin{aligned} r_{k+p} <\mu _2 , \quad \hbox {for all } p=0,1,2,\ldots \end{aligned}$$
Using (8), we have \(q_{k+p}^T \hat{B}_{k+p} q_{k+p} >0\). Therefore, there exists a sufficiently small positive constant \(\varrho \) so that

$$\begin{aligned} 0<\varrho \parallel q_{k+p} \parallel ^{2}\le q_{k+p}^T \hat{B} _{k+p} q_{k+p} . \end{aligned}$$

Thus, from Step 4 of Algorithm 2.1 and (7), we have

$$\begin{aligned} \Delta _{k+p+1}&\le \sigma _0^{p+1} v_{k\,} s_{k+p} \parallel q_{k+p} \parallel \\&= -\sigma _0^{p+1} v_k \frac{g_{k+p}^T q_{k+p} }{q_{k+p}^T \hat{B}_{k+p} q_{k+p} }\parallel q_{k+p} \parallel \\&\le \sigma _0^{p+1} v_k \frac{\parallel g_{k+p} \parallel \parallel q_{k+p} \parallel }{\varrho \parallel q_{k+p} \parallel ^{2}}\parallel q_{k+p} \parallel \\&\le \frac{v_{\mathrm{max}} \parallel g_k \parallel }{\varrho }\sigma _0^{p+1} ,\\ \end{aligned}$$

where the second inequality follows from the Cauchy–Schwarz inequality and the last inequality is obtained from (16) and the fact that, in unsuccessful iterations, we have \(x_{k+p} =x_k \). Now, since \(\sigma _0 \in (0,\,1)\), the latter inequality implies that

$$\begin{aligned} \mathop {\lim }\limits _{p\rightarrow \infty } \Delta _{k+p+1} =0. \end{aligned}$$

Therefore, from Lemma 3.1 and (17), we have

$$\begin{aligned} \left| {\frac{f\left( {x_{k+p} } \right) -f\left( {x_{k+p} +d_{k+p} } \right) }{Pred_{k+p} }-1} \right|&= \left| {\frac{f\left( {x_{k+p} } \right) -f\left( {x_{k+p} +d_{k+p} } \right) -Pred_{k+p} }{Pred_{k+p} }} \right| \\&\le \frac{O(\parallel d_{k+p} \parallel ^{2})}{\theta \parallel g_{k+p} \parallel \min \left\{ {\Delta _{k+p} ,\frac{\parallel g_{k+p} \parallel }{\parallel B_{k+p} \parallel }} \right\} }\\&\le \frac{O\left( {\Delta _{k+p}^{2}} \right) }{\theta \delta \min \left\{ {\Delta _{k+p} ,\frac{\delta }{m_1 }} \right\} }. \end{aligned}$$

This implies that \(\left| {\frac{f\left( {x_{k+p} } \right) -f\left( {x_{k+p} +d_{k+p} } \right) }{Pred_{k+p} }-1} \right| \rightarrow 0\), as \(p\rightarrow \infty \). Thus, for sufficiently large \(p\), we have

$$\begin{aligned} r_{k+p} =\frac{\left( {1+\varphi _{k+p} } \right) R_{k+p} -f\left( {x_{k+p} +d_{k+p} } \right) }{Pred_{k+p} }\ge \frac{f\left( {x_{k+p} } \right) -f\left( {x_{k+p} +d_{k+p} } \right) }{Pred_{k+p} }\rightarrow 1, \end{aligned}$$

which contradicts (19). This completes the proof of the lemma.\(\square \)

Remark 3.4

As a consequence of Lemma 3.2, one can see that whenever Algorithm 2.1 does not stop after a finite number of iterations, i.e. the sequence \(\left\{ {x_k } \right\} \) is infinite, the index set \(I\) is also infinite.

Lemma 3.3

Suppose that \(\left\{ {x_k } \right\} \) is generated by Algorithm 2.1. Then, for all \(k\in I\), we have

$$\begin{aligned} f_{k+1} \le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k} \left( {1+\varphi _i } \right) -\omega _k , \end{aligned}$$
(20)

where \(\omega _k :=\theta \mu _2 \min \left\{ {\parallel g_k \parallel \Delta _k ,\frac{\parallel g_k \parallel ^{2}}{\parallel B_k \parallel }} \right\} \ge 0\), \(\mu _2 \in (0,\,1)\) and \(\theta \in (0,\,1)\) is the same constant as in Remark 3.3.

Proof

Let \(k\in I\); then \(x_k +d_k \) is a successful iteration point. From (14) and (17), for all \(k\in I\), we have

$$\begin{aligned} f_{k+1} =f\left( {x_k +d_k } \right) \le \left( {1+\varphi _k } \right) R_k -\mu _2 Pred_k \le \left( {1+\varphi _k } \right) R_k -\omega _k . \end{aligned}$$
(21)

We proceed by induction. To establish the first step of the induction, the following two cases are considered:

Case 1 Let \(k\,=\,0\) be a successful iteration. In this case, using (21), we obtain

$$\begin{aligned} f_1 \le \left( {1+\varphi _0 } \right) R_0 -\omega _0 =\left( {1+\varphi _0 } \right) f_0 -\omega _0 \le \left( {1+\varphi _0 } \right) \left| {f_0 } \right| -\omega _0 , \end{aligned}$$

where the equality follows from (11) and the last inequality from the fact that \(\varphi _0 \ge 0\), by (12). Now assume that (20) holds up to a successful iteration \(k\ge 1\), i.e.,

$$\begin{aligned} f_{k+1} \le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k} \left( {1+\varphi _i } \right) -\omega _k . \end{aligned}$$

Due to Lemma 3.2, there exists the smallest positive integer \(p\), so that \(k+\,p\) is a successful iteration. We show that (20) holds for \(k+p\in I\), i.e.,

$$\begin{aligned} f_{k+p+1} \le \left| {f_0 } \right| \mathop {\prod }\limits _{i=0}^{k+p} \left( {1+\varphi _i } \right) -\omega _{k+p} . \end{aligned}$$

For this purpose, from (12) and (21), we have

$$\begin{aligned} f_{k+p+1}&\le \left( {1+\varphi _{k+p} } \right) R_{k+p} -\omega _{k+p} \\&= \left( {1+\varphi _{k+p} } \right) R_{k+1} -\omega _{k+p} \\&= \left( {1+\varphi _{k+p} } \right) \left[ {\varepsilon _k f_{l\left( {k+1} \right) } +\left( {1-\varepsilon _k } \right) f_{k+1} } \right] -\omega _{k+p} \\&\le \left( {1+\varphi _{k+p} } \right) \left[ {\varepsilon _k f_{l\left( {k+1} \right) } +\left( {1-\varepsilon _k } \right) f_{l(k+1)} } \right] -\omega _{k+p} \\&\le \left( {1+\varphi _{k+p} } \right) f_{l\left( {k+1} \right) } -\omega _{k+p} \\&\le \left( {1+\varphi _{k+p} } \right) \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{l\left( {k+1} \right) -1} \left( {1+\varphi _i } \right) -\omega _{k+p}, \\ \end{aligned}$$

where the first equality is obtained from the fact that iterations \(k+1,\ldots ,k+p-1\) are unsuccessful, and therefore \(R_{k+1} =\cdots =R_{k+p} \), and the last inequality follows from Remark 2.1 and the induction hypothesis, considering that \(k+1\) is an unsuccessful iteration and \(l\left( {k+1} \right) \le k+1\). Now, as \(l\left( {k+1} \right) \le k+1\le k+p\), we conclude that

$$\begin{aligned} f_{k+p+1} \le \left( {1+\varphi _{k+p} } \right) \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k+p-1} \left( {1+\varphi _i } \right) -\omega _{k+p} \le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k+p} \left( {1+\varphi _i } \right) -\omega _{k+p} . \end{aligned}$$

Case 2  Let \(k=0\in J\). Due to Lemma 3.2, there exists a smallest positive integer \(p\) so that iteration \(p\) is successful. In this case, we start the induction at \(k=p\) (strong induction). The rest of the proof is similar to Case 1. \(\square \)

Corollary 3.1

For all \(k\), one has:

$$\begin{aligned} f_{k+1} \le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^k \left( {1+\varphi _i } \right) . \end{aligned}$$

Proof

From (20), it is easily seen that the statement is true for all \(k\in I\). Now, let \(k\in J\) and consider the following two cases:

Case 1  Assume that there exists at least one successful iteration before iteration \(k\). Then there exists a smallest positive integer \(p\) so that \(k-p\in I\), and Lemma 3.3 implies that:

$$\begin{aligned} f_{k-p+1} \le \left| {f_0 } \right| \mathop {\prod }\nolimits _{i=0}^{k-p} \left( {1+\varphi _i } \right) \!. \end{aligned}$$

Moreover, we have \(f_{k-p+1} =\cdots =f_k =f_{k+1}\). Therefore, using (12), we obtain:

$$\begin{aligned} f_{k+1} \le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^k \left( {1+\varphi _i } \right) \!. \end{aligned}$$

Case 2  Assume that there is no successful iteration before iteration \(k\). In this case, we have: \(f_0 =f_1 =\cdots =f_{k+1} \). Therefore, using (12), we obtain:

$$\begin{aligned} f_{k+1} =f_0 \le \left| {f_0 } \right| \mathop {\prod }\limits _{i=0}^k \left( {1+\varphi _i } \right) . \end{aligned}$$

This completes the proof of the corollary. \(\square \)

Lemma 3.4

For the sequence \(\left\{ {x_k } \right\} \), generated by Algorithm 2.1, one has \(\left\{ {x_k } \right\} \subseteq \Omega \,\).

Proof

We proceed by induction on \(k\). For \(k=0\), the result is trivial from Assumption A1. Assume that \(x_k \in \Omega \) (induction hypothesis). We show that \(x_{k+1} \in \Omega \). Using Corollary 3.1 and the inequality between the geometric and arithmetic means, we have

$$\begin{aligned} f_{k+1}&\le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k} \left( {1+\varphi _i } \right) \le \left| {f_0 } \right| \left( {\frac{1}{k+1}\mathop \sum \limits _{i=0}^{k} \left( {1+\varphi _i } \right) } \right) ^{k+1} \\&\le \left| {f_0 } \right| \left( {1+\frac{\eta }{k+1}} \right) ^{k+1}\le e^{\eta }\left| {f_0 } \right| , \end{aligned}$$

where the last two inequalities follow from (12), (13) and the fact that the sequence \(\left\{ {\left( {1+\frac{\eta }{k+1}} \right) ^{k+1}} \right\} \) is increasing and converges to \(e^{\eta }\). Therefore, \(x_{k+1} \in \Omega \), and the proof is completed. \(\square \)

Lemma 3.5

Let \(q_k \) satisfy (6) and

$$\begin{aligned} \mathop {\lim }\limits _{k\rightarrow \infty } \frac{-g_k^T q_k }{\parallel q_k \parallel }=0. \end{aligned}$$

Then, we have

$$\begin{aligned} \mathop {\lim }\nolimits _{k\rightarrow \infty } \parallel g_k \parallel =0. \end{aligned}$$

Proof

Using (6), we have

$$\begin{aligned} 0\le \mathop {\lim }\limits _{k\rightarrow \infty } \tau \parallel g_k \parallel \le \mathop {\lim }\limits _{k\rightarrow \infty } \frac{-g_k^T q_k }{\parallel q_k \parallel \parallel g_k \parallel }\parallel g_k \parallel =\mathop {\lim }\limits _{k\rightarrow \infty } \frac{-g_k^T q_k }{\parallel q_k \parallel }=0. \end{aligned}$$

This completes the proof of the lemma.\(\square \)

Lemma 3.6

Assume that the index set \(I\) is infinite and is denoted by \(\left\{ {k_0 ,k_1 ,k_2 ,\ldots } \right\} \). Then, for each successful iteration \(k_j \), there exist a nonnegative integer \(L\) and \(0\le r\le M-1\), so that \(k_j =k_{LM+r} \) and

$$\begin{aligned} f_{k_{LM+r} } \le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k_{LM+r} -1} \left( {1+\varphi _i } \right) -\mathop \sum \limits _{i=0}^{L} \tilde{\omega }_i , \end{aligned}$$

where \(\tilde{\omega }_i =\mathop {\min }\nolimits _{0\le r\le M-1} \omega _{k_{iM+r}}\).

Proof

See Appendix. \(\square \)

The following theorem states the global convergence property of Algorithm 2.1.

Theorem 3.1

Suppose that Assumptions A1 and A2 hold. Then, Algorithm 2.1 either stops at a stationary point of problem (1) or generates an infinite sequence \(\{x_k \}\) so that

$$\begin{aligned} \mathop {\lim \inf }\limits _{k\rightarrow \infty } \parallel g_k \parallel =0. \end{aligned}$$

Proof

Suppose that Algorithm 2.1 does not stop at a stationary point. We show that

$$\begin{aligned} \mathop {\lim \inf }\limits _{k\rightarrow \infty } \parallel g_k \parallel =0. \end{aligned}$$

To derive a contradiction, assume that there exists a positive constant \(\delta \) so that, for all \(k\),

$$\begin{aligned} \parallel g_k \parallel \ge \delta . \end{aligned}$$

Using Lemma 3.6 and (23), we have

$$\begin{aligned} \mathop \sum \limits _{i=0}^L \tilde{\omega }_i&\le \left| {f_0 } \right| \mathop \prod \limits _{i=0}^{k_{LM+r} -1} \left( {1+\varphi _i } \right) -f_{k_{LM+r} } \\ \,&\le e^{\eta }\left| {f_0 } \right| -f_{k_{LM+r} } . \\ \end{aligned}$$

Thus, letting \(L\rightarrow \infty \), Assumption A1 implies that

$$\begin{aligned} \mathop \sum \limits _{i=0}^{\infty } \tilde{\omega }_i <\infty . \end{aligned}$$

Due to the definition of \(\tilde{\omega }_i \), there exists \(0\le \bar{r} \le M-1\), so that \(\tilde{\omega }_i =\omega _{k_{iM+\bar{r}}}\). Letting \(\beta _i =k_{iM+\bar{r}} \), from Lemma 3.3, we have

$$\begin{aligned} \tilde{\omega }_i =\omega _{\beta _i } =\theta \mu _2 \min \left\{ {\parallel g_{\beta _i } \parallel \Delta _{\beta _i } ,\frac{\parallel \,g_{\beta _i } \parallel ^{2}}{\parallel B_{\beta _i } \parallel }} \right\} \rightarrow 0,\,\hbox { as }i\rightarrow \infty . \end{aligned}$$

Using (25), (26) and Assumption A2, we conclude that

$$\begin{aligned} \mathop {\lim }\limits _{i\rightarrow \infty } \Delta _{\beta _i } =0. \end{aligned}$$

Therefore, for sufficiently large \(i\), we have \(\Delta _{\beta _i } =\min \{v_{\beta _i } s_{\beta _i } \parallel q_{\beta _i } \parallel ,\,\Delta _{\mathrm{max}} \}=v_{\beta _i } s_{\beta _i } \parallel q_{\beta _i } \parallel \). On the other hand, using the Cauchy–Schwarz inequality, we have

$$\begin{aligned} 0&\le \frac{v_{\beta _i } }{2m_1 +1}\left( {\frac{g_{\beta _i }^T q_{\beta _i } }{\parallel q_{\beta _i } \parallel }} \right) ^{2}\le v_{\beta _i } \frac{\left( {g_{\beta _i }^T q_{\beta _i } } \right) ^{2}}{q_{\beta _i }^T \hat{B}_{\beta _i } q_{\beta _i } }=v_{\beta _i } s_{\beta _i } \left( {-g_{\beta _i }^T q_{\beta _i } } \right) \\&\le v_{\beta _i } s_{\beta _i } \parallel q_{\beta _i } \parallel \parallel \,g_{\beta _i } \parallel =\Delta _{\beta _i } \parallel \,g_{\beta _i } \parallel , \end{aligned}$$

where the second inequality is obtained from Remark 3.2. Now, Lemma 3.5 and (25) imply that

$$\begin{aligned} \mathop {\lim }\limits _{i\rightarrow \infty } \frac{-g_{\beta _i }^T q_{\beta _i } }{\parallel q_{\beta _i } \parallel }\ne 0. \end{aligned}$$

Therefore, the inequality

$$\begin{aligned} 0\le \frac{v_{\beta _i } }{2m_1 +1}\left( {\frac{g_{\beta _i }^T q_{\beta _i } }{\parallel q_{\beta _i } \parallel }} \right) ^{2}\le \parallel \,g_{\beta _i } \parallel \Delta _{\beta _i } , \end{aligned}$$

together with (25) and the fact that \(\nabla f(x)\) is continuous over the compact set \(\Omega \), by Assumption A1, imply that

$$\begin{aligned} \mathop {\lim }\limits _{i\rightarrow \infty } v_{\beta _i } =0. \end{aligned}$$

Now, we have:

$$\begin{aligned} \left| {\frac{f\left( {x_{\beta _i } } \right) -f\left( {x_{\beta _i } +d_{\beta _i } } \right) }{Pred_{\beta _i } }-1} \right|&= \left| {\frac{f\left( {x_{\beta _i } } \right) -f\left( {x_{\beta _i } +d_{\beta _i } } \right) -Pred_{\beta _i } }{Pred_{\beta _i } }} \right| \\&\le \frac{O\left( {\parallel d_{\beta _i } \parallel ^{2}} \right) }{\theta \mu _2 \parallel g_{\beta _i } \parallel \min \left\{ {\Delta _{\beta _i } ,\frac{\parallel g_{\beta _i } \parallel }{\parallel B_{\beta _i } \parallel }} \right\} } \\&= \frac{O\left( {\parallel d_{\beta _i } \parallel ^{2}} \right) }{O(\Delta _{\beta _i } )}\le \frac{O\left( {\Delta _{\beta _i }^{2}} \right) }{O\left( {\Delta _{\beta _i } } \right) }\mathop {\longrightarrow }\limits ^{i\rightarrow \infty } 0, \end{aligned}$$

where the first inequality is obtained from Lemma 3.1 and Remark 3.3. Therefore,

$$\begin{aligned} r_{\beta _i } =\frac{\left( {1+\varphi _{\beta _i } } \right) R_{\beta _i } -f\left( {x_{\beta _i } +d_{\beta _i } } \right) }{Pred_{\beta _i } }\ge \frac{f\left( {x_{\beta _i } } \right) -f\left( {x_{\beta _i } +d_{\beta _i } } \right) }{Pred_{\beta _i } }\rightarrow 1. \end{aligned}$$

Thus, there exists a positive constant \(v^{*}\), so that, for sufficiently large \(\beta _i \in I\), we have \(v_{\beta _i } \ge v^{*},\) which is a contradiction with (28). This completes the proof of the theorem. \(\square \)

4 Convergence rate analysis

In this section, we provide the superlinear convergence rate of Algorithm 2.1. The following theorem states the requirements under which the sequence generated by Algorithm 2.1 converges superlinearly.

Theorem 4.1

Let Assumptions A1 and A2 hold and \(d_k=-B_k ^{-1}g_k \) be a solution of the subproblem (2). Moreover, suppose that \(\left\{ {x_k } \right\} \) is the sequence generated by Algorithm 2.1, which converges to a stationary point \(x^{*}\). Also, assume that \(\nabla ^{2}f\left( {x^{*}} \right) \) is a positive definite matrix and \(\nabla ^{2}f\left( x \right) \) is Lipschitz continuous in a neighborhood of \(x^{*}\). If \(B_k \) satisfies the Dennis–Moré condition

$$\begin{aligned} \mathop {\lim }\limits _{k\rightarrow \infty } \frac{\parallel \left( {B_k -\nabla ^{2}f\left( {x^{*}} \right) } \right) d_k \parallel }{\parallel d_k \parallel }=0, \end{aligned}$$

then the sequence \(\left\{ {x_k } \right\} \) converges to the point \(x^{*}\) superlinearly.

Proof

The proof is similar to the proof of Theorem 4.1 in [6]. \(\square \)

5 Numerical results

In this section, we present and compare the computational results of applying two versions of Algorithm 2.1, based on the two nonmonotone terms given by (10) and (11), together with some other existing algorithms. For ease of reference, we refer to Algorithm 2.1 with the nonmonotone terms (10) and (11) as RNATR-Z and RNATR-A, respectively. All test problems are taken from Andrei [3] and Moré et al. [16]. In the considered TR methods, \(q_k \) can be chosen from a wide range of vectors; it only needs to satisfy (6). Two popular choices are \(q_k =-g_k \), which is a natural choice, and \(q_k =-B_k^{-1} g_k \), which has some interesting theoretical and practical properties, see e.g. [1, 23]. In order to compare the efficiency of the new proposed adaptive radius, we use both of the above choices of \(q_k \); however, in the numerical results reported in terms of the number of iterations and function evaluations, we focus on the case \(q_k =-g_k \) in order to save computational cost on large scale problems.

We have implemented algorithms RNATR-Z and RNATR-A, along with the following algorithms, in the MATLAB 7.10.0 (R2010a) environment and run all problems on a PC with a 2.0 GHz CPU and 2 GB of RAM, using double precision arithmetic:

NATSG::

NMATR method proposed in [1] with \(q_k =-g_k \).

NATSH::

NMATR method proposed in [1] with \(q_k =-B_k^{-1} g_k \).

NATFG::

Algorithm 2.1 with \(q_k =-g_k \) and the same nonmonotone term as NMATR;

NATFH::

Algorithm 2.1 with \(q_k =-B_k^{-1} g_k \) and the same nonmonotone term as NMATR;

NBTR::

Classical TR algorithm with the same nonmonotone term as NMATR;

NATR-Z::

Algorithm 2.1 with the nonmonotone term as proposed in [29] with \(\vartheta _0 =0.85\), \(\varphi _k =0\), and \(q_k =-g_k \);

NATR-A::

Algorithm 2.1 with the same nonmonotone term of NMTR-N1, as proposed in [2], \(\varphi _k =0\), and \(q_k =-g_k \);

The following parameters are set in the related algorithms:

$$\begin{aligned} \mu _1&= 0.9, \quad \mu _2 =0.1, \,\sigma _0 =\frac{1}{8}, \,\sigma _1 =6, \,\rho =0.5, \,v_0 =0.1, \,v_{\mathrm{max}} =10^{5},\\ \Delta _{\mathrm{max}}&= 100. \end{aligned}$$

For the problem of size \(n\), we set \(M=10\) whenever \(n\ge 5\), otherwise we set \(M=2n\). Moreover, we set \(\eta _0 =10^{4}\), and

$$\begin{aligned} \eta _k =\frac{10^{3}}{k^{2}}, \quad k=1,2,\ldots \end{aligned}$$

As proposed in [15], for the NBTR algorithm, we set the initial radius to \(\Delta _0 =0.1\parallel g_0 \parallel \), and the subsequent radii are updated according to the following formula:

$$\begin{aligned} \Delta _{k+1} =\left\{ {{\begin{array}{ll} \min \left\{ {c_1 \parallel d_k \parallel ,\,\Delta _k } \right\} &{}\quad \hbox {if } r_k \ge \mu _1 , \\ \Delta _k &{}\quad \hbox {if } \mu _2 \le r_k <\mu _1 , \\ c_0 \parallel d_k \parallel &{}\quad \hbox {otherwise}, \\ \end{array} }} \right. \end{aligned}$$

where \(c_0 =0.25\) and \(c_1 =2.5\). The matrix \(B_k \) is updated by the BFGS formula [17] as follows:

$$\begin{aligned} B_{k+1} =B_k +\frac{y_k y_k^T }{s_k^T y_k }-\frac{B_k s_k s_k^T B_k }{s_k^T B_k s_k }, \end{aligned}$$

where \(s_k =x_{k+1} -x_k \) and \(y_k =g_{k+1} -g_k \). Note that the matrix \(B_k \) is updated only as long as the curvature condition \(s_k^T y_k >0\) holds; otherwise we set \(B_{k+1} =B_k \). For a fair comparison, we implement all codes within the same framework and solve the trust region subproblems by the Steihaug–Toint procedure, see e.g. page 205 in [5].
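For completeness, a sketch of this safeguarded BFGS update (skipping the update when the curvature condition fails) is given below.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation; B is returned unchanged
    whenever the curvature condition s^T y > 0 fails."""
    sTy = s.dot(y)
    if sTy <= 0.0:                     # curvature condition violated
        return B
    Bs = B.dot(s)
    return B + np.outer(y, y) / sTy - np.outer(Bs, Bs) / s.dot(Bs)
```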

Numerical results are given in Tables 1 and S1 (see on-line supplementary material). In these tables, \(n\), \(n_i \) and \(n_f \) represent the problem size, the number of iterations and the number of function evaluations, respectively. For the results of Table 1, all the considered algorithms are stopped when \(\parallel g_k \parallel \le 10^{-8}\). Moreover, for the results of Table S1 (see on-line supplementary material), all the considered algorithms are stopped whenever \(\parallel g_k \parallel \le 10^{-6}\parallel g_0 \parallel \). We declare that an algorithm has failed whenever the number of iterations or the number of function evaluations exceeds 10,000 or 20,000, respectively. It is worth mentioning that the number of iterations and the number of gradient evaluations coincide in the considered algorithms. Moreover, we have only kept those problems in the tables for which all the considered algorithms converge to the same local solution. We have also utilized the performance profile of Dolan and Moré [9] to compare the algorithms. Figures 1 and 2 provide the \(\hbox {log}_2 \) scale performance profiles of the considered algorithms based on \(n_i \) and \(n_f \); the left and right figures are drawn in terms of \(n_i \) and \(n_f \), respectively. Note that, for every \(\tau \ge 1\), the performance profile gives the proportion \(P(\tau )\) of test problems on which a given algorithm performs within a factor \(\tau \) of the best; see [9] for a more complete discussion.
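The performance profiles of Figs. 1 and 2 can be reproduced from a solver-by-problem table of costs, as in the following sketch; encoding failures as infinity and the evaluation grid for \(\tau \) are implementation choices.

```python
import numpy as np

def performance_profile(T):
    """Dolan-More performance profile.  T[p, s] is the cost (e.g. n_i or n_f) of
    solver s on problem p, with np.inf marking a failure.  Returns a grid of tau
    values and P(tau) = fraction of problems solved within a factor tau of the best."""
    ratios = T / T.min(axis=1, keepdims=True)            # performance ratios r_{p,s}
    taus = np.linspace(1.0, ratios[np.isfinite(ratios)].max(), 200)
    P = np.array([[np.mean(ratios[:, s] <= t) for s in range(T.shape[1])] for t in taus])
    return taus, P    # plot np.log2(taus) against P[:, s] for each solver s

# Example: 3 problems, 2 solvers (the second solver fails on problem 2)
T = np.array([[10.0, 12.0], [20.0, np.inf], [15.0, 14.0]])
taus, P = performance_profile(T)
```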

Table 1 The effect of the new proposed adaptive radius
Fig. 1

Performance profile based on \(n_i\) (left) and \(n_f \) (right) for the results of Table 1

Fig. 2

Performance profile based on \(n_i\) (left) and \(n_f \) (right) for the results of Table S1 (see on-line supplementary material)

The main purpose of the results in Table 1 is to show the efficiency of the new proposed adaptive radius. It can easily be seen from Fig. 1 and Table 1 that the new adaptive radius works very well, especially in the sense that it solves all the considered test problems without any failure, while the other algorithms fail on some problems. In Table S1 (see on-line supplementary material), our aim is to show the efficiency of Algorithm 2.1, especially on large scale problems. As is clear from Fig. 2, one can easily see that, firstly, RNATR-A is the best performing algorithm among the considered algorithms, as it solves about 65 % of the test problems with the minimum number of \(n_i \) and \(n_f \). Secondly, the performance index of RNATR-A grows rapidly in comparison with the other considered algorithms; this means that, in the cases where RNATR-A is not the best algorithm, its performance index is close to that of the best algorithm. Moreover, by comparing the performance of NATR-Z and RNATR-Z, one can see that the relaxation technique results in a great increase in the number of test problems that are solved with the minimum number of \(n_i \) and \(n_f \). In addition, in contrast to NATR-Z, the RNATR-Z algorithm solves all the considered test problems without any failure. Analogous behavior can be seen in the performance of NATR-A and RNATR-A. Therefore, the new relaxation technique has a great influence on the performance of the new proposed algorithm. As an illustration of the effect of the proposed relaxation technique, a typical example is shown in Fig. 3, where the sequences \(\left\{ {f_k } \right\} \) are plotted against \(k\) for both the nonmonotone and the relaxed nonmonotone techniques.

Fig. 3

The value of Extended Rosenbrock function on successive iterations for nonmonotone adaptive trust region methods (NATR-Z and NATR-A), and relaxed nonmonotone adaptive trust region methods (RNATR-Z and RNATR-A) with dimension \(n=\,5{,}000\)

6 Conclusions

In this paper, a new relaxed nonmonotone adaptive trust region method for solving unconstrained optimization problems is presented. The new proposed algorithm incorporates Shi and Guo's adaptive trust region method, proposed in [23], with a new nonmonotone term inspired by the one proposed in [2]. Under some standard assumptions, the theoretical results show that the new algorithm inherits the global convergence property of standard TR methods. We have also established the superlinear convergence rate of the new proposed algorithm. Numerical results confirm that the new nonmonotone adaptive strategy provides a powerful tool that allows the algorithm to perform efficiently in practice, too.