1 Introduction

In this paper we consider the unconstrained minimization problem

$$\begin{aligned} \begin{array}{ll} \text {minimize} &{} \quad f(x)\\ \text {subject to} &{} \quad x\in {\mathbb {R}}^n, \end{array} \end{aligned}$$
(1)

where \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is a real-valued nonlinear function, which is bounded and continuously differentiable. We suppose that a first- or second-order black-box oracle of \(f\) is available.

1.1 Motivation and history

Trust-region methods, also called restricted step methods [22], are a class of iterative schemes developed to solve convex or nonconvex optimization problems, see, for example, [14]. They have also been developed for nonsmooth problems, see [14, 17, 24, 46]. Trust-region methods have strong convergence properties, are reliable and robust in computation, and can handle ill-conditioned problems, cf. [35, 36]. Let \(x_k\) be the current iterate. In the trust-region framework, the objective \(f\) is approximated by a simple model on a region around \(x_k\), called the trust region (region of trust), in which the model is expected to be an acceptable approximation of the original objective. Afterward, the model is minimized subject to the trust-region constraint to find a new trial step \(d_k\). Here, "simple" means that the model can be minimized much more easily than the original objective function. If the model is an adequate approximation of the objective function within the trust region, then the point \(x_{k+1}=x_k+d_k\) is accepted by the trust-region method and the region can be expanded for the next iteration; otherwise, the region is contracted and the model is minimized within the contracted region. This scheme is continued until a trial step \(d_k\) is found that guarantees an acceptable agreement between the model and the objective function.

Several quadratic and non-quadratic models have been proposed to approximate the objective function in optimization, see [15, 23, 37, 40]; however, the conic and quadratic models are the most popular, see [18, 37, 38]. If the approximating model is quadratic, i.e.,

$$\begin{aligned} q_k(d) := f_k + g_k^T d + \frac{1}{2} d^T B_k d, \end{aligned}$$
(2)

where \(f_k = f(x_k)\), \(g_k = \nabla f(x_k)\), and \(B_k \approx H(x_k)=\nabla ^2 f(x_k)\), the trust-region method can be considered as a globally convergent generalization of the classical Newton's method. The trust-region subproblem is then defined by

$$\begin{aligned} \begin{array}{ll} \mathrm {minimize} &{} \quad q_k(d),\\ \mathrm {subject~to} &{} \quad \Vert d\Vert \le \delta _k. \end{array} \end{aligned}$$
(3)

Hence the trust-region is commonly a norm ball \(C\) defined by

$$\begin{aligned} C := \big \{d\in {\mathbb {R}}^n \mid \Vert d\Vert \le \delta _k \big \}, \end{aligned}$$

where \(\delta _k > 0\) is a real number called the trust-region radius, and \(\Vert \cdot \Vert \) is any norm in \({\mathbb {R}}^n\), cf. [30]. Since \(C\) is compact and the model is continuous, the trust-region subproblem attains its minimizer on the set \(C\). Most of the computational cost of trust-region methods relates to minimizing the model over the trust region \(C\). Hence finding efficient schemes for solving (3) has received much attention during the past few decades, see [20, 21, 26, 32, 39]. Once the step \(d\) is computed, the quality of the model in the trust region is evaluated by the ratio of the actual reduction of the objective, \(f_k-f(x_k+d)\), to the predicted reduction of the model, \(q_k(0)-q_k(d)\), i.e.,

$$\begin{aligned} r_k = \frac{f_k-f(x_k+d)}{q_k(0)-q_k(d)}. \end{aligned}$$
(4)

For a prescribed positive constant \(\mu _1 \in (0, 1]\), if \(r_k \ge \mu _1\), the model provides a reasonable approximation, so the step is accepted, i.e., \(x_{k+1}=x_k+d_k\), and the trust region \(C\) can be expanded for the next step. Otherwise, the trust region \(C\) is contracted by decreasing the radius \(\delta _k\) and the subproblem (3) is solved in the reduced region. This scheme is continued until the step \(d\) is accepted by the trust-region test \(r_k \ge \mu _1\). Our discussion can be summarized in the following algorithm:

Algorithm 1 (The traditional trust-region method, TTR)
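Before moving on, a minimal sketch of Algorithm 1 may help fix ideas. The following Python code is an illustrative sketch, not the implementation used in the experiments of Sect. 4: the subproblem (3) is solved here by the simple Cauchy step, and the ratio and radius updates follow (4) and the rule reported later in Sect. 4.

```python
import numpy as np

def cauchy_step(g, B, delta):
    """Minimize the quadratic model along -g inside the ball ||d|| <= delta."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -(tau * delta / gnorm) * g

def monotone_trust_region(f, grad, hess, x0, delta0=1.0, mu1=0.05, mu2=0.9,
                          c1=0.25, c2=2.5, tol=1e-5, max_iter=1000):
    """Sketch of Algorithm 1: the traditional (monotone) trust-region method."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        while True:                                  # retry until the step is accepted
            d = cauchy_step(g, B, delta)             # cheap approximate solver for (3)
            pred = -(g @ d + 0.5 * d @ B @ d)        # predicted reduction q_k(0) - q_k(d)
            r = (f(x) - f(x + d)) / pred             # ratio (4)
            if r >= mu1:                             # acceptable agreement: accept the step
                break
            delta = c1 * np.linalg.norm(d)           # poor model: contract the region
        x = x + d
        if r >= mu2:                                 # very good agreement: expand the region
            delta = max(delta, c2 * np.linalg.norm(d))
    return x
```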

In Algorithm 1, it follows from \(r_k \ge \mu _1\) and \(q_k(0) - q_k(d_k) > 0\) that

$$\begin{aligned} f_k - f_{k+1} \ge \mu _1 \big (q_k(0) - q_k(d_k)\big ) > 0, \end{aligned}$$

implying \(f_{k+1} \le f_k\). This means that the sequence of function values \(\{f_k\}\) is monotonically decreasing; for this reason, the traditional trust-region method (TTR) is also called the monotone trust-region method. This feature seems natural for minimization schemes; however, it slows down the convergence of TTR to a minimizer if the objective involves a curved narrow valley, see [1, 28]. To observe this effect on TTR, we study the next example.

Example 1

Consider the two-dimensional Nesterov–Chebyshev–Rosenbrock function, cf. [29],

$$\begin{aligned} f(x_1,x_2) = \frac{1}{4}(x_1 - 1)^2 + \big (x_2 - 2x_1^2 + 1\big )^2, \end{aligned}$$

where we solve the problem (1) by Newton's method and TTR with the initial point \(x_0 = (-0.61, -1)\). It is clear that \((1, 1)\) is the minimizer. The implementation indicates that Newton's method needs 7 iterations and 8 function evaluations, while the monotone trust-region method needs 22 iterations and 24 function evaluations. We depict the contour plot of the objective and the iteration points, as well as a diagram of function values versus iterations attained by these two algorithms, in Fig. 1. Figure 1a shows that the iterates of TTR follow the bottom of the valley, in contrast to those of Newton's method, which can go up and down to reach the \(\varepsilon \)-solution with the accuracy parameter \(\varepsilon = 10^{-5}\). We see that Newton's method takes larger steps than TTR. Figure 1b illustrates function values versus iterations for both algorithms, showing that the function values of TTR decrease monotonically, while those of Newton's method fluctuate nonmonotonically.
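The following script is a sketch of the setting of Example 1: it implements the function above together with its gradient and Hessian and runs a plain (unsafeguarded) Newton iteration from \(x_0\); it is only illustrative and is not the code used to produce Fig. 1.

```python
import numpy as np

def ncr(x):
    """Two-dimensional Nesterov-Chebyshev-Rosenbrock function of Example 1."""
    x1, x2 = x
    return 0.25 * (x1 - 1.0) ** 2 + (x2 - 2.0 * x1 ** 2 + 1.0) ** 2

def ncr_grad(x):
    x1, x2 = x
    t = x2 - 2.0 * x1 ** 2 + 1.0
    return np.array([0.5 * (x1 - 1.0) - 8.0 * x1 * t, 2.0 * t])

def ncr_hess(x):
    x1, x2 = x
    t = x2 - 2.0 * x1 ** 2 + 1.0
    return np.array([[0.5 - 8.0 * t + 32.0 * x1 ** 2, -8.0 * x1],
                     [-8.0 * x1, 2.0]])

# Plain Newton iteration from the starting point of Example 1.
x = np.array([-0.61, -1.0])
for _ in range(50):
    g = ncr_grad(x)
    if np.linalg.norm(g) < 1e-5:
        break
    x = x - np.linalg.solve(ncr_hess(x), g)
print(x)   # expected to approach the minimizer (1, 1)
```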

Fig. 1

A comparison between Newton's method and TTR: a illustrates the contour plot of the two-dimensional Nesterov–Chebyshev–Rosenbrock function and the iterates of Newton's method and TTR; b shows the diagram of function values versus iterations

In general, monotonicity may result in slow iterative schemes for highly nonlinear or badly-scaled problems. To avoid this algorithmic limitation, the idea of nonmonotone strategies was proposed, which can be traced back to the watchdog technique for overcoming the Maratos effect in constrained optimization [13]. To improve the performance of Armijo's line search, Grippo et al. in 1986 [28] proposed the modified Armijo rule

$$\begin{aligned} f(x_k + \alpha _k d_k) \le f_{l(k)} + \sigma \alpha _k g_k^T d_k,~~ k= 0,1,2,\ldots , \end{aligned}$$

with the step-size \(\alpha _{k} >0\), \(\sigma \in (0, 1/2)\), and

$$\begin{aligned} f_{l(k)} = \max _{0 \le j \le m(k)} \big \{f_{k-j}\big \}, \end{aligned}$$
(5)

where \(m(0)=0\) and \(m(k)\le \min \{m(k-1)+1, N\} \) for a nonnegative integer \(N\). It was shown that the associated scheme is globally convergent, and numerical results reported in Grippo et al. [28] and Toint [41] showed the effectiveness of the proposed idea. Motivated by these results, nonmonotone strategies have received much attention during the past few decades. For example, in 2004, Zhang and Hager in [47] proposed the nonmonotone term

$$\begin{aligned} C_{k} = \left\{ \begin{array}{ll} f_0 &{} \quad \mathrm {if}~ k=0, \\ \big (\eta _{k-1}Q_{k-1}C_{k-1}+f(x_k)\big )/Q_k &{} \quad \mathrm {if}~ k \ge 1, \end{array} \right. \qquad Q_{k} = \left\{ \begin{array}{ll} 1 &{} \quad \mathrm {if}~ k=0, \\ \eta _{k-1}Q_{k-1}+1 &{} \quad \mathrm {if}~ k \ge 1, \end{array} \right. \end{aligned}$$
(6)

where \(0 \le \eta _{min} \le \eta _{k-1} \le \eta _{max} \le 1\). Recently, Mo et al. in [31] and Ahookhosh et al. in [3] studied the nonmonotone term

$$\begin{aligned} D_k = \left\{ \begin{array}{ll} f_k &{} \quad \mathrm {if}~ k = 1,\\ \eta _{k}D_{k-1} +(1-\eta _{k})f_k &{} \quad \mathrm {if}~ k \ge 2, \end{array} \right. \end{aligned}$$
(7)

where \(\eta _k \in [ \eta _{min} ,\eta _{max}] \), \( \eta _{min} \in [0,1] \), \( \eta _{max} \in [\eta _{min},1] \). More recently, Amini et al. in [8] proposed the nonmonotone term

$$\begin{aligned} R_k=\eta _k f_{l(k)}+(1-\eta _k)f_k, \end{aligned}$$
(8)

where \(0\le \eta _{min}\le \eta _{max}\le 1\) and \(\eta _{k}\in [\eta _{min},\eta _{max}]\). In all cases it was proved that the schemes are globally convergent and enjoy better performance than the monotone ones.
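To make the differences among the terms (5)-(8) concrete, the following sketch evaluates each of them on a hypothetical history of function values; the fixed weight \(\eta = 0.85\) and the history are illustrative assumptions only.

```python
def f_lk(fvals, N):
    """Term (5) of Grippo et al.: max of the last min(k, N) + 1 function values."""
    k = len(fvals) - 1
    m_k = min(k, N)
    return max(fvals[k - m_k:])

def zhang_hager(fvals, eta=0.85):
    """Term (6) of Zhang and Hager: C_k, a weighted average of f_0, ..., f_k."""
    C, Q = fvals[0], 1.0
    for fk in fvals[1:]:
        Q_new = eta * Q + 1.0
        C = (eta * Q * C + fk) / Q_new
        Q = Q_new
    return C

def mo_term(fvals, eta=0.85):
    """Term (7) of Mo et al. / Ahookhosh et al.: exponential smoothing of f_k."""
    D = fvals[0]
    for fk in fvals[1:]:
        D = eta * D + (1.0 - eta) * fk
    return D

def amini_term(fvals, N, eta=0.85):
    """Term (8) of Amini et al.: a convex combination of f_l(k) and f_k."""
    return eta * f_lk(fvals, N) + (1.0 - eta) * fvals[-1]

fvals = [5.0, 3.2, 3.5, 2.9, 3.1]     # hypothetical history f_0, ..., f_4
print(f_lk(fvals, 3), zhang_hager(fvals), mo_term(fvals), amini_term(fvals, 3))
```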

Just as for inexact line search techniques, the combination of trust-region methods with nonmonotone strategies is of interest. Historically, the first nonmonotone trust-region method was proposed in 1993 by Deng et al. in [16] for unconstrained optimization. Under some classical assumptions, the global convergence and the local superlinear convergence rate were established. Nonmonotone trust-region methods were also studied by several authors such as Toint [42], Xiao and Zhou [44], Xiao and Chu [45], Zhou and Xiao [48], Ahookhosh and Amini [2], Amini and Ahookhosh [7], and Mo et al. [31]. Recently, Ahookhosh and Amini in [1] and Ahookhosh et al. in [4] proposed two nonmonotone trust-region methods using the nonmonotone term (8). Theoretical results were reported, and numerical results showed the efficiency of the proposed nonmonotone methods.

1.2 Content

In this paper we propose a trust-region method equipped with two novel nonmonotone terms. More precisely, we first establish two nonmonotone terms and then combine them with Algorithm 1 to construct two nonmonotone trust-region algorithms. If \(k \ge N\), the new nonmonotone terms are defined by a convex combination of the last \(N\) successful function values, and if \(k < N\), either a convex combination of \(k\) successful function values or \(f_{l(k)}\) is used. The global convergence to first- and second-order stationary points is established under some classical assumptions. Moreover, local superlinear and quadratic convergence rates for the proposed methods are studied. Numerical results regarding experiments on some highly nonlinear problems and on 112 unconstrained test problems from the CUTEst test collection [25] are reported, indicating the efficiency of the proposed nonmonotone terms.

The remainder of the paper is organized as follows. In Sect. 2 we propose new nonmonotone terms and their combination with the trust-region framework. The global convergence of the proposed methods is given in Sect. 3. Numerical results are reported in Sect. 4. Finally, some conclusions are given in Sect. 5.

2 Novel nonmonotone terms and algorithm

In this section we first present two novel nonmonotone terms and then combine them with the trust-region framework to introduce two nonmonotone trust-region algorithms for solving the unconstrained optimization problem (1).

We first assume that \(k\) denotes the current iteration and \(N \in \mathbb {N}\) is a constant. The main idea is to construct a nonmonotone term determined by a convex combination of the last \(k\) successful function values if \(k < N\) and by a convex combination of the last \(N\) successful function values if \(k \ge N\). In other words, we construct new terms using function values collected in the set

$$\begin{aligned} \mathcal {F}_k := \left\{ \begin{array}{ll} \{f_0,f_1, \ldots , f_k\} &{} \quad \mathrm {if}~ k < N,\\ \{f_{k-N+1},f_{k-N+2}, \ldots , f_k\} &{} \quad \mathrm {if}~ k \ge N, \end{array} \right. \end{aligned}$$
(9)

which should be updated in each iteration. To this end, motivated by the term (7), we construct \(\overline{T}_k\) using the following procedure

$$\begin{aligned} \left\{ \begin{array}{ll} \overline{T}_0 = f_0 &{} \quad \mathrm {if}~k = 0,\\ \overline{T}_1 = \big (1-\eta _{0}\big )f_1 + \eta _{0} f_0 &{} \quad \mathrm {if}~k = 1,\\ \overline{T}_2 = \big (1-\eta _{1}\big )f_2 + \eta _{1}\big (1-\eta _{0}\big ) f_1 + \eta _{1}\eta _{0} f_0 &{} \quad \mathrm {if}~k = 2,\\ \quad \vdots &{} \quad \vdots \\ \overline{T}_{N-1} = \big (1-\eta _{N - 2}\big )f_{N - 1} + \eta _{N - 2}\big (1-\eta _{N - 3}\big ) f_{N-2} + \cdots + \eta _{N - 2} \cdots \eta _{0} f_0 &{} \quad \mathrm {if}~k = N - 1,\\ \overline{T}_N = \big (1-\eta _{N - 1}\big ) f_N + \eta _{N - 1}\big (1-\eta _{N - 2}\big ) f_{N-1} + \cdots + \eta _{N - 1} \cdots \eta _{0} f_0 &{} \quad \mathrm {if}~k = N, \end{array} \right. \end{aligned}$$

where \(\eta _i \in [0,1)\), for \(i = 0, 1, \ldots , N - 1\), are some weight parameters. Hence the new term is generated by

$$\begin{aligned} \overline{T}_k := \left\{ \begin{array}{ll} \big (1-\eta _{k-1}\big ) f_k + \eta _{k-1} \overline{T}_{k-1} &{} \quad \mathrm {if}~ k < N,\\ \big (1-\eta _{k-1}\big ) f_k + \eta _{k-1}\big (1-\eta _{k-2}\big ) f_{k-1} + \cdots + \eta _{k-1} \cdots \eta _{k-N} f_{k-N} &{} \quad \mathrm {if}~ k \ge N, \end{array} \right. \end{aligned}$$
(10)

where \( \overline{T}_0 = f_0\) and \(\eta _i \in [0,1)\) for \(i = 0, 1, \ldots , k - 1\). To show that \(\overline{T}_k\) is a convex combination of the collected function values \(\mathcal {F}_k\), it is enough to show that the multipliers sum to one. For \(k \ge N\), the definition of \(\overline{T}_k\) implies

$$\begin{aligned} \big (1-\eta _{k - 1}\big ) + \eta _{k - 1}\big (1-\eta _{k - 2}\big ) + \cdots + \eta _{k - 1} \cdots \eta _{k - N + 1} \big (1-\eta _{k - N}\big ) + \eta _{k - 1} \cdots \eta _{k - N} = 1. \end{aligned}$$
(11)

For \(k < N\), a similar argument shows that the corresponding multipliers also sum to one. Therefore, the generated term \(\overline{T}_k\) is a convex combination of the elements of \(\mathcal {F}_k\).
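The following sketch computes \(\overline{T}_k\) directly from this convex combination and checks that the multipliers sum to one; the value history and weights in the usage lines are hypothetical. This direct version deliberately stores the whole history, which the recursion derived next avoids.

```python
def T_bar_direct(fvals, etas, N):
    """Compute T_bar_k of (10) directly from the convex combination.

    fvals = [f_0, ..., f_k]; etas = [eta_0, ..., eta_{k-1}], each in [0, 1).
    """
    k = len(fvals) - 1
    depth = k if k < N else N                  # how far back the combination reaches
    weights, prod = [], 1.0
    for j in range(depth):                     # weight of f_{k-j}
        weights.append(prod * (1.0 - etas[k - j - 1]))
        prod *= etas[k - j - 1]
    weights.append(prod)                       # weight of the oldest value f_{k-depth}
    assert abs(sum(weights) - 1.0) < 1e-12     # the multipliers sum to one, cf. (11)
    return sum(w * fvals[k - j] for j, w in enumerate(weights))

etas = [0.3, 0.5, 0.7, 0.4]                    # hypothetical weights eta_0, ..., eta_3
fvals = [5.0, 3.2, 3.5, 2.9, 3.1]              # hypothetical values f_0, ..., f_4
print(T_bar_direct(fvals, etas, N=3))
```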

The procedure defining \(\overline{T}_k\) clearly implies that the set \(\mathcal {F}_k\) should be updated and saved in each iteration. Moreover, \(N(N + 1)/2\) multiplications are required to compute \( \overline{T}_k\). To avoid saving \(\mathcal {F}_k\) and to decrease the number of multiplications, we derive a recursive formula for (10). From the definition of \( \overline{T}_k\), for \(k \ge N\), it follows that

$$\begin{aligned} \overline{T}_k - \eta _{k - 1} \overline{T}_{k - 1}&= \big (1-\eta _{k - 1}\big )~f_k + \eta _{k - 1}\big (1-\eta _{k - 2}\big ) ~ f_{k - 1} + \cdots + \eta _{k - 1} \cdots \eta _{k - N} ~ f_{k - N}\\&\quad - \eta _{k - 1} \big (1-\eta _{k - 2}\big )~f_{k - 1} - \cdots - \eta _{k - 1} \cdots \big (1-\eta _{k - N -1}\big )~ f_{k - N}\\&\quad - \eta _{k - 1} \eta _{k - 2} \cdots \eta _{k - N - 1} ~ f_{k - N - 1}\\&= \big (1-\eta _{k - 1}\big )~f_k + \eta _{k - 1} \eta _{k - 2} \cdots \eta _{k - N - 1} ~ \big (f_{k - N} - f_{k - N - 1}\big )\\&= \big (1-\eta _{k - 1}\big )~f_k + \xi _k ~ \big (f_{k - N} - f_{k - N - 1}\big ) \end{aligned}$$

where \(\xi _k : = \eta _{k - 1} \eta _{k - 2} \cdots \eta _{k-N-1}\). For \(k \ge N\), this equation leads to

$$\begin{aligned} \overline{T}_k = \big (1-\eta _{k - 1}\big )~f_k + \eta _{k - 1} \overline{T}_{k - 1} +\xi _k~ \big (f_{k - N} - f_{k - N -1}\big ), \end{aligned}$$
(12)

which requires saving only \(f_{k-N}\) and \(f_{k-N-1}\) and needs only three multiplications. Moreover, the definition of \(\xi _k\) implies

$$\begin{aligned} \xi _k = \eta _{k - 1} \eta _{k - 2} \cdots \eta _{k - N - 1} = \frac{\eta _{k - 1}}{\eta _{k - N - 2}} \eta _{k - 2} \eta _{k - 3} \cdots \eta _{k - N - 2} = \frac{\eta _{k - 1}}{\eta _{k - N - 2}} \xi _{k-1} . \end{aligned}$$
(13)

Using (10) and (12), with \(\xi _k\) recursively updated by (13), a new nonmonotone term is defined by

$$\begin{aligned} T_k : = \left\{ \begin{array}{ll} \!\!f_k + \eta _{k - 1} \big ( \overline{T}_k - f_k \big ) &{}\quad \mathrm {if}~ k < N,\\ \!\!\max \left\{ \overline{T}_{k} , f_k \right\} &{}\quad \mathrm {if}~ k \ge N, \end{array} \right. \end{aligned}$$
(14)

where the max term is added to guarantee \(T_k \ge f_k\).
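One possible way to organize the resulting update is sketched below (illustrative code, not the authors' implementation): it keeps the full history only for readability, and it computes \(\xi _k\) by a direct product instead of the quotient formula (13) to avoid dividing by a weight that may be zero.

```python
class NonmonotoneTerm:
    """Sketch of the first nonmonotone term T_k of (14), using recursion (12).

    The full history is stored only for readability; as noted in the text, once
    k > N the update needs only f_{k-N}, f_{k-N-1}, and xi_k.
    """

    def __init__(self, f0, N):
        self.N, self.k = N, 0
        self.T_bar = f0
        self.fvals, self.etas = [f0], []

    def update(self, fk, eta):
        """Push the new value f_k with weight eta_{k-1} in [0, 1); return T_k."""
        self.k += 1
        self.fvals.append(fk)
        self.etas.append(eta)
        k, N = self.k, self.N
        if k <= N:
            # plain recursion of (10); at k = N it coincides with the long formula
            self.T_bar = (1.0 - eta) * fk + eta * self.T_bar
        else:
            # recursion (12); xi_k = eta_{k-1} * ... * eta_{k-N-1} is computed as a
            # direct product here instead of the quotient formula (13)
            xi = 1.0
            for e in self.etas[k - N - 1:]:
                xi *= e
            diff = self.fvals[k - N] - self.fvals[k - N - 1]
            self.T_bar = (1.0 - eta) * fk + eta * self.T_bar + xi * diff
        if k < N:
            return fk + eta * (self.T_bar - fk)   # first branch of (14)
        return max(self.T_bar, fk)                # second branch of (14)
```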

As discussed in Sect. 1, nonmonotone schemes perform better when they use stronger nonmonotone terms far away from the optimizer and weaker ones close to it. This motivates us to consider a second version of the derived nonmonotone term by using \(f_{l(k)}\) whenever \(k < N\). More precisely, the second nonmonotone term is defined by

$$\begin{aligned} \!T_k = \left\{ \begin{array}{ll} \!f_{l(k)} &{}\quad \mathrm {if}~ k < N,\\ \max \left\{ \overline{T}_{k}, f_k \right\} &{}\quad \mathrm {if}~ k \ge N, \end{array} \right. \end{aligned}$$
(15)

where \(\xi _k\) is defined by (13). It is clear that the new term uses the stronger term \(f_{l(k)}\) defined by (5) for the first \(N\) iterations (\(k < N\)) and then employs the relaxed convex term proposed above.
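In terms of the helper functions sketched earlier (f_lk for (5) and T_bar_direct for (10)), this second term could be written, purely for illustration, as:

```python
def second_term(fvals, etas, N):
    """Sketch of the second nonmonotone term (15)."""
    k = len(fvals) - 1
    if k < N:
        return f_lk(fvals, N)                             # strong term (5) early on
    return max(T_bar_direct(fvals, etas, N), fvals[-1])   # relaxed convex term afterwards
```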

Now, to employ the proposed nonmonotone terms in the trust-region framework, it is enough to replace the ratio \(r_k\) in (4) by the nonmonotone ratio

$$\begin{aligned} \widehat{r}_k = \frac{T_k-f(x_k+d)}{q_k(0)-q_k(d)}, \end{aligned}$$
(16)

where \(T_k\) is defined by (14) or (15). Hence, in the trust-region framework, we replace (4) by (16). Notice that if \(\widehat{r}_k \ge \mu _1\), then

$$\begin{aligned} T_k-f_{k+1} \ge \mu _1\big (q_k(0)-q_k(d_k)\big ) \ge 0. \end{aligned}$$

This implies that \(f_{k+1}\) can be larger than \(f_k\); however, the elements of \(\{f_k\}\) cannot increase arbitrarily, since the maximum increase is controlled by the nonmonotone term \(T_k\). Moreover, the definitions (14) and (15) imply \(\widehat{r}_k \ge r_k\), increasing the possibility of accepting larger steps in the nonmonotone schemes compared with the monotone one.

The above-mentioned discussion leads to the following nonmonotone trust-region algorithm:

Algorithm 2 (The nonmonotone trust-region method)

In Algorithm 2, an iteration with \(\widehat{r}_k \ge \mu _1\) (Line 7) is called a successful iteration, and an iteration with \(\widehat{r}_k \ge \mu _2\) (Line 14) is called a very successful iteration. In addition, the loop from Line 3 to Line 20 is called the outer cycle, and the loop from Line 7 to Line 12 is called the inner cycle.
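Combining the pieces, a minimal sketch of Algorithm 2 could look as follows; it reuses cauchy_step from the Algorithm 1 sketch and the NonmonotoneTerm class above, and the fixed value of \(\eta _k\) used here is only a placeholder (the rule actually used in the experiments is given in Sect. 4).

```python
import numpy as np

def nonmonotone_trust_region(f, grad, hess, x0, N=10, eta=0.25, delta0=1.0,
                             mu1=0.05, mu2=0.9, c1=0.25, c2=2.5,
                             tol=1e-5, max_iter=1000):
    """Sketch of Algorithm 2: Algorithm 1 with the ratio (4) replaced by (16)."""
    x = np.asarray(x0, dtype=float)
    delta = delta0
    f0 = f(x)
    term = NonmonotoneTerm(f0, N)              # T_0 = f_0
    T_k = f0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        while True:                            # inner cycle
            d = cauchy_step(g, B, delta)       # any solver of (3) satisfying (17)
            pred = -(g @ d + 0.5 * d @ B @ d)
            r_hat = (T_k - f(x + d)) / pred    # nonmonotone ratio (16)
            if r_hat >= mu1:                   # successful iteration
                break
            delta = c1 * np.linalg.norm(d)
        x = x + d
        if r_hat >= mu2:                       # very successful iteration
            delta = max(delta, c2 * np.linalg.norm(d))
        # a fixed eta is used here for simplicity; Sect. 4 describes an adaptive rule
        T_k = term.update(f(x), eta)
    return x
```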

3 Convergence analysis

This section concerns the global convergence to first- and second-order stationary points of the sequence \(\{x_k\}\) generated by Algorithm 2. More precisely, we intend to prove that every limit point \(x^*\) of the sequence \(\{x_k\}\) satisfies the condition \(g(x^*)=0\), and that there exists a limit point \(x^*\) satisfying \(g(x^*)=0\) at which \(H(x^*)\) is positive semidefinite. Furthermore, we show that Algorithm 2 is well-defined, which means that the inner cycle of the algorithm is left after a finite number of internal iterations, and then prove its global convergence. Moreover, local superlinear and quadratic convergence rates are investigated under some classical assumptions.

To prove the global convergence of the sequence \(\{x_k\}\) generated by Algorithm 2, we require the following assumptions:

(H1) The objective function \(f\) is continuously differentiable and has a lower bound on the level set \(L(x_0) = \{x \in {\mathbb {R}}^n \mid f(x) \le f(x_0)\}\).

(H2) The sequence \(\{B_k\}\) is uniformly bounded, i.e., there exists a constant \(M>0\) such that

$$\begin{aligned} \Vert B_k\Vert \le M, \end{aligned}$$

for all \(k \in \mathbb {N}\).

(H3) There exists a constant \(c>0\) such that the trial step \(d_k\) satisfies \(\Vert d_k\Vert \le c\Vert g_k\Vert \), for all \(k \in \mathbb {N}\).

We also assume that the decrease in the model \(q_k\) is at least a fraction of the decrease obtained by the Cauchy point, i.e., there exists a constant \(\beta \in (0,1)\) such that

$$\begin{aligned} q_k(0)-q_k(d_k)\ge \beta \Vert g_k\Vert ~ \min \left\{ \delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert } \right\} , \end{aligned}$$
(17)

for all \(k\). This condition is called the sufficient reduction condition. Inequality (17) implies that \(d_k \ne 0\) whenever \(g_k \ne 0\). Note that there are several schemes for solving the trust-region subproblem (3) such that (17) holds, see, for example, [14, 33].
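As a quick numerical sanity check (an illustrative experiment, not part of the analysis), the Cauchy step from the Algorithm 1 sketch above satisfies (17) with \(\beta = 1/2\) on randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5                          # the Cauchy step satisfies (17) with beta = 1/2
for _ in range(1000):
    A = rng.standard_normal((5, 5))
    B = 0.5 * (A + A.T)             # symmetric, possibly indefinite, model Hessian
    g = rng.standard_normal(5)
    delta = rng.uniform(0.1, 2.0)
    d = cauchy_step(g, B, delta)    # from the Algorithm 1 sketch above
    decrease = -(g @ d + 0.5 * d @ B @ d)                    # q_k(0) - q_k(d)
    bound = beta * np.linalg.norm(g) * min(delta,
                                           np.linalg.norm(g) / np.linalg.norm(B, 2))
    assert decrease >= bound - 1e-10
```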

Lemma 2

Suppose that the sequence \(\{x_k\}\) is generated by Algorithm 2, then

$$\begin{aligned} \big |f_k-f(x_k+d_k)-\big (q_k(0)-q_k(d_k)\big )\big |\le O\big (\Vert d_k\Vert ^2\big ). \end{aligned}$$

Proof

The proof can be found in [13]. \(\square \)

Lemma 3

Suppose that the sequence \(\{x_{k}\}\) is generated by Algorithm 2, then we get

$$\begin{aligned} f_{k}\le T_{k}\le f_{l(k)}, \end{aligned}$$
(18)

for all \(k \in \mathbb {N} \cup \{0\}\).

Proof

For \(k < N\), we consider two cases: (i) \(T_k\) is defined by (14); (ii) \(T_k\) is defined by (15). In Case (i), Lemma 2.1 in [3], the fact that \(f_i \le f_{l(k)}\) for \(i = 0, 1, \ldots , k\), and the fact that the multipliers in \(T_k\) sum to one give the result. Case (ii) is evident from (15).

For \(k \ge N\), if \(T_k = f_k\), the result is evident. Otherwise, since

$$\begin{aligned} \big (1-\eta _{k - 1}\big ) + \eta _{k - 1}\big (1-\eta _{k - 2}\big ) + \cdots + \eta _{k - 1} \cdots \eta _{k - N + 1} \big (1-\eta _{k - N}\big ) + \eta _{k - 1} \cdots \eta _{k - N} = 1, \end{aligned}$$
(19)

the fact that \(f_i \le f_{l(k)}\), for \(i = k - N, \ldots , k\), and (10) imply

$$\begin{aligned} f_k \le T_k&= \big (1-\eta _{k - 1}\big )~f_k + \eta _{k - 1} \big (1-\eta _{k - 2}\big ) ~ f_{k-1} + \cdots + \eta _{k - 1} \cdots \eta _{k - N} ~ f_{k - N}\\&\le \big [\big (1-\eta _{k - 1}\big ) + \eta _{k - 1} \big (1-\eta _{k - 2}\big ) + \cdots + \eta _{k - 1} \cdots \eta _{k - N}\big ] f_{l(k)} = f_{l(k)}, \end{aligned}$$

giving the result. \(\square \)

Lemma 4

Suppose that the sequence \(\{x_k\}\) is generated by Algorithm 2, then the sequence \(\{f_{l(k)}\}\) is decreasing.

Proof

The condition (18) implies that \(T_k \le f_{l(k)}\). If \(x_{k+1}\) is accepted by Algorithm 2, then

$$\begin{aligned} \frac{f_{l(k)}-f\big (x_k+d_k\big )}{q_k(0)-q_k(d_k)}\ge \frac{T_k-f\big (x_k+d_k\big )}{q_k(0)-q_k(d_k)}\ge \mu _1, \end{aligned}$$

leading to

$$\begin{aligned} f_{l(k)}-f\big (x_k+d_k\big )\ge \mu _1 \big (q_k(0)-q_k(d_k)\big ) \ge 0, \quad \mathrm {for~all} ~k \in \mathbb {N}, \end{aligned}$$

implying

$$\begin{aligned} f_{l(k)}\ge f_{k+1}, \quad \mathrm {for~all} ~k \in \mathbb {N}. \end{aligned}$$
(20)

Now, if \(k\ge N\), by using \(m(k+1) \le m(k)+1\) and (20), we get

$$\begin{aligned} f_{l(k+1)}= & {} \max _{0\le j\le m(k+1)}\big \{f_{k-j+1}\big \} \le \max _{0\le j\le m(k)+1}\big \{f_{k-j+1}\big \}\\= & {} \max \big \{f_{l(k)},f_{k+1}\big \}\le f_{l(k)}. \end{aligned}$$

For \(k<N\), it is obvious that \(m(k)=k\). Since, for any \(k\), \(f_k\le f_0\), it is clear that \(f_{l(k)}=f_0\). Therefore, in both cases, the sequence \(\{f_{l(k)}\}\) is decreasing. \(\square \)

Lemma 5

Suppose that (H1) holds and the sequence \(\{x_k\}\) is generated by Algorithm 2, then \(L(x_0)\) contains \(\{x_k\}\).

Proof

The definition of \(T_k\) indicates that \(T_0=f_0\). By induction, we assume that \(x_i \in L(x_0)\), for all \(i=1,2,\ldots ,k\), and then prove that \(x_{k+1} \in L(x_0)\). From (18), we get

$$\begin{aligned} f_{k+1} \le T_{k+1} \le f_{l(k+1)} \le f_{l(k)} \le f_0, \end{aligned}$$

implying that \(L(x_0)\) contains the sequence \(\{x_k\}\). \(\square \)

Corollary 6

Suppose that (H1) holds and the sequence \(\{x_k\}\) is generated by Algorithm 2. Then the sequence \(\{f_{l(k)}\}\) is convergent.

Proof

The assumption (H1) and Lemma 4 imply that there exists a constant \(\lambda \) such that

$$\begin{aligned} \lambda \le f_{k+n}\le f_{l(k+n)}\le \cdots \le f_{l(k+1)}\le f_{l(k)}, \end{aligned}$$

for all \(n \in \mathbb {N}\). This implies that the sequence \(\big \{f_{l(k)}\big \}\) is convergent. \(\square \)

Lemma 7

Suppose that (H1)–(H3) hold and the sequence \(\{x_k\}\) is generated by Algorithm 2, then

$$\begin{aligned} \lim _{k\rightarrow \infty }f_{l(k)}=\lim _{k\rightarrow \infty }f_k. \end{aligned}$$
(21)

Proof

The condition (18) and Lemma 7 of [1] imply that the result is valid. \(\square \)

Corollary 8

Suppose (H1)–(H3) hold and the sequence \(\{x_k\}\) is generated by Algorithm 2, then we have

$$\begin{aligned} \lim _{k\rightarrow \infty }T_k=\lim _{k\rightarrow \infty }f_k. \end{aligned}$$
(22)

Proof

From (18) and Lemma 7, the result is obtained. \(\square \)

Lemma 9

Suppose that (H1) and (H2) hold, and the sequence \(\{x_k\}\) is generated by Algorithm 2. Then if \(\Vert g_k\Vert \ge \varepsilon >0\), we have

  1. (i)

    The inner cycle is left after a finite number of inner iterations and Algorithm 2 is well-defined;

  2. (ii)

    For any \(k\), there exists a nonnegative integer \(p\) such that \(x_{k+p+1}\) is a very successful iteration.

Proof

(i) Let \(t_k\) denote the internal iteration counter at step \(k\), and let \(d_k^{t_k}\) and \(\delta _k^{t_k}\) denote, respectively, the solution of the subproblem (3) and the corresponding trust-region radius at the internal iteration \(t_k\). The fact that \(\Vert g_k\Vert \ge \varepsilon >0\), (H2), and (17) imply

$$\begin{aligned} q_k(0)-q_k\big (d_k^{t_k}\big )\ge \beta \Vert g_k\Vert ~\min \left\{ \delta _k^{t_k}, \frac{\Vert g_k\Vert }{\Vert B_k\Vert } \right\} \ge \beta \varepsilon ~ \min \left\{ \delta _k^{t_k}, \frac{\varepsilon }{M} \right\} . \end{aligned}$$
(23)

Then Line 8 of Algorithm 2 implies

$$\begin{aligned} \lim _{t_k \rightarrow \infty } \delta _k^{t_k}=0. \end{aligned}$$

From this, Lemma 2, and (23), we obtain

$$\begin{aligned} |r_k-1|&=\left| \frac{f_k-f\big (x_k+d_k^{t_k}\big )}{q_k(0)-q_k\big (d_k^{t_k}\big )}-1\right| =\left| \frac{f_k-f\big (x_k+d_k^{t_k}\big )-\big (q_k(0)-q_k\big (d_k^{t_k}\big )\big )}{q_k(0)-q_k\big (d_k^{t_k}\big )}\right| \\&\le \frac{O\big (\Vert d_k^{t_k}\Vert ^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _k^{t_k}, \varepsilon /M \right\} }\le \frac{O\big (\big (\delta _k^{t_k}\big )^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _k^{t_k}, \varepsilon /M \right\} }\rightarrow 0 ~~~ (t_k\rightarrow \infty ), \end{aligned}$$

implying that there exists a positive integer \(t_0\) such that \(r_k \ge \mu _1\) whenever \(t_k \ge t_0\). This and (18) lead to

$$\begin{aligned} \widehat{r}_k = \frac{T_k-f\big (x_k+d_k^{t_k}\big )}{q_k(0)-q_k\big (d_k^{t_k}\big )} \ge \frac{f_k-f\big (x_k+d_k^{t_k}\big )}{q_k(0)-q_k\big (d_k^{t_k}\big )} \ge \mu _1, \end{aligned}$$

implying the inner cycle is left after a finite number of inner iterations, i.e., Algorithm 2 is well-defined.

(ii) Assume, by contradiction, that there exists a positive integer \(k\) such that, for every nonnegative integer \(p\), the point \(x_{k+p+1}\) is not very successful. Hence, for any \(p=0,1,2,\ldots \), we get

$$\begin{aligned} \widehat{r}_{k+p}<\mu _2. \end{aligned}$$

The fact that \(\Vert g_k\Vert \ge \varepsilon >0\), (H2), and (17) imply

$$\begin{aligned} T_{k+p} - f\big (x_{k+p} + d_{k+p}\big )&\ge \mu _1 \big (q_{k+p}(0)-q_{k+p}\big (d_{k+p}\big )\big )\nonumber \\&\ge \beta \mu _1 \Vert g_{k+p}\Vert ~\min \left\{ \delta _{k+p}, \frac{\Vert g_{k+p}\Vert }{\Vert B_{k+p}\Vert } \right\} \nonumber \\&\ge \beta \mu _1 \varepsilon ~ \min \left\{ \delta _{k+p}, \frac{\varepsilon }{M} \right\} . \end{aligned}$$
(24)

By using (22) and (24), we can write

$$\begin{aligned} \lim _{p \rightarrow \infty } \delta _{k+p}=0. \end{aligned}$$
(25)

From Lemma 2, (25), and (23), we obtain

$$\begin{aligned} |r_{k+p}-1|&=\left| \frac{f\big (x_{k+p}\big )-f\big (x_{k+p}+d_{k+p}\big )}{q_{k+p}(0)-q_{k+p}\big (d_{k+p}\big )}-1\right| \\&=\left| \frac{f\big (x_{k+p}\big )-f\big (x_{k+p}+d_{k+p}\big )-\big (q_{k+p}(0)-q_{k+p} \big (d_{k+p}\big )\big )}{q_{k+p}(0)-q_{k+p}\big (d_{k+p}\big )}\right| \\&\le \frac{O\big (\Vert d_{k+p}\Vert ^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _{k+p}, \varepsilon /M \right\} }\le \frac{O\big (\delta _{k+p}^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _{k+p}, \varepsilon /M \right\} }\rightarrow 0 ~~~ (p\rightarrow \infty ). \end{aligned}$$

Then, for a sufficiently large \(p\), we get \(r_{k+p} \ge \mu _2\) leading to

$$\begin{aligned} \frac{T_{k+p}-f\big (x_{k+p}+d_{k+p}\big )}{q_{k+p}(0)-q_{k+p}\big (d_{k+p}\big )} \ge \frac{f\big (x_{k+p}\big )-f\big (x_{k+p}+d_{k+p}\big )}{q_{k+p}(0)-q_{k+p}\big (d_{k+p}\big )} \ge \mu _2. \end{aligned}$$

implying \(\widehat{r}_{k+p}\ge \mu _2\) for a sufficiently large \(p\). This contradicts the assumption \(\widehat{r}_{k+p}<\mu _2\), giving the result. \(\square \)

Lemma 9(i) implies that the inner cycle is left after a finite number of internal iterations, and Lemma 9(ii) implies that if the current iterate is not a first-order stationary point, then there exists a subsequent very successful iteration, i.e., the trust-region radius \(\delta _k\) can be enlarged. The next result gives the global convergence of the sequence \(\{x_k\}\) of Algorithm 2.

Theorem 10

Suppose that (H1) and (H2) hold, and suppose the sequence \(\{x_k\}\) is generated by Algorithm 2. Then

$$\begin{aligned} \liminf _{k \rightarrow \infty } \Vert g_k\Vert =0. \end{aligned}$$
(26)

Proof

We consider two cases: (i) Algorithm 2 has finitely many very successful iterations; (ii) Algorithm 2 has infinitely many very successful iterations.

In Case (i), let \(k_0\) be the largest index of a very successful iteration. If \(\Vert g_{k_0+1}\Vert >0\), then Lemma 9(ii) implies that there exists a very successful iteration with a larger index than \(k_0\). This contradicts the definition of \(k_0\).

In Case (ii), by contradiction, we assume that there exist constants \(\varepsilon >0\) and \(K>0\) such that

$$\begin{aligned} \Vert g_k\Vert \ge \varepsilon , \end{aligned}$$
(27)

for all \(k\ge K\). If \(x_{k+1}\) is a successful iteration and \(k \ge K\), then by using (H2), (17), and (27), we get

$$\begin{aligned} T_k-f\big (x_k+d_k\big )&\ge \mu _1\big (q_k(0)-q_k(d_k)\big )\nonumber \\&\ge \beta \mu _1 \Vert g_k\Vert ~ \min \left\{ \delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\right\} \ge \beta \mu _1 \varepsilon ~ \min \left\{ \delta _k, \frac{\varepsilon }{M} \right\} \ge 0. \end{aligned}$$
(28)

It follows from this inequality and (22) that

$$\begin{aligned} \lim _{k \rightarrow \infty } \delta _k=0. \end{aligned}$$
(29)

Then (H2), (17), and (27) imply

$$\begin{aligned} |r_k-1|&=\left| \frac{f_k-f\big (x_k+d_k\big )}{q_k(0) -q_k(d_k)}-1\right| =\left| \frac{f_k-f\big (x_k+d_k\big ) -\big (q_k(0)-q_k(d_k)\big )}{q_k(0)-q_k(d_k)}\right| \\&\le \frac{O\big (\Vert d_k\Vert ^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _k,\varepsilon /M \right\} } \le \frac{O\big (\delta _k^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _k,\varepsilon /M \right\} } \rightarrow 0 ~~~ (k\rightarrow \infty ). \end{aligned}$$

This clearly implies that there exists a positive integer \(k_1\) such that for \(k \ge k_1\) we have

$$\begin{aligned} \widehat{r}_k = \frac{T_k-f\big (x_k+d_k\big )}{q_k(0)-q_k(d_k)} \ge \frac{f_k-f\big (x_k+d_k\big )}{q_k(0)-q_k(d_k)} \ge \mu _1, \end{aligned}$$

implying \(x_{k+1} = x_k + d_k\) for \(k \ge k_1\), i.e., the trust-region radius remains fixed (\(\mu _1 \le \widehat{r}_k < \mu _2\)) or is increased (\(\widehat{r}_k \ge \mu _2\)). Now we construct a subsequence of very successful iterations for \(k \ge k_1\). From Lemma 9(ii) there exists an integer \(p_1\) such that \(x_{k_1+p_1}\) is very successful. Then we set \(i_1:=k_1+p_1\). Now, from Lemma 9(ii), there exists an integer \(p_2\) such that \(x_{i_1+p_2}\) is very successful, and so we set \(i_2:=i_1+p_2\). By continuing this procedure we get the index set

$$\begin{aligned} \mathcal {I} := \{i_1, i_2, i_3:= i_2+p_3, \cdots \} \end{aligned}$$

such that \(\{x_{\mathcal {I}}\}\) is the subsequence of very successful iterations in which the trust-region radius is increased. Hence, for \(k \ge k_1\), the trust-region radius is fixed or increased in each iteration, which contradicts (29). This implies that the result is valid. \(\square \)

Theorem 11

Suppose that (H1) and (H2) hold, and the sequence \(\{x_k\}\) is generated by Algorithm 2. Then

$$\begin{aligned} \lim _{k \rightarrow \infty } \Vert g_k\Vert =0. \end{aligned}$$
(30)

Proof

By contradiction, we assume \(\lim _{k \rightarrow \infty } \Vert g_k\Vert \ne 0\). Hence there exists \(\varepsilon >0\) and an infinite subsequence of \(\{x_k\}\), indexed by \(\{t_i\}\), such that

$$\begin{aligned} \Vert g_{t_i}\Vert \ge 2\varepsilon >0, \end{aligned}$$
(31)

for all \(i \in \mathbb {N}\). Theorem 10 ensures the existence, for each \(t_i\), of a first successful iteration \(r(t_i)>t_i\) such that \(\Vert g_{r(t_i)}\Vert < \varepsilon \). We denote \(r_i=r(t_i)\). Hence there exists another subsequence, indexed by \(\{r_i\}\), such that

$$\begin{aligned} \Vert g_k\Vert \ge \varepsilon ~~ \mathrm {for}~~ t_i \le k < r_i,\quad \Vert g_{r_i}\Vert <\varepsilon . \end{aligned}$$
(32)

We now restrict our attention to the sequence of successful iterations whose indices are in the set

$$\begin{aligned} \kappa = \big \{k \in \mathbb {N}\ |\ t_i \le k < r_i \big \}. \end{aligned}$$

Using (32), for every \(k \in \kappa \), (28) holds. It follows from (22) and (28) that

$$\begin{aligned} \lim _{k \rightarrow \infty } \delta _k=0, \end{aligned}$$
(33)

for \(k \in \kappa \). Now, using (H2), (17), and \(\Vert g_k\Vert \ge \varepsilon \), the condition (23) holds, for \(k \in \kappa \). This, Lemma 2, and (33) lead to

$$\begin{aligned} |r_k-1|&=\left| \frac{f_k-f\big (x_k+d_k\big )}{q_k(0)-q_k(d_k)} -1\right| =\left| \frac{f_k-f(x_k+d_k)-\big (q_k(0)-q_k\big (d_k\big )\big )}{q_k(0)-q_k(d_k)}\right| \\&\le \frac{O\big (\Vert d_k\Vert ^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _k, \varepsilon /M \right\} }\le \frac{O\big (\delta _k^2\big )}{\beta \varepsilon \delta _k }\rightarrow 0 ~~~ (k\rightarrow \infty ,~ k\in \kappa ). \end{aligned}$$

Thus, for a sufficiently large \(k+1 \in \kappa \), we get

$$\begin{aligned} f_k-f\big (x_k+d_k\big )&\ge \mu _1\big (q_k(0)-q_k(d_k)\big )\nonumber \\&\ge \beta \mu _1 \Vert g_k\Vert \min \left\{ \delta _k, \frac{\Vert g_k\Vert }{\Vert B_k\Vert }\right\} \ge \beta \mu _1 \varepsilon ~ \min \left\{ \delta _k, \frac{\varepsilon }{M} \right\} . \end{aligned}$$
(34)

The condition (33) implies that \(\delta _k \le \varepsilon /M\). Hence, for a sufficiently large \(k \in \kappa \), we obtain

$$\begin{aligned} \delta _k \le \frac{1}{\beta \mu _1}\big (f_k-f_{k+1}\big ). \end{aligned}$$
(35)

Then (18) and (35) imply

$$\begin{aligned} \big \Vert x_{t_i}-x_{r_i}\big \Vert \le \sum _{j \in \kappa , j=t_i}^{r_i-1}\big \Vert x_{j}-x_{j+1}\big \Vert \le \sum _{j \in \kappa , j=t_i}^{r_i-1} \delta _j \le \frac{1}{\beta \mu _1}\big (f_{t_i}-f_{r_i}\big ) \le \frac{1}{\beta \mu _1}\big (T_{t_i}-f_{r_i}\big ), \end{aligned}$$
(36)

for a sufficiently large \(i\). Now, Corollary 8 implies

$$\begin{aligned} 0 \le \lim _{i \rightarrow \infty } \big \Vert x_{t_i}-x_{r_i}\big \Vert \le \lim _{i \rightarrow \infty } \frac{1}{\beta \mu _1}\big (T_{t_i}-f_{r_i}\big ) = 0, \end{aligned}$$

leading to

$$\begin{aligned} \lim _{i \rightarrow \infty } \big \Vert x_{t_i}-x_{r_i}\big \Vert = 0. \end{aligned}$$

Since the gradient is continuous, we get

$$\begin{aligned} \lim _{i \rightarrow \infty } \big \Vert g_{t_i}-g_{r_i}\big \Vert =0. \end{aligned}$$
(37)

In view of the definitions of \(\{t_i\}\) and \(\{r_i\}\), however, we have \(\Vert g_{t_i}-g_{r_i}\Vert \ge \varepsilon \), which makes this impossible. Therefore, there is no subsequence satisfying (31), giving the result. \(\square \)

Note that, in view of the theorem on page 709 of [28], no limit point of the sequence \(\{x_k\}\) is a local maximizer of \(f\).

The next result gives the global convergence of the sequence generated by Algorithm 2 to second-order stationary points. To this end, similar to [16], an additional assumption is needed:

(H4) If \(\lambda _{\min }(B_k)\) represents the smallest eigenvalue of the symmetric matrix \(B_k\), then there exists a positive scalar \(c_3\) such that

$$\begin{aligned} q_k(0)-q_k(d_k) \ge c_3 \lambda _{\min }(B_k) \delta _k^2. \end{aligned}$$

Theorem 12

Suppose that \(f\) is twice continuously differentiable and also suppose that (H1)–(H4) hold. Then there exists a limit point \(x^*\) of the sequence \(\{x_k\}\) generated by Algorithm 2 such that \(\nabla ^2 f(x^*)\) is positive semidefinite.

Proof

The proof is similar to Theorem 3.4 in [16]. \(\square \)

The next two results show that Algorithm 2 eventually reduces to a quasi-Newton or Newton method, so that the sequence \(\{x_k\}\) generated by these schemes attains a local superlinear or quadratic convergence rate, respectively, under some conditions.

Theorem 13

Suppose that (H1)–(H3) hold, and also suppose that the sequence \(\{x_k\}\) generated by Algorithm 2 converges to \(x^*\), \(\Vert d_k\Vert =\Vert -B_k^{-1}g_k\Vert \le \delta _k\), \(H(x)=\nabla ^2 f(x)\) is continuous in a neighborhood \(N(x^*,\varepsilon )\) of \(x^*\), and \(B_k\) satisfies

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{\big \Vert \big [B_k-H(x^*)\big ]d_k\big \Vert }{\Vert d_k\Vert }=0. \end{aligned}$$
(38)

then

  1. (i)

    There exists a constant \(k_1\) such that for all \(k \ge k_1\) we have \(x_{k+1} = x_k + d_k\);

  2. (ii)

    The sequence \(\{x_k\}\) generated by Algorithm 2 converges to \(x^*\) superlinearly.

Proof

(i) The condition (38) implies

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{\big \Vert g_k+H(x^*)d_k\big \Vert }{\big \Vert d_k\big \Vert }=0, \end{aligned}$$
(39)

leading to

$$\begin{aligned} d_k=-H(x^*)^{-1} g_k+o\big (\Vert d_k\Vert \big ). \end{aligned}$$

This implies that

$$\begin{aligned} \big \Vert d_k\big \Vert \le \big \Vert H(x^*)^{-1}\big \Vert ~ \big \Vert g_k\big \Vert +o\big (\Vert d_k\Vert \big ). \end{aligned}$$
(40)

Theorem 11 implies that \(\Vert g_k\Vert \rightarrow 0\), as \(k\rightarrow \infty \). This and (40) give

$$\begin{aligned} \lim _{k\rightarrow \infty } \Vert d_k\Vert =0. \end{aligned}$$
(41)

This, (18), and (H2) imply

$$\begin{aligned} |r_k-1|&=\left| \frac{f_k-f\big (x_k+d_k\big )}{q_k(0)-q_k(d_k)} -1\right| =\left| \frac{f_k-f\big (x_k+d_k\big )-\big (q_k(0)-q_k(d_k)\big )}{q_k(0) -q_k(d_k)}\right| \\&\le \frac{O\big (\Vert d_k\Vert ^2\big )}{\beta \varepsilon ~ \min \left\{ \delta _k,\varepsilon /M \right\} } \le \frac{O\big (\Vert d_k\Vert ^2\big )}{\beta \varepsilon ~ \min \left\{ \Vert d_k\Vert ,\varepsilon /M \right\} } \rightarrow 0 ~~~ (k\rightarrow \infty ). \end{aligned}$$

This clearly implies that there exists a positive integer \(k_1\) such that for \(k \ge k_1\) we have \(x_{k+1} = x_k + d_k\).

(ii) From \(d_k = - B_k^{-1} g_k\), we obtain

$$\begin{aligned} \frac{\big \Vert g_k+H_k d_k\big \Vert }{\big \Vert d_k\big \Vert } = \frac{\big \Vert \big [H_k- B_k\big ]d_k\big \Vert }{\big \Vert d_k\big \Vert } \le \frac{\big \Vert \big [H_k-H(x^*)\big ]d_k\big \Vert }{\big \Vert d_k\big \Vert } + \frac{\big \Vert \big [B_k-H(x^*)\big ]d_k\big \Vert }{\big \Vert d_k\big \Vert }. \end{aligned}$$

This, the continuity of \(H\) near \(x^*\), the convergence \(x_k \rightarrow x^*\), and (38) lead to

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{\big \Vert g_k+H_k d_k\big \Vert }{\big \Vert d_k\big \Vert }=0. \end{aligned}$$
(42)

Now Theorem 3.6 in [33] implies that \(\{x_k\}\) generated by Algorithm 2 converges to \(x^*\) superlinearly. \(\square \)

Notice that if \(f\) is thrice continuously differentiable and the level set \(L(x_0)\) is bounded, then \(\Vert \nabla ^3f(x)\Vert \) is bounded on an open bounded convex set \(\Omega \) containing \(L(x_0)\), i.e., there exists a constant \(L>0\) such that \(\Vert \nabla ^3f(x)\Vert \le L\) on \(\Omega \). Hence, by the mean value theorem,

$$\begin{aligned} \big \Vert H(x)-H(y)\big \Vert \le L\big \Vert x-y\big \Vert , \end{aligned}$$
(43)

for all \(x, y \in \Omega \). This implies that the Hessian of \(f\) is Lipschitz continuous. This condition can guarantee the quadratic convergence of the sequence \(\{x_k\}\) generated by Algorithm 2. The details are summarized in the next result.

Theorem 14

Suppose that \(f(x)\) is a twice continuously differentiable function on \({\mathbb {R}}^n\), and all assumptions of Theorem 11 hold. If \(\Vert d_k\Vert = \Vert -H_k^{-1}g_k\Vert \le \delta _k\), and there exists a neighborhood \(N(x^*,\epsilon )\) of \(x^*\) such that \(H(x)\) is Lipschitz continuous on \(N(x^*,\epsilon )\), i.e., there exists \(L\) such that

$$\begin{aligned} \big \Vert H(x)-H(y)\big \Vert \le L \big \Vert x-y\big \Vert , \end{aligned}$$
(44)

then

  1. (i)

    There exists a constant \(k_2\) such that for all \(k \ge k_2\) we have \(x_{k+1} = x_k + d_k\);

  2. (ii)

    The sequence \(\{x_k\}\) generated by Algorithm 2 converges to \(x^*\) quadratically.

Proof

  1. (i)

    By replacing \(B_k\) by \(H_k\) in Theorem 13, we obtain that there exists an integer \(k_2>0\) such that

    $$\begin{aligned} x_{k+1} = x_k - H_k^{-1}g_k, \end{aligned}$$

for all \(k \ge k_2\).

  2. (ii)

    The condition described in (i) and Theorem 3.5 in [33] give the results.

\(\square \)

4 Numerical experiments

In this section we report numerical results for Algorithm 2 equipped with the two novel nonmonotone terms proposed in Sect. 2 for solving unconstrained optimization problems. In our experiments we use several versions of Algorithm 2 employing state-of-the-art nonmonotone terms. In detail, we consider

  • NMTR-G: Algorithm 2 with the nonmonotone term of Grippo et al. [28];

  • NMTR-H: Algorithm 2 with the nonmonotone term of Zhang and Hager [47];

  • NMTR-N: Algorithm 2 with the nonmonotone term of Amini et al. [8];

  • NMTR-M: Algorithm 2 with the nonmonotone term of Ahookhosh et al. [3];

  • NMTR-1: Algorithm 2 with the nonmonotone term (14);

  • NMTR-2: Algorithm 2 with the nonmonotone term (15).

In the experiments we used 112 test problems of the CUTEst test collection [25] with dimensions from 2 to 5000, where we ignore test problems with dimension greater than 5000. All of the codes are written in MATLAB using the same subroutines, and they are tested on a laptop with a 2 GHz Core i5 processor and 4 GB of RAM using the double-precision data type. The initial points are the standard ones proposed in CUTEst. All the algorithms use the radius

$$\begin{aligned} \delta _{k+1}= \left\{ \begin{array}{ll} c_1\Vert d_k\Vert &{} \quad {\mathrm {if}}~ {\widehat{r}}_k < \mu _1,\\ \delta _k&{} \quad {\mathrm {if}}~ \mu _1 \le {\widehat{r}}_k < \mu _2, \\ \max \{\delta _k,c_2 \Vert d_k\Vert \} &{} \quad {\mathrm {if}}~ {\widehat{r}}_k \ge \mu _2, \\ \end{array} \right. \end{aligned}$$

where

$$\begin{aligned} \mu _{1}=0.05,\;\mu _{2} = 0.9,\; c_1=0.25,\; c_2=2.5,\; \delta _{0} =0.1\Vert g_0 \Vert , \end{aligned}$$

see [27]. In the model \(q_k\) (2), an approximation of the Hessian is generated by the BFGS updating formula

$$\begin{aligned} B_{k+1}=B_k + \frac{y_k y_k^T}{s_k^T y_k}-\frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}, \end{aligned}$$

where \(s_k=x_{k+1}-x_k\) and \(y_k=g_{k+1}-g_k\). For NMTR-G, NMTR-N, NMTR-1, and NMTR-2, we set \(N = 10\). As discussed in [47], NMTR-H uses \(\eta _k = 0.85\). On the basis of our experiments, we update the parameter \(\eta _k\) by

$$\begin{aligned} \eta _{k}=\left\{ \begin{array}{ll} \eta _0/2 \qquad &{} \quad {\mathrm {if}}~ k=1, \\ (\eta _{k-1}+\eta _{k-2})/2 &{} \quad {\mathrm {if}}~ k\ge 2, \end{array} \right. \end{aligned}$$

for NMTR-N, NMTR-M, NMTR-1, and NMTR-2, where the parameter \(\eta _0\) will be tuned to get a better performance. To solve the quadratic subproblem (3), we use the Steihaug–Toint scheme [14, Chapter 7, p. 205], which is terminated if

$$\begin{aligned} \Vert g(x_k+d)\Vert \le \min \left\{ 1/10, \Vert g_k\Vert ^{1/2}\right\} \Vert g_k\Vert ~~~ \text {or}~~~ \Vert d\Vert =\delta _k. \end{aligned}$$

In our experiments the algorithms are stopped whenever the total number of iterations exceeds 10,000 or

$$\begin{aligned} \Vert g_k\Vert < \varepsilon \end{aligned}$$
(45)

holds with the accuracy parameter \(\varepsilon = 10^{-5}\).
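The radius rule and the BFGS update above can be sketched as follows (the curvature safeguard in bfgs_update, skipping the update when \(s_k^T y_k \le 0\), is an assumption added here and is not stated in the text):

```python
import numpy as np

def update_radius(delta, d, r_hat, mu1=0.05, mu2=0.9, c1=0.25, c2=2.5):
    """Radius rule used by all algorithms in the experiments (see above)."""
    if r_hat < mu1:
        return c1 * np.linalg.norm(d)
    if r_hat < mu2:
        return delta
    return max(delta, c2 * np.linalg.norm(d))

def bfgs_update(B, s, y):
    """BFGS update of B_k with s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k."""
    sy = s @ y
    if sy <= 1e-12:          # curvature safeguard (an assumption added here)
        return B
    Bs = B @ s
    return B + np.outer(y, y) / sy - np.outer(Bs, Bs) / (s @ Bs)
```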

To compare the results appropriately, we use the performance profiles of Dolan and Moré [19], where the measures of performance are the number of iterations (\(N_i\)), the number of function evaluations (\(N_f\)), and the number of gradient evaluations (\(N_g\)). In the algorithms considered, the number of iterations and the number of gradient evaluations are the same, so we only report the profile for gradients. It is commonly assumed that the computational cost of a gradient is about three times the computational cost of a function value, so we further consider the measure \(N_f+3 N_g\). The performance of each code is measured by the ratio of its computational outcome to the best outcome of all codes. This profile offers a tool for comparing the performance of iterative schemes in a statistical framework. Let \(\mathcal {S}\) be the set of all algorithms and \(\mathcal {P}\) be the set of test problems. For each problem \(p\) and solver \(s\), \(t_{p,s}\) is the computational outcome with respect to the performance index, which is used in defining the performance ratio

$$\begin{aligned} r_{p,s}=\frac{t_{p,s}}{\min \big \{ t_{p,s}: s\in \mathcal {S}\big \}}. \end{aligned}$$
(46)

If an algorithm \(s\) fails to solve a problem \(p\), the procedure sets \(r_{p,s}=r_{\mathrm {failed}}\), where \(r_{\mathrm {failed}}\) should be strictly larger than any performance ratio (46). For any factor \(\tau \), the overall performance of an algorithm \(s\) is given by

$$\begin{aligned} \rho _s(\tau )=\frac{1}{n_p} \text {size}\big \{p \in \mathcal {P}: r_{p,s} \le \tau \big \}. \end{aligned}$$

In fact \(\rho _s(\tau )\) is the probability that a performance ratio \(r_{p,s}\) of the algorithm \(s\in \mathcal {S}\) is within a factor \(\tau \in {\mathbb {R}}\) of the best possible ratio. The function \(\rho _s(\tau )\) is a distribution function for the performance ratio. In particular, \(\rho _s(1)\) gives the probability that the algorithm \(s\) wins over all other considered algorithms, and \(\lim _{\tau \rightarrow r_{\mathrm {failed}}}\rho _s(\tau )\) gives the fraction of the considered problems that the algorithm \(s\) can solve. Hence the performance profile can be considered as a measure of efficiency for comparing iterative schemes. In Figs. 3 and 4, the x-axis shows the factor \(\tau \) while the y-axis shows \(P(r_{p,s} \le \tau : 1 \le s \le n_s)\).
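A compact sketch of how such profiles can be computed is given below; the measurements in the example are hypothetical and are not the data behind Figs. 3 and 4.

```python
import numpy as np

def performance_profile(T, taus, r_failed=1e6):
    """Dolan-More performance profiles, cf. (46).

    T is an (n_p, n_s) array of measures t_{p,s}; np.inf marks a failure.
    Returns rho_s(tau) for every tau in taus and every solver s.
    """
    T = np.asarray(T, dtype=float)
    finite = np.isfinite(T)
    best = np.nanmin(np.where(finite, T, np.nan), axis=1, keepdims=True)
    ratios = np.where(finite, T / best, r_failed)        # performance ratios r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= tau) for s in range(T.shape[1])]
                     for tau in taus])

# hypothetical measurements for 4 problems and 3 solvers (np.inf = failure)
T = [[10, 12, 15], [30, np.inf, 25], [7, 7, 9], [100, 90, 80]]
print(performance_profile(T, taus=[1.0, 1.5, 2.0]))
```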

4.1 Experiments with highly nonlinear problems

In this subsection we give some numerical results regarding the implementation of NMTR-1 and NMTR-2 compared with TTR on some two-dimensional highly nonlinear problems involving a curved narrow valley. More precisely, we consider the Nesterov–Chebyshev–Rosenbrock, Maratos, and NONDIA functions, see, for example, [9]. In Example 1 the Nesterov–Chebyshev–Rosenbrock function is given, and the Maratos and NONDIA functions are given by

$$\begin{aligned} \begin{array}{l} \displaystyle f(x_1,x_2) =x_1 + \theta _1 \big ( x_1^2 + x_2^2 - 1 \big )^ 2 ~~~ (\mathrm {Maratos~ function})\\ \end{array} \end{aligned}$$

and

$$\begin{aligned} \displaystyle f(x_1,x_2) = \big (1-x_2\big )^2 + \theta _2\big ( x_1 - x_2^2\big )^2 ~~~(\mathrm {NONDIA~ function}), \end{aligned}$$

respectively, where we consider \(\theta _1 = 10\) and \(\theta _2 = 100\).

Fig. 2

A comparison among NMTR-1, NMTR-2, and TTR: a, c, and e respectively illustrate the contour plots of the two-dimensional Nesterov–Chebyshev–Rosenbrock, Maratos, and NONDIA functions and the iterates of NMTR-1, NMTR-2, and TTR; b, d, and f show the diagrams of function values versus iterations

Fig. 3

Performance profiles of NMTR-1 and NMTR-2 with the performance measures \(N_g\), \(N_f\), and \(N_f+3 N_g\): a and b display the number of iterations (\(N_i\)) or gradient evaluations (\(N_g\)); c and d show the number of function evaluations (\(N_f\)); e and f display the hybrid measure \(N_f+3 N_g\)

Fig. 4

A comparison among NMTR-G, NMTR-H, NMTR-N, NMTR-M, NMTR-1, and NMTR-2 by performance profiles using the measures \(N_g\), \(N_f\), and \(N_f+3 N_g\): a displays the number of iterations (\(N_i\)) or gradient evaluations (\(N_g\)); b shows the number of function evaluations (\(N_f\)); c and d display the hybrid measure \(N_f+3 N_g\) with \(\tau = 1.5\) and \(\tau = 5.5\), respectively

We solve the problem (1) for these three functions using TTR, NMTR-1, and NMTR-2, and the results regarding the number of iterations and function evaluations are summarized in Table 1. To give a clear view of the behaviour of TTR, NMTR-1, and NMTR-2, we depict the contour plots of the considered functions and the iterates obtained by the algorithms in Fig. 2a, c, and e. In all three cases, one can see that NMTR-1 and NMTR-2 need fewer iterations and function evaluations than TTR to solve the problem. Moreover, TTR behaves monotonically and follows the bottom of the associated valley, while NMTR-1 and NMTR-2 fluctuate in the valley.

Table 1 Numerical results for highly nonlinear problems

4.2 Experiments with CUTEst test problems

In this subsection we give numerical results regarding experiments with NMTR-1 and NMTR-2 on the CUTEst test problems (Table 2) compared with NMTR-G, NMTR-H, NMTR-N, and NMTR-M.

To get a better performance from NMTR-1 and NMTR-2, we tune the parameter \(\eta _0\) by testing several fixed values, namely \(\eta _0=0.15, 0.25, 0.35, 0.45\), for both algorithms. The corresponding versions of the algorithms NMTR-1 and NMTR-2 are denoted by NMTR-1-0.15, NMTR-1-0.25, NMTR-1-0.35, NMTR-1-0.45, NMTR-2-0.15, NMTR-2-0.25, NMTR-2-0.35, and NMTR-2-0.45, respectively. The results are summarized in Fig. 3 for three measures: the number of function evaluations; the number of gradient evaluations; the mixed measure \(N_f + 3 N_g\). Figure 3a, c, and e illustrate the results of NMTR-1, which produces its best results with \(\eta _{0} = 0.25\). From Fig. 3b, d, and f, it can be seen that the best results for NMTR-2 are produced by \(\eta _{0}=0.45\). Hence we use \(\eta _{0} = 0.25\) for NMTR-1 and \(\eta _{0} = 0.45\) for NMTR-2 in the remainder of our experiments.

We now test NMTR-G, NMTR-H, NMTR-N, NMTR-M, NMTR-1, and NMTR-2 for solving the unconstrained problem (1) and compare the produced results. We illustrate the comparison in Fig. 4 by using performance profiles for the measures \(N_g\), \(N_f\), and \(N_f+3 N_g\). Figure 4a displays the number of gradient evaluations, where the best results are attained by NMTR-2 and then by NMTR-N, with about 63 and 52 % of the wins, respectively. NMTR-1 is comparable with NMTR-G, NMTR-H, and NMTR-N, but its curve rises faster than the others, which means its performance is close to that of the best method, NMTR-2. Figure 4b shows the number of function evaluations and has a similar interpretation to Fig. 4a; however, NMTR-2 here attains about 60 % of the wins. Figure 4c and d display the mixed measure \(N_f+3 N_g\) with \(\tau = 1.5\) and \(\tau = 5.5\), respectively. In this case NMTR-2 outperforms the others by attaining about 58 % of the wins, and the others have comparable results; however, the curves of NMTR-1 and NMTR-M rise faster than the others, implying that they perform close to the best algorithm, NMTR-2.

5 Concluding remarks

In this paper we give some motivation for employing nonmonotone strategies in trust-region frameworks. Then we introduce two new nonmonotone terms and combine them with the traditional trust-region framework. It is shown that the proposed methods are globally convergent to first- and second-order stationary points. Moreover, local superlinear and quadratic convergence rates are established. Applying these methods to some highly nonlinear test problems involving a curved narrow valley shows that they have promising behaviour compared with the monotone trust-region method. Numerical experiments on a set of test problems from the CUTEst test collection show the efficiency of the proposed nonmonotone methods.

Further research can be done in several directions. For example, by combining the proposed nonmonotone trust-region methods with various adaptive radius strategies, more efficient trust-region schemes can be derived, see, for example, [2, 7]. The combination of the proposed nonmonotone terms with several inexact line searches such as Armijo, Wolfe, and Goldstein is also interesting, see [6, 8]. The extension of the proposed method to constrained nonlinear optimization could be interesting, especially for nonnegativity and box constraints, see, for example, [10–12, 34, 42, 43]. It could also be interesting to employ nonmonotone schemes for solving nonlinear least-squares problems and systems of nonlinear equations, see [5] and references therein. Moreover, investigating new adaptive formulas for the parameter \(\eta _k\) would be valuable for improving the computational efficiency.