1 Introduction

Trust region methods are powerful techniques for nonlinear optimization that have the ability to incorporate second-order information, without requiring it to be positive definite. They are endowed with strong global convergence properties and have proven to be effective in practice. Although the design and analysis of trust region methods are well established in the absence of noise (or errors), this is not the case when noise is present.

In this paper, we show how to redesign the classical trust region method for unconstrained optimization to handle problems where the objective function, gradient, and (possibly) Hessian, are subject to noise. The modification involves only one change in the algorithm: the ratio of actual/predicted reduction used for step acceptance is relaxed by a term proportional to the noise level. All other aspects of the classical trust region method remain unchanged. We show that, under mild assumptions, the proposed algorithm converges to a neighborhood of stationary points, where the size of the neighborhood is determined by the noise level—which is assumed to be bounded. This analysis is more complex than for line search methods due to the effects of memory encapsulated in the trust region update. Our convergence results do not assume convexity of the objective function but only that it is sufficiently smooth.

Examples of practical optimization applications with bounded noise include those that employ mixed-precision arithmetic [22]; derivative-free problems where derivative information is approximated by finite differences [3, 25, 34]; problems in which the evaluation of the objective function (and gradient) contains computational noise [23]; and problems that involve uncertainty. Examples of the latter occur in design optimization when the physical parameters of the system are uncertain, making the objective function stochastic [26]. Other applications employ a combination of high- and low-fidelity models, creating bounded noise in the objective function; see [31] for a survey of optimization problems of this kind.

Although our analysis assumes that noise is bounded, the algorithm will normally be robust under various unbounded noise models, such as Gaussian noise. We illustrate this in Sect. 4.3 through a set of numerical experiments.

This investigation was motivated by numerical experiments performed by the authors that indicated that, although the classical trust region approach often tolerates some noise, it can fail in certain situations. This raises the question of how to best modify the method to avoid failures. The algorithm proposed here is inspired by work on line search methods for unconstrained optimization [2, 35] and equality constrained optimization [28]. In those papers, convergence-to-neighborhood results were derived, but the analysis presented here follows different lines since trust region methods require different proof techniques.

The paper is organized into 5 sections. In the rest of this section, we review relevant literature. In Sect. 2, we describe the problem setting and the proposed trust region algorithm. The main convergence results are presented in Sect. 3. Numerical experiments are summarized in Sect. 4, and Sect. 5 presents final remarks.

1.1 Literature review

The study of nonlinear optimization with errors or noise in the function and gradient has attracted attention in recent years, motivated, e.g., by renewed interest in methods for derivative-free optimization based on finite differences [25, 34], and by applications in machine learning; see [17] for a review of some recent work.

One of the earliest studies of nonlinear optimization in the presence of errors [32] gives a characterization of various types of noise arising in stochastic control, and analyzes the performance of several algorithms. To our knowledge, the first investigation of trust region methods with errors is [12]. That paper establishes global convergence for unconstrained optimization assuming that errors in the gradient diminish at a rate proportional to the norm of the true gradient; this condition is referred to as the “norm test” in [4, 8, 9]. The norm test is incorporated into various algorithms [11, 14, 29] for which convergence and complexity results are established.

Three kinds of assumptions have been made about the nature of the noise in the recent literature:

  1. Noise follows some probability distribution [5,6,7, 15, 18, 20, 21, 29].

  2. Noise is bounded and cannot be diminished [2, 4, 28, 35]—an assumption made in our analysis.

  3. Noise is controlled by the algorithm [1, 8, 11, 13, 19, 30].

The differences in these assumptions lead naturally to differences in analytical techniques and in the form of the theoretical results. Various algorithmic modifications have been proposed in order to establish convergence results in the noisy setting, as we now discuss.

Algorithmic Features to Account for Noise. We categorize features introduced to handle noise as follows.

  (i) Step Control for Stochastic Noise. In [14], an unmodified line search is proposed for well-behaved settings. To stabilize line search algorithms, [29] introduces the concept of “reliable steps”: a step is deemed reliable if the stochastic directional derivative is larger than a quantity that is adjusted at run time. For trust region algorithms, [7, 15, 21] propose controls on the trust region update; e.g., the norm of the gradient must be larger than a multiple of the norm of the step in order for the trust region to be increased—even if the step made good progress. Classical techniques used in the stochastic gradient descent method (SGD) include constant or diminishing step sizes; see, e.g., [10].

  (ii) Line Search Relaxation for Non-Diminishing Noise. To avoid steps being incorrectly rejected due to the presence of noise, [2, 4, 20] propose relaxing the Armijo backtracking line search by a term proportional to the function evaluation error. In contrast, [35] repeats the iteration (with a new stochastic gradient) when the Armijo line search fails. In this paper, we adapt the relaxation strategy to trust region algorithms. For constrained problems, [28] relaxes the line search by a term that is the product of the penalty parameter and the noise level.

  (iii) Controlled Reduction of Noise. Customized techniques have been developed for problems where the noise level can be controlled. [11] studies sample size selection techniques based on the norm test for machine learning problems; [8] proposes an alternative to the norm test for adaptive sampling methods; [13] ensures that the function and derivative evaluations are accurate enough to guarantee the reliability of the models used for step generation. Other work in this category includes [19, 30].

Analytical Techniques and Theoretical Results. The convergence results in the literature can be grouped into three categories. (i) Assuming that noise follows some probability distribution, stopping time results along with iteration complexity bounds are derived in [5,6,7, 15, 18, 20, 21, 29]. (ii) Assuming that noise is bounded and does not diminish as the iterates approach the solution, convergence-to-neighborhood results are presented in [2, 24, 28, 35], as well as in this paper. (iii) Rate of convergence and complexity bounds for algorithms that gradually diminish the noise level in the function and/or gradient are presented in [1, 8, 11, 13, 19, 30]. The analytical approaches (i)–(iii) complement each other and model conditions arising in practical applications.

The following table provides a summary of the noise assumptions and theoretical results mentioned above.

Noise/results | Follow distributions | Bounded, non-diminishing | Controllable
Iteration Compl. and Stopping time | [5,6,7, 15, 18, 20, 21, 29] |  | 
Conv. to Neighborhood |  | [2, 24, 28, 35], this paper | 
Complexity and rates |  |  | [1, 8, 11, 13, 19, 30]

2 Problem statement and algorithm

Our goal is to design a trust region method to solve the unconstrained minimization problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f(x), \end{aligned}$$
(1)

in the case when the function f(x) and gradient \(g(x) = \nabla f(x)\) cannot be evaluated exactly. Instead, we have access to noisy observations of the above quantities, which we denote as \({{\tilde{f}}}(x)\), and \({{\tilde{g}}}(x)\). We write

$$\begin{aligned} {{\tilde{f}}}(x) = f(x) + \delta _f(x),\quad \text {and}\quad {{\tilde{g}}}(x) = g(x) + \delta _g(x), \end{aligned}$$
(2)

where the error functions (or noise) \( \delta _f(x)\), \(\delta _g(x)\) are assumed to be bounded, i.e.,

$$\begin{aligned} |\delta _f(x)|\le \epsilon _f, \qquad \Vert \delta _g(x)\Vert \le \epsilon _g, \qquad \ \forall x\in {\mathbb {R}}^n. \end{aligned}$$
(3)

Throughout the paper \(\Vert \cdot \Vert \) stands for the Euclidean norm.
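To make this setting concrete, the following minimal Python sketch implements noisy oracles of the form (2)–(3) for a toy quadratic objective. The function names, the toy objective, and the noise model (uniform function noise, a randomly oriented gradient perturbation) are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

EPS_F, EPS_G = 1e-3, 1e-3   # noise bounds eps_f, eps_g in (3)

def f(x):
    """True (inaccessible) objective: a toy strongly convex quadratic."""
    return 0.5 * float(x @ x)

def g(x):
    """True gradient of f."""
    return x.copy()

def f_tilde(x):
    """Noisy evaluation (2): f(x) + delta_f(x) with |delta_f(x)| <= eps_f."""
    return f(x) + EPS_F * rng.uniform(-1.0, 1.0)

def g_tilde(x):
    """Noisy gradient (2): g(x) + delta_g(x) with ||delta_g(x)|| <= eps_g."""
    d = rng.standard_normal(x.size)
    d *= EPS_G * rng.uniform(0.0, 1.0) / np.linalg.norm(d)
    return g(x) + d
```

Any oracle satisfying the bounds (3) could be substituted; the analysis below depends only on \(\epsilon _f\) and \(\epsilon _g\), not on the distribution of the errors.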

Let us apply a classical trust region method to problem (1). At each iterate, the method constructs a quadratic model

$$\begin{aligned} m_{k}(p)={{\tilde{f}}}(x_k)+\tilde{g}(x_k)^{T} p+\frac{1}{2} p^{T} {{\tilde{B}}}_{k} p, \end{aligned}$$
(4)

and solves the following trust region subproblem for the step \(p_k\):

$$\begin{aligned} \min _{p \in {\mathbb {R}}^{n}} m_{k}(p)\quad \text { s.t. }\Vert p\Vert \le \varDelta _{k}. \end{aligned}$$
(5)

In Eq. 4, \(\tilde{B}_k\) could be defined as a noisy evaluation of the Hessian, a quasi-Newton matrix, or some other approximation. To decide if the step \(p_k\) should be accepted—and if the trust region radius \(\varDelta _k\) should be modified—classical trust region methods employ the ratio of actual to predicted reduction in the objective function, defined as

$$\begin{aligned} \frac{\tilde{f}\left( x_{k}\right) -\tilde{f}\left( x_{k}+p_{k}\right) }{m_{k}(0)-m_{k}\left( p_{k}\right) }. \end{aligned}$$
(6)

This ratio is, however, not adequate in the presence of noise because if \(\varDelta _k\) becomes very small, the numerator can be of order \(\epsilon _f\), while the denominator will be proportional to \(\varDelta _k\). Thus, if \(\varDelta _k \ll \epsilon _f\), the ratio (6) may exhibit wild oscillations that can cause the algorithm to perform erratically, as we illustrate in Sect. 4.
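The oscillation just described is easy to reproduce numerically. The sketch below evaluates the classical ratio (6) on a toy one-dimensional quadratic with bounded function noise and a trust region radius far below \(\epsilon _f\); the specific constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
eps_f = 1e-3                     # function noise level in (3)

def f_tilde(x):
    """Noisy one-dimensional quadratic: 0.5*x^2 plus bounded noise."""
    return 0.5 * x * x + eps_f * rng.uniform(-1.0, 1.0)

# At x = 1 (so g = 1, B = 1) take the Cauchy step p = -delta for a
# trust region radius far below the noise level.
x, delta = 1.0, 1e-8
p = -delta
pred = delta - 0.5 * delta**2    # m(0) - m(p), of order delta
ratios = [(f_tilde(x) - f_tilde(x + p)) / pred for _ in range(1000)]
# The numerator is dominated by noise of order eps_f, so the classical
# ratio swings over a range of order eps_f / delta (about 1e5 here).
spread = max(ratios) - min(ratios)
```

The sampled ratios take both huge positive and huge negative values, so step acceptance based on (6) becomes essentially random in this regime.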

To address this issue, we propose the following noise tolerant variant of (6):

$$\begin{aligned} \rho _{k}=\frac{\tilde{f}\left( x_{k}\right) -{{\tilde{f}}}\left( x_{k}+p_{k}\right) +r\epsilon _f}{m_{k}(0)-m_{k}\left( p_{k}\right) +r\epsilon _f} , \end{aligned}$$
(7)

where \(r > 2 \) is a constant specified below. The reason for relaxing both the numerator and denominator in (7) is to be consistent with the classical narrative of trust region methods where a ratio close to 1 is an indication that the model is adequate. An alternative approach would be to relax only the numerator and interpret the condition \(\rho _k > c\) (where \(c>0\) is a constant) as a relaxed Armijo condition of the type studied in [2, 28]. We find the first interpretation to be easier to motivate and to yield tighter bounds in the convergence analysis. We state the algorithm as follows.

Typical values of the parameters are \(c_0= 0.1\), \(c_1=\frac{1}{4}\), \(c_2= \frac{1}{2}\), \(\nu = 2\), but other values can be used in practice. The global convergence result presented in the next section holds if the constant r in (7) is chosen as

$$\begin{aligned} r = 2/{(1-c_2)};\end{aligned}$$
(8)

i.e., \(r=4\) when \(c_2\) is chosen as 1/2. We assume that the step \(p_k\) computed in Step 3 yields a decrease in the model \(m_k\) that is at least as large as that given by the Cauchy step (defined below). This provides much freedom in the design of the algorithm, and includes the dogleg and Newton-CG methods, as well as the exact solution of the trust region problem; see, e.g., [27]. We note in passing that LANCELOT [16] employs a ratio similar (though not identical) to (7) as a heuristic to handle roundoff errors.
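As a sanity check on the relaxation, the following sketch evaluates the noise-tolerant ratio (7) with \(r=4\) in the same small-radius regime where the classical ratio (6) oscillates wildly. The toy quadratic and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
eps_f = 1e-3
c2 = 0.5
r = 2.0 / (1.0 - c2)             # relaxation constant (8); r = 4 here

def f_tilde(x):
    """Noisy one-dimensional quadratic: 0.5*x^2 plus bounded noise."""
    return 0.5 * x * x + eps_f * rng.uniform(-1.0, 1.0)

def rho(f_old, f_new, pred):
    """Noise-tolerant ratio (7)."""
    return (f_old - f_new + r * eps_f) / (pred + r * eps_f)

# Small-radius regime: predicted reduction of order delta = 1e-8 << eps_f.
delta = 1e-8
pred = delta - 0.5 * delta**2
ratios = [rho(f_tilde(1.0), f_tilde(1.0 - delta), pred) for _ in range(1000)]
# Here |rho - 1| <= 2*eps_f / (r*eps_f) = 1 - c2, so rho stays inside
# (c2, 2 - c2) instead of oscillating over a range of order eps_f/delta.
```

In this regime the relaxed ratio remains close to 1, consistent with the interpretation that the model is adequate up to the noise level.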

In practice it can be useful to increase the trust region radius in Step 8 only if \(\rho _{k}>c_2\) and \(\Vert p_{k}\Vert = \varDelta _{k}\), as this can prevent unnecessary oscillations in the trust region radius. The convergence result presented in the next section can easily be extended to that case, assuming certain technical conditions on the step computation—which are satisfied by the dogleg and Newton-CG methods.

In the case of stochastic noise (as opposed to deterministic errors), one may opt to re-evaluate \({\tilde{f}}(x_k)\) and compute a new gradient approximation \({{\tilde{g}}}_k\) in Step 16, so as not to emphasize the effects of a poor choice of \({{\tilde{g}}}_k\). This variant is also covered by the analysis presented below.

3 Global convergence analysis

In this section, we establish a global convergence result for Algorithm 1 that applies to general objective functions. The proof is based on the observation that, when the gradient is large enough, the trust region radius will eventually become large too, ensuring sufficient descent in the objective function despite the presence of noise. This drives the iteration toward regions where the stationarity measure is small (i.e., comparable to the noise level).

[Algorithm 1 (Noise-Tolerant Trust Region Method) appears here as a figure.]
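Since the listing of Algorithm 1 appears only as a figure, the following Python sketch reconstructs its main loop from the steps referenced in the surrounding text (Steps 3, 6, 8, and 12). The step numbering and any detail not stated in the text are assumptions; for simplicity the step computation uses the Cauchy step (9)–(10) with \({{\tilde{B}}}_k = I\).

```python
import numpy as np

def noise_tolerant_tr(f_tilde, g_tilde, x0, eps_f,
                      c0=0.1, c1=0.25, c2=0.5, nu=2.0,
                      delta0=1.0, max_iter=200):
    """Sketch of the noise-tolerant trust region loop described in the text.

    Reconstructed from the steps the paper refers to (Steps 3, 6, 8, 12);
    the step is the Cauchy step (9)-(10) with B_k = I for simplicity.
    """
    r = 2.0 / (1.0 - c2)                 # relaxation constant (8)
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        gk = g_tilde(x)
        gnorm = np.linalg.norm(gk)
        if gnorm == 0.0:
            break
        # Step 3: Cauchy step (9)-(10); with B_k = I, tau = min(||g||/delta, 1)
        tau = min(gnorm / delta, 1.0)
        p = -tau * (delta / gnorm) * gk
        pnorm = np.linalg.norm(p)
        pred = gnorm * pnorm - 0.5 * pnorm**2    # m(0) - m(p) for B_k = I
        # Noise-tolerant ratio (7)
        fk, fnew = f_tilde(x), f_tilde(x + p)
        rho = (fk - fnew + r * eps_f) / (pred + r * eps_f)
        # Steps 6 and 8: trust region update
        if rho > c2:
            delta *= nu
        elif rho < c1:
            delta /= nu
        # Step 12: accept the step when rho > c0
        if rho > c0:
            x = x + p
    return x
```

On a noiseless quadratic this loop reduces to an ordinary trust region iteration; with bounded noise it settles in a neighborhood of the solution, as the analysis of Sect. 3 predicts.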

We begin by establishing a standard requirement on the step computation based on the Cauchy step \(p_{k}^{c}\) for problem (1), which is defined as

$$\begin{aligned} p_{k}^{c}=- \tau _{k} \frac{\varDelta _{k}}{\left\| {{\tilde{g}}}_{k}\right\| } \tilde{g}_{k}, \end{aligned}$$
(9)

where

$$\begin{aligned} \tau _{k}=\left\{ \begin{array}{ll} 1 &{} \text{ if } {{\tilde{g}}}_{k}^{T} {{\tilde{B}}}_{k} {{\tilde{g}}}_{k} \le 0 \\ \min \left( \left\| {{\tilde{g}}}_{k}\right\| ^{3} \big /\left( \varDelta _{k} {{\tilde{g}}}_{k}^{T} {{\tilde{B}}}_{k} {{\tilde{g}}}_{k}\right) , 1\right) &{} \text{ otherwise. } \end{array}\right. \end{aligned}$$
(10)

As is well known (see e.g. [27, Lemma 4.3]), the reduction in the model provided by the Cauchy step satisfies

$$\begin{aligned} m_k(0) - m_k(p_k^c)\ge \frac{1}{2}\left\| {{\tilde{g}}}_{k}\right\| \min \left( \varDelta _{k}, \frac{\left\| {{\tilde{g}}}_{k}\right\| }{\left\| \tilde{B}_{k}\right\| }\right) . \end{aligned}$$
(11)

We assume that the step \(p_k\) computed by Algorithm 1 yields a reduction in the model that is not less than that produced by the Cauchy step, i.e.,

$$\begin{aligned} m_k(0) - m_k(p_k)\ge m_k(0) - m_k(p_k^c)\ge \frac{1}{2}\left\| {{\tilde{g}}}_{k} \right\| \min \left( \varDelta _{k}, \frac{\left\| {{\tilde{g}}}_{k}\right\| }{\left\| \tilde{B}_{k} \right\| }\right) . \end{aligned}$$
(12)
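A direct implementation of the Cauchy step (9)–(10) and of the model reduction appearing in (12) can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def cauchy_step(g, B, delta):
    """Cauchy step (9)-(10): minimizer of the quadratic model along -g,
    restricted to the trust region ||p|| <= delta."""
    gnorm = np.linalg.norm(g)
    gBg = float(g @ B @ g)
    tau = 1.0 if gBg <= 0.0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def model_reduction(g, B, p):
    """m_k(0) - m_k(p) for the quadratic model (4)."""
    return -(float(g @ p) + 0.5 * float(p @ B @ p))
```

On random instances, the reduction produced by `cauchy_step` satisfies the Cauchy-decrease bound (11), with \(\Vert {{\tilde{B}}}_k\Vert \) taken as the spectral norm.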

We can now state the assumptions on the problem and the algorithm under which the global convergence results are established.

Assumption 1

The objective function f is Lipschitz continuously differentiable with constant L, i.e.,

$$\begin{aligned} \Vert g(x) - g(y)\Vert \le L \Vert x-y\Vert , \quad \forall x, y \in {\mathbb {R}}^n. \end{aligned}$$
(13)

Assumption 2

The error in the function and gradient evaluations is bounded, i.e., (3) holds for some constants \(\epsilon _f, \epsilon _g.\)

We impose no conditions on the errors other than boundedness. Our analysis also applies when the algorithm employs a noisy Hessian, with the constant \(L_B\) specified below replaced by a bound on the noisy Hessian. Next, we impose a minimal requirement on the Hessian approximations.

Assumption 3

There is a constant \(L_B >0\) such that the matrices \({{\tilde{B}}}_k\) satisfy

$$\begin{aligned} \Vert {{\tilde{B}}}_k\Vert < L_B, \ \forall k. \end{aligned}$$
(14)

There is freedom in the computation of the step \(p_k\), but it must yield Cauchy decrease.

Assumption 4

The step \(p_k\) computed by Algorithm 1 satisfies (12).

This assumption can be relaxed so as to require only a fraction of Cauchy decrease, but we do not do so here to avoid the introduction of more constants. The final requirement is standard.

Assumption 5

The sequence \(\{ {{\tilde{f}}}_k\}\) generated by Algorithm 1 is bounded below.

We now proceed with the analysis.

3.1 Properties of the ratio \(\rho _k\)

We begin by establishing a bound on the difference between \(\rho _k\) and 1. From (7), we have

$$\begin{aligned} \begin{aligned} \left| \rho _k-1\right|&=\left| \frac{m_{k}\left( p_{k}\right) -\tilde{f}\left( x_{k}+p_{k}\right) }{m_{k}(0)-m_{k}\left( p_{k}\right) +r\epsilon _f}\right| . \end{aligned} \end{aligned}$$
(15)

By the definition of \({{\tilde{f}}}\) and by Taylor’s Theorem we have

$$\begin{aligned} {{\tilde{f}}}(x_k+p_k)&=f(x_k+p_k) + \delta _f(x_k +p_k) \\&= f\left( x_{k}\right) +g_k^{T} p_{k}+\int _{0}^{1}\left[ g\left( x_{k}+t p_{k}\right) -g_k\right] ^{T} p_{k} d t +\delta _f(x_k+p_k). \end{aligned}$$

With this, by (13), (14), and (3), we obtain

$$\begin{aligned}&\left| m_{k}(p_k)-{{\tilde{f}}}(x_k+p_k)\right| \\&\quad = \left| \delta _f(x_k) + \delta _g^T p_k + \frac{1}{2} p_k^T {{\tilde{B}}}_k p_k -\int _{0}^{1}\left[ g\left( x_{k}+t p_{k}\right) -g_k\right] ^{T} p_{k} \, dt - \delta _f(x_k+{p_k}) \right| \\&\quad \le \frac{L_B}{2}\Vert {p_k}\Vert ^2+ \frac{L}{2}\Vert {p_k}\Vert ^2 + \epsilon _g\Vert {p_k}\Vert + 2\epsilon _f\\&\quad = M\Vert {p_k}\Vert ^2 + \epsilon _g\Vert {p_k}\Vert +2\epsilon _f, \end{aligned}$$
(16)

where

$$\begin{aligned} M = \tfrac{1}{2} (L_B+L). \end{aligned}$$
(17)

By substituting (16) and (12) into (15), we establish the following result.

Lemma 1

If \(\rho _k\) is defined by (7), then for all k,

$$\begin{aligned} \left| \rho _k-1\right| \le \frac{ M \varDelta _k^2 + \epsilon _g \varDelta _k+ 2\epsilon _f}{\frac{1}{2}\Vert {{\tilde{g}}}_k\Vert \min (\varDelta _k,\Vert {{\tilde{g}}}_k\Vert /\Vert \tilde{B}_k\Vert )+r\epsilon _f}. \end{aligned}$$
(18)

This lemma suggests that \(\rho _k\) can be made close to 1 by decreasing \(\varDelta _k\), up until the noise term \(\epsilon _f\) dominates. This assertion will be made more precise below.
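The bound (18) can be sanity-checked numerically. The sketch below builds a quadratic test problem with \({{\tilde{B}}}_k\) equal to the exact Hessian \(A\) (so \(M = \Vert A\Vert \) by (17)), samples bounded noise, takes the Cauchy step, and compares \(|\rho _k - 1|\) against the right-hand side of (18); all concrete choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Quadratic test problem f(x) = 0.5 x^T A x with exact Hessian as B_k,
# so L = L_B = ||A|| and M = (L_B + L)/2 = ||A||.
n = 4
Q = rng.standard_normal((n, n))
A = Q @ Q.T + np.eye(n)                 # symmetric positive definite
M = np.linalg.norm(A, 2)
eps_f, eps_g, r = 1e-4, 1e-4, 4.0

def lemma1_holds(x, delta):
    """Check the bound (18) on |rho_k - 1| for one sampled iteration."""
    g = A @ x
    dg = rng.standard_normal(n)
    dg *= eps_g * rng.uniform(0.0, 1.0) / np.linalg.norm(dg)
    gt = g + dg                          # noisy gradient (2)
    gn = np.linalg.norm(gt)
    # Cauchy step (9)-(10) with B_k = A (here g^T A g > 0 always)
    tau = min(gn**3 / (delta * float(gt @ A @ gt)), 1.0)
    p = -tau * (delta / gn) * gt
    pred = -(float(gt @ p) + 0.5 * float(p @ A @ p))
    f0 = 0.5 * float(x @ A @ x) + eps_f * rng.uniform(-1.0, 1.0)
    f1 = 0.5 * float((x + p) @ A @ (x + p)) + eps_f * rng.uniform(-1.0, 1.0)
    rho = (f0 - f1 + r * eps_f) / (pred + r * eps_f)
    bound = (M * delta**2 + eps_g * delta + 2.0 * eps_f) / (
        0.5 * gn * min(delta, gn / np.linalg.norm(A, 2)) + r * eps_f)
    return abs(rho - 1.0) <= bound + 1e-10
```

Across random iterates and radii the bound holds, with slack coming from the \(M\varDelta _k^2\) term, since the model's Hessian is exact in this construction.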

3.2 Lower bound on trust region radius

We now show that if \(\varDelta _k\) is very small and the gradient is large compared to the noise \(\epsilon _g\), Algorithm 1 will increase the trust region radius. We recall that r is defined in (8) and that \(\nu > 1\).

Lemma 2

(Increase of Trust Region Radius) Suppose that, at iteration k,

$$\begin{aligned} \Vert {{\tilde{g}}}_k \Vert > r\epsilon _g+{\gamma }, \end{aligned}$$
(19)

for some constant \({\gamma }>0\). Then, if

$$\begin{aligned} \varDelta _k \le {{\bar{\varDelta }}} := \frac{{\gamma }}{r M }, \end{aligned}$$
(20)

we have that

$$\begin{aligned} \varDelta _{k+1} = \nu \varDelta _k. \end{aligned}$$
(21)

Proof

Since \(r > 2\), we have from (14), (17) and (19) that

$$\begin{aligned} r M> 2 M >\Vert {{\tilde{B}}}_k\Vert \quad \text {and }\quad {\gamma }< \Vert {{\tilde{g}}}_k\Vert , \end{aligned}$$
(22)

and thus

$$\begin{aligned} {{\bar{\varDelta }}}< \Vert {{\tilde{g}}}_k\Vert /\Vert {{\tilde{B}}}_k\Vert . \end{aligned}$$
(23)

Thus, if \(\varDelta _k\le {{\bar{\varDelta }}}\), we have

$$\begin{aligned} \min (\varDelta _k,\Vert {{\tilde{g}}}_k\Vert /\Vert {{\tilde{B}}}_k\Vert ) = \varDelta _k. \end{aligned}$$
(24)

In addition, if \(\varDelta _k\le {{\bar{\varDelta }}}\), we also have

$$\begin{aligned} M \varDelta _k+ \epsilon _g\le M {{\bar{\varDelta }}}+ \epsilon _g = \frac{{\gamma }}{r}+ \epsilon _g =\frac{1}{r}(r\epsilon _g+{\gamma }). \end{aligned}$$
(25)

Substituting (24), (19), (25) and (8) into (18), we have that for all \(\varDelta _k\le {{\bar{\varDelta }}}\)

$$\begin{aligned} \left| \rho _{k}-1\right|&\le \frac{ M \varDelta _k^2 + \epsilon _g \varDelta _k+ 2\epsilon _f}{\frac{1}{2}\Vert {{\tilde{g}}}_k \Vert \varDelta _k+r\epsilon _f}\nonumber \\&< \frac{ M \varDelta ^2_k + \epsilon _g \varDelta _k+ 2\epsilon _f}{\frac{1}{2}(r\epsilon _g+{\gamma })\varDelta _k+r\epsilon _f}\nonumber \\&< \frac{\frac{1}{r}(r\epsilon _g+{\gamma }) \varDelta _k+ 2\epsilon _f}{\frac{1}{2}(r\epsilon _g+{\gamma })\varDelta _k+r\epsilon _f}\nonumber \\&= \frac{2}{r}\nonumber \\&= 1-c_2. \end{aligned}$$
(26)

This implies that \(\rho _k > c_2\), and by step 8 of Algorithm 1 we have that \(\varDelta _{k+1}= \nu \varDelta _k.~~~\) \(\square \)

A consequence of this lemma is that there is a lower bound for the trust region radius if the norm of the noisy gradient remains greater than \(r\epsilon _g\).

Corollary 1

(Lower Bound on Trust Region Radius) Given \({\gamma }>0\), if there exists \(K>0\) such that for all \(k\ge K\)

$$\begin{aligned} \Vert {{\tilde{g}}}_k \Vert > r\epsilon _g+{\gamma }, \end{aligned}$$
(27)

then there exists \(K_0 \ge K \) such that for all \(k\ge K_0\),

$$\begin{aligned} \varDelta _k > \tfrac{1}{\nu }{{\bar{\varDelta }}}= \frac{{\gamma }}{\nu r M }. \end{aligned}$$
(28)

Proof

We apply Lemma 2 to each iteration after K to deduce that, whenever \(\varDelta _k \le {{\bar{\varDelta }}}\), the trust region radius is increased. Thus, there is an index \(K_0\) for which \(\varDelta _k\) becomes greater than \({{\bar{\varDelta }}}\). On subsequent iterations, the trust region radius can never be reduced below \({{\bar{\varDelta }}}/\nu \) (by Step 6 of Algorithm 1), establishing the bound (28). \(\square \)

Remark

In traditional trust region analysis for deterministic (noiseless) optimization, one shows that the trust region radius will not shrink below a certain value that depends on the Lipschitz constant and the norm of the current gradient. However, that analysis does not imply that the trust region will increase beyond a certain threshold, which is required in the presence of noise. We need to show that the trust region eventually becomes large enough with respect to the noise level so that progress can be made. This differentiates our analysis from classical trust region convergence theory.

3.3 Reduction of noisy function

The classical trust region algorithm is monotonic, as it requires a reduction in the objective function when accepting a step. Due to the relaxation in (7), Algorithm 1 can accept steps that increase the noisy function. However, when the iterates are far from the solution, this is not the case. We now show that when the noisy gradient and trust region radius are both large enough, the reduction in the objective is large enough to overcome any increase allowed by (7).

Lemma 3

(Noisy Function Reduction) Suppose that for some \(k>0\)

$$\begin{aligned} \Vert {{\tilde{g}}}_k \Vert > r\epsilon _g+ \gamma \quad \text{ and }\quad \varDelta _k \, {\ge } \, \frac{{{\bar{\varDelta }}}}{\nu } = \frac{\gamma }{\nu rM}, \end{aligned}$$
(29)

where

$$\begin{aligned} \gamma = \eta + \mu , \end{aligned}$$
(30)

with \(\mu >0\) an arbitrarily small constant, and

$$\begin{aligned} \eta = \frac{1}{2}\left( - r\epsilon _g+\beta \right) ,\quad \beta = \sqrt{(r\epsilon _g)^2 + 8\nu r^2\left( \frac{1}{c_0}-1\right) M \epsilon _f }. \end{aligned}$$
(31)

Then, if the step is accepted at iteration k by Algorithm 1, we have

$$\begin{aligned} \tilde{f}\left( x_{k}\right) -{{\tilde{f}}}\left( x_{k}+p_{k}\right) > \frac{c_0}{2\nu r M }\left( \mu \beta + \mu ^2\right) . \end{aligned}$$
(32)

Proof

As argued in (23), \({{\bar{\varDelta }}}= \frac{{\gamma }}{rM}< \frac{\left\| {{\tilde{g}}}_{k}\right\| }{\left\| {{\tilde{B}}}_{k}\right\| } \), and therefore

$$\begin{aligned} \min \left( \varDelta _{k}, \frac{\left\| {{\tilde{g}}}_{k}\right\| }{\left\| \tilde{B}_{k}\right\| }\right) \ge \frac{{\gamma }}{\nu rM}. \end{aligned}$$
(33)

If the step \(p_k\) is accepted, we have from Step 12 of Algorithm 1 that \(\rho _k>c_0\), which by (7) is equivalent to

$$\begin{aligned} \frac{{{\tilde{f}}}\left( x_{k}\right) -\tilde{f}\left( x_{k}+p_{k}\right) +r\epsilon _f}{m_{k}(0)-m_{k}\left( p_{k}\right) +r\epsilon _f}>c_0. \end{aligned}$$
(34)

Thus by (12), (29), (33) and (30)

$$\begin{aligned} {{\tilde{f}}}\left( x_{k}\right) -{{\tilde{f}}}\left( x_{k}+p_{k}\right)>&c_0\left[ m_k(0) -m_k(p_k)\right] + r(c_0-1)\epsilon _f \nonumber \\ \ge&\frac{{c_0}}{2}\left\| {{\tilde{g}}}_{k}\right\| \min \left( \varDelta _{k}, \frac{\left\| {{\tilde{g}}}_{k}\right\| }{\left\| {{\tilde{B}}}_{k}\right\| }\right) + r(c_0-1)\epsilon _f\nonumber \\>&\frac{{c_0}}{2\nu r M }\left( r\epsilon _g + {\gamma }\right) {\gamma }+ r(c_0-1)\epsilon _f\nonumber \\ >&\frac{{c_0}}{2\nu r M }\left( r\epsilon _g + \eta \right) \eta + r(c_0-1)\epsilon _f . \end{aligned}$$
(35)

We now choose \(\eta \) so that the right-hand side is nonnegative. Treating it as a quadratic inequality in \(\eta \), we obtain

$$\begin{aligned} \eta \ge \frac{1}{2}\left( - r\epsilon _g+\beta \right) \quad \text {or} \quad \eta \le \frac{1}{2}\left( - r\epsilon _g-\beta \right) \end{aligned}$$

We wish for \(\eta \) to be the smallest positive value satisfying these inequalities, yielding

$$\begin{aligned} \eta = \frac{1}{2}\left( - r\epsilon _g+\beta \right) . \end{aligned}$$
(36)

Substituting this quantity in (35), we have

$$\begin{aligned} {{\tilde{f}}}\left( x_{k}\right) -{{\tilde{f}}}\left( x_{k}+p_{k}\right)&> \frac{{c_0}}{2\nu r M }\left( r\epsilon _g + {\gamma }\right) {\gamma }+ r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left( r\epsilon _g + \eta + \mu \right) (\eta + \mu )+ r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left( r\epsilon _g + \frac{1}{2}\left( - r\epsilon _g+\beta \right) + \mu \right) \left( \frac{1}{2}\left( - r\epsilon _g+\beta \right) + \mu \right) \\&\quad + r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left( r\epsilon _g/2 + \beta /2+ \mu \right) \left( - r\epsilon _g/2+\beta /2 + \mu \right) + r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left[ \left( \beta /2+ \mu \right) ^2 - \left( r\epsilon _g/2\right) ^2\right] + r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left[ (\beta /2)^2+\mu \beta +\mu ^2 - \left( r\epsilon _g/2\right) ^2\right] + r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left[ \frac{\beta ^2- (r\epsilon _g)^2}{4}+\mu \beta +\mu ^2 \right] + r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left[ \frac{(r\epsilon _g)^2 + 8\nu r^2\left( \frac{1}{c_0}-1\right) M \epsilon _f - (r\epsilon _g)^2}{4}+\mu \beta +\mu ^2 \right] \\&\quad + r(c_0-1)\epsilon _f\\&=\frac{{c_0}}{2\nu r M }\left[ { 2\nu r^2\left( \frac{1}{c_0}-1\right) M \epsilon _f}+\mu \beta +\mu ^2 \right] + r(c_0-1)\epsilon _f\\&= r(1-c_0)\epsilon _f + \frac{{c_0}}{2\nu r M }\left( \mu \beta +\mu ^2 \right) + r(c_0-1)\epsilon _f\\&= \frac{c_0}{2\nu r M }\left( \mu \beta + \mu ^2\right) . \end{aligned}$$

\(\square \)

The first inequality in (29), together with (30) and (31), identifies the region where noise does not dominate and progress in the objective function can be guaranteed. The constant \(\mu \) was introduced to ensure that our analysis is meaningful in the case when noise is not present (\(\epsilon _f=\epsilon _g=0\)), as it shows that a decrease in the objective is achieved. Nonetheless, the global convergence results presented below are of interest only when noise is present, so there we essentially absorb \(\mu \) into \(\eta \) by setting \(\mu = \epsilon _g/2\).

To summarize the results obtained so far, Lemma 2 states that when \(\Vert {{\tilde{g}}}_k\Vert \) is large enough, the trust region is either large enough or will eventually be increased to be so. Lemma 3 states that when the gradient and trust region are both large enough, every accepted iterate reduces the noisy objective function by a non-vanishing amount. We show that this drives iterations towards stationary points of the problem.

3.4 Global convergence theorems

Our global convergence results are presented in two parts. The first result states that the iterates visit, infinitely often, a critical region characterized by a small gradient norm. The second result states that after visiting the above critical region for the first time, the iterates cannot stray too far from it, as measured by the objective value.

Theorem 6

(Global Convergence to Critical Region) Suppose that Assumptions 1 through 5 are satisfied. Then, the sequence of iterates \(\{x_k\}\) generated by Algorithm 1 visits infinitely often the critical region \({C}_1\) defined as

$$\begin{aligned} {C}_1 =\left\{ x: \Vert g(x)\Vert \le \left( r+1\right) \epsilon _g + \frac{ \beta }{2} \right\} , \end{aligned}$$
(37)

where r and \(\beta \) are defined in (8), (30), (31), with \(\mu = \epsilon _g/2\), \(\nu > 1\) and M given by (17).

Proof

Assume by way of contradiction that there exists \(K'\) such that for all \(k>K'\)

$$\begin{aligned} \Vert g(x_k)\Vert > \left( r+1\right) \epsilon _g + \frac{\beta }{2}. \end{aligned}$$
(38)

Thus, by (3), definition (31) of \(\eta \), and setting \(\mu = \epsilon _g/2\), we have that for all \(k > K'\)

$$\begin{aligned} \Vert {{\tilde{g}}}(x_k)\Vert&> r \epsilon _g + \tfrac{1}{2}\beta \nonumber \\&= -\tfrac{1}{2}r\epsilon _g+ \tfrac{1}{2}\beta + \tfrac{3}{2}{r\epsilon _g}\nonumber \\&= \eta + r\epsilon _g + \tfrac{1}{2}{r\epsilon _g}\nonumber \\&> \, r\epsilon _g + \eta + \mu \qquad \qquad \text{(since } r>1\text{)}\nonumber \\&= r\epsilon _g+{\gamma }. \qquad \quad \qquad \text{(by (30))} \end{aligned}$$
(39)

We now apply Corollary 1 and deduce that there exists \(K_0\ge K'\) such that for all \(k \ge K_0\),

$$\begin{aligned} \varDelta _k > \frac{{\gamma }}{\nu r M }. \end{aligned}$$
(40)

When a step is not accepted, \(\rho _k \le c_0 < c_1\), and Algorithm 1 reduces the trust region radius. If no step were accepted for all \(k > K_0\), the trust region radius would shrink to zero, contradicting (40). Therefore, there must exist infinitely many accepted steps. Now, by (39) and (40), the conditions of Lemma 3 hold, and we deduce that each accepted step at an iteration \( k'>K_0\) achieves the reduction

$$\begin{aligned} {{\tilde{f}}}\left( x_{k'}\right) -{{\tilde{f}}}\left( x_{k'}+p_{k'}\right) > \frac{c_0}{2\nu r M }\left( \mu \beta + \mu ^2\right) = \frac{c_0}{2\nu r M }\left( \frac{\epsilon _g}{2}\beta +\frac{\epsilon _g^2}{4}\right) . \end{aligned}$$
(41)

Since, as mentioned above, there is an infinite number of accepted steps, we deduce that \( \{{{\tilde{f}}}(x_k)\} \rightarrow - \infty \), contradicting Assumption 5. Therefore, the index \(K'\) defined above cannot exist and we have that (38) is violated an infinite number of times. \(\square \)

The achievable accuracy in the gradient guaranteed in (37) depends on \(\epsilon _g\) and \(\sqrt{\epsilon _f}\), by the definition of \(\beta \). The dependence on \(\epsilon _g\) is evident, while the dependence on \(\sqrt{\epsilon _f}\) is due to the combined (multiplicative) effect of the gradient and the trust region radius bound.
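The scaling just described can be made concrete: with \(\epsilon _g = 0\), the radius of the critical region (37) reduces to \(\beta /2 = {\mathcal {O}}(\sqrt{\epsilon _f})\). The helper below evaluates the threshold \((r+1)\epsilon _g + \beta /2\) for illustrative values of the constants; the default values of \(M\), \(c_0\), \(c_2\), and \(\nu \) are placeholders.

```python
import math

def critical_gradient_threshold(eps_f, eps_g, M=1.0, c0=0.1, c2=0.5, nu=2.0):
    """Radius of the critical region C_1 in (37): (r+1)*eps_g + beta/2,
    with r from (8) and beta from (31); the constants are placeholders."""
    r = 2.0 / (1.0 - c2)
    beta = math.sqrt((r * eps_g)**2
                     + 8.0 * nu * r**2 * (1.0 / c0 - 1.0) * M * eps_f)
    return (r + 1.0) * eps_g + 0.5 * beta

# With eps_g = 0 the threshold equals beta/2, proportional to sqrt(eps_f):
# shrinking eps_f by a factor of 100 shrinks the threshold by a factor of 10.
```

This makes the square-root dependence on the function-noise level directly visible.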

Before stating our next theorem, we prove two simple technical results.

Proposition 1

If Algorithm 1 takes a (nonzero) step at iteration k, then

$$\begin{aligned} {{\tilde{f}}}_{k+1} - {{\tilde{f}}}_k < r(1-c_0)\epsilon _f. \end{aligned}$$
(42)

Proof

If the step is taken, we have from Step 12 of Algorithm 1 that \(\rho _k>c_0\), which by (7) is equivalent to

$$\begin{aligned} \frac{\tilde{f}\left( x_{k}\right) -\tilde{f}\left( x_{k}+p_{k}\right) +r\epsilon _f}{m_{k}(0)-m_{k}\left( p_{k}\right) +r\epsilon _f}>c_0,\end{aligned}$$
(43)

and since \(p_k\) cannot increase the model \(m_k\), we have

$$\begin{aligned} {{\tilde{f}}}\left( x_{k}\right) -{{\tilde{f}}}\left( x_{k}+p_{k}\right)> c_0\left[ m_k(0) -m_k(p_k)\right] + r(c_0-1)\epsilon _f > r(c_0-1)\epsilon _f.\qquad \end{aligned}$$
(44)

\(\square \)

Next, we employ Lemma 2 and obtain the following result.

Corollary 2

(Maintaining Lower Bound on Trust Region Radius) Let \({\gamma }>0\) be defined by (30)–(31), and suppose there exist \(K>0\) and \({\hat{K}} > K\) such that for \(k = K+1, \ldots , {\hat{K}}-1\)

$$\begin{aligned} \Vert {{\tilde{g}}}_k \Vert > r\epsilon _g+{\gamma }, \end{aligned}$$
(45)

and that

$$\begin{aligned} \varDelta _{K+1} \ge \frac{\gamma }{\nu rM} = \frac{\bar{\varDelta }}{\nu }. \end{aligned}$$
(46)

Then for \(k = K+1,\ldots , {\hat{K}}-1\)

$$\begin{aligned} \varDelta _k \ge \frac{\gamma }{\nu rM}=\frac{\bar{\varDelta }}{\nu }. \end{aligned}$$
(47)

Proof

The proof is by induction. Condition (47) holds for \(k= K+1\). We show that if (47) holds for some \(k \in \{K+1,\ldots , {\hat{K}}-2\}\), then it holds for \(k+1\).

Specifically, suppose that for such k we have that

$$\begin{aligned} \varDelta _k \ge \frac{\gamma }{\nu rM}. \end{aligned}$$
(48)

By Lemma 2, if \(\varDelta _k \le \frac{\gamma }{rM}\), the trust region radius is increased, i.e.,

$$\begin{aligned} \varDelta _{k+1} = \nu \varDelta _k \ge \frac{\gamma }{ rM}>\frac{\gamma }{\nu rM}. \end{aligned}$$
(49)

If on the other hand \(\varDelta _k > \frac{\gamma }{rM}\), the trust region radius could be decreased, but in that case

$$\begin{aligned} \varDelta _{k+1} \ge \frac{\varDelta _k}{\nu }>\frac{\gamma }{\nu rM}. \end{aligned}$$
(50)

\(\square \)

The next theorem shows that after an iterate has entered the neighborhood \(C_1\) defined in Theorem 6, all subsequent iterates cannot stray too far away in the sense that their function values remain within a band of the largest function value in \(C_1\).

Theorem 7

(Iterates Remain in the Level Set \(C_2\)) Suppose that Assumptions 1 through 5 are satisfied. Then, after the iterates \(x_k\) generated by Algorithm 1 visit \(C_1\) for the first time, they never leave the set \(C_2\) defined as

$$\begin{aligned} {C}_2 =\left\{ x: f(x) \le \sup _{y\in C_1} f(y) + 2\epsilon _f+ \max [G, r(1-c_0)\epsilon _f ]\right\} , \end{aligned}$$
(51)

where

$$\begin{aligned} G =\left[ (r+1) \epsilon _g + \gamma + \frac{ {\nu ^2} L \gamma }{(\nu - 1 )rM} \right] \frac{ \nu ^2 \gamma }{(\nu - 1)rM}, \end{aligned}$$
(52)

and \(\gamma \) is defined in (30)–(31) with \(\mu ={\epsilon _g}/2 \).

Proof

The proof is based on the observation that, when the iterates leave \(C_1\), if the trust region is large enough, then by Lemma 3 the noisy objective function starts decreasing immediately (Case 1); otherwise the smallness of the trust region limits the increase in the objective function before the trust region becomes large enough to ensure descent (Case 2). We now state this precisely.

Suppose that the \(K^{th}\) step is an exiting step, i.e., \(x_{K}\in C_1\) and \(x_{K+1}\notin C_1\). We let \({\hat{K}} > K+1\) be the index of the first iterate that returns to \(C_1\). Such a \({\hat{K}}\) exists due to Theorem 6. We will prove that all iterates \(x_k\) with \(k \in \{K+1,\ldots ,{\hat{K}}-1\}\) are contained in \(C_2\).

Since \(x_k\notin C_1\) for \(k \in \{K+1,\ldots ,{\hat{K}}-1\}\), we have by (37) that

$$\begin{aligned} \Vert g_k\Vert > \left( r+1\right) \epsilon _g + \frac{\beta }{2} , \end{aligned}$$
(53)

and we have seen in (38)–(39) that this implies that

$$\begin{aligned} \Vert \tilde{g}_k\Vert > r\epsilon _g+{\gamma }, \qquad \ k \in \{K+1, \ldots ,{\hat{K}}-1\}. \end{aligned}$$
(54)

Also, we know that a step was taken at iterate K since \(x_{K}\in C_1\) and \(x_{K+1}\notin C_1\), and thus applying Proposition 1 yields

$$\begin{aligned} {{\tilde{f}}}_{K+1} - {{\tilde{f}}}_K < r(1-c_0)\epsilon _f . \end{aligned}$$
(55)

We divide the rest of the proof according to the size of \(\varDelta _{K+1}\) relative to \({{\bar{\varDelta }}}\), which is defined in (20), i.e.,

$$\begin{aligned} {{\bar{\varDelta }}} = \frac{\gamma }{rM}. \end{aligned}$$
(56)

Case 1: Suppose \(\varDelta _{K+1} \ge {{\bar{\varDelta }}}\). By (54) and the fact that \(\nu > 1\), the conditions of Corollary 2 are satisfied and thus \(\varDelta _k \ge \frac{\gamma }{\nu rM}\), for \(k= K+1, \ldots , {\hat{K}}-1\). We can therefore apply Lemma 3, with \(\mu = \epsilon _g/2>0\), for each iterate \(k = K+1,\ldots ,{\hat{K}}-1\) to yield

$$\begin{aligned} {{\tilde{f}}}(x_{K+1}) \ge {{\tilde{f}}}(x_{K+2}) \ge \cdots \ge {{\tilde{f}}}(x_{{\hat{K}}}). \end{aligned}$$
(57)

Combining this result with (55) we obtain

$$\begin{aligned} {{\tilde{f}}}_k \le {{\tilde{f}}}_{K+1} < {{\tilde{f}}}_K + r(1-c_0)\epsilon _f, \quad k = K+1,\ldots ,{\hat{K}}. \end{aligned}$$
(58)

Since \( x_K\in C_1 \) and by (3), we conclude that for \(k = K+1,\ldots ,{\hat{K}}\),

$$\begin{aligned} f_k < f_K +[2+ r(1-c_0)]\epsilon _f \le \sup _{y\in C_1} f(y) +[2+ r(1-c_0)]\epsilon _f. \end{aligned}$$
(59)

Therefore, the inequality in (51) is satisfied in this case.

Case 2: Suppose \( \varDelta _{K+1} < {{\bar{\varDelta }}}\). We begin by considering the increase in the function value while the trust region remains less than \({{\bar{\varDelta }}}.\) To this end, we define

$$\begin{aligned} l = \left\lceil \log _{\nu }\left( \frac{{{\bar{\varDelta }}}}{\varDelta _{K+1}}\right) \right\rceil , \end{aligned}$$
(60)

where \(\lceil \cdot \rceil \) denotes the ceiling operation. Since the trust region radius is increased by a factor of at most \(\nu \), we have that l is the minimum number of steps required for the trust region radius to increase from \(\varDelta _{K+1}\) to (at least) \({{\bar{\varDelta }}}\). Now, if \( K+1+l > {\hat{K}}\), then the iterates return to \(C_1\) before the trust region becomes at least \({{\bar{\varDelta }}}\). Therefore, the number of out-of-\(C_1\) iterations taken by the algorithm while \(\varDelta _k < {{\bar{\varDelta }}}\) is

$$\begin{aligned} {{\hat{l}} = \min \{l - 1, {\hat{K}} - K -1\}.} \end{aligned}$$
(61)

The increase in function values for iterations indexed by \(k = K+1, \ldots , K+{\hat{l}} + 1\) is bounded as follows:

$$\begin{aligned} | f(x_{k }) - f(x_{K})|&\le \sum _{i = 0}^{k-K-1} | f(x_{K+1+i}) - f(x_{K+i})| \nonumber \\&\le \sum _{i = 0}^{{{\hat{l}}}} | f(x_{K+1+i}) - f(x_{K+i})| \nonumber \\&\le \sum _{i = 0}^{{{\hat{l}}}} \varDelta _{K+i} \max _{x\in [x_{K+i},x_{ K+1+i}]} \Vert g(x)\Vert \nonumber \\&= \sum _{i = 0}^{{{\hat{l}}}} \varDelta _{K+i} \max _{x\in [x_{K+i},x_{ K+1+i}]}\Vert g(x) - g(x_{K+i}) + g(x_{K+i})\Vert \nonumber \\&\le \sum _{i = 0}^{{{\hat{l}}}} \varDelta _{K+i} \big [ \Vert g(x_{K+i})\Vert + L\varDelta _{K+i} \big ] \qquad \text{(by } \text{(13)) }. \end{aligned}$$
(62)

To estimate the right hand side, we need to bound the total displacement made by the algorithm during those iterations. It follows from (60) that

$$\begin{aligned} {{\bar{\varDelta }}}/\nu \le \nu ^{l-1} \varDelta _{K+1}< {{\bar{\varDelta }}}\le \nu ^l\varDelta _{K+1}, \end{aligned}$$
(63)

and thus for \(i = 0,\ldots , {{\hat{l}}}\),

$$\begin{aligned} \varDelta _{K+1+i} \le \nu ^i\varDelta _{K+1}\le \nu ^{{\hat{l}}}\varDelta _{K+1} \le \nu ^{l-1}\varDelta _{K+1}< {{\bar{\varDelta }}}. \end{aligned}$$
(64)

By (54) and (64), we can apply Lemma 2 at the iterates \(k = K+1+i\), for \(i = 0,\ldots , {\hat{l}}\), and obtain

$$\begin{aligned} \varDelta _{K+i+2} = \nu \varDelta _{K+i+1}, \qquad i = 0,\ldots , {\hat{l}}. \end{aligned}$$
(65)

Thus for \(i = 0,\ldots , {\hat{l}}\),

$$\begin{aligned} \varDelta _{K+1+i} = \nu ^i\varDelta _{K+1}\le \nu ^{{\hat{l}}}\varDelta _{K+1} \le \nu ^{l-1}\varDelta _{K+1}< {{\bar{\varDelta }}}. \end{aligned}$$
(66)

Summing from \(i = 0\) to \({{\hat{l}}}\), we have

$$\begin{aligned} \sum _{i = 0}^{{{\hat{l}}}} \varDelta _{K+1+i} = \sum _{i = 0}^{{{\hat{l}}}} \nu ^{i}\varDelta _{K+1}< \frac{{{\bar{\varDelta }}}}{\nu ^{{\hat{l}}}} \sum _{i = 0}^{{{\hat{l}}}} \nu ^{i} = \frac{{{\bar{\varDelta }}}}{\nu ^{{{\hat{l}}}}} \frac{ \nu ^{{{\hat{l}}}+1}-1}{\nu -1} < \frac{{{\bar{\varDelta }}}}{\nu ^{{{\hat{l}}}}} \frac{ \nu ^{{{\hat{l}}}+1}}{\nu -1} = \frac{ \nu }{\nu -1}{{\bar{\varDelta }}}.\nonumber \\ \end{aligned}$$
(67)

By assumption, \(\varDelta _{K+1}< {{\bar{\varDelta }}}\), which implies \(\varDelta _K<\nu {{\bar{\varDelta }}}\); adding this to (67) we obtain

$$\begin{aligned} \sum _{i = 0}^{{{\hat{l}}}+ 1} \varDelta _{K+i} < \frac{ \nu ^2}{\nu -1}{{\bar{\varDelta }}}. \end{aligned}$$
(68)

Therefore, for \(i=0, \ldots , {{\hat{l}}}\),

$$\begin{aligned} \Vert g(x_{K+i})\Vert +L\varDelta _{K+i}&= \Vert g(x_{K}) + \sum _{j = 0 }^ {i-1}\left[ g(x_{K+j+1}) - g(x_{K+j})\right] \Vert +L\varDelta _{K+i} \nonumber \\&\le \left\| g(x_{K}) \right\| + \sum _{j = 0 }^ {i-1}\left\| g(x_{K+j+1}) - g(x_{K+j})\right\| +L\varDelta _{K+i} \nonumber \\&\le \left\| g(x_{K}) \right\| + \left( \sum _{j = 0 }^ {i-1}L\varDelta _{K+j}\right) +L\varDelta _{K+i} \nonumber \\&< \left\| g(x_{K}) \right\| + L\sum _{j = 0 }^ {{\hat{l}} + 1 }\varDelta _{K+j} \quad (\text{ since } \ i< {{\hat{l}}}+1) \nonumber \\&< \left\| g(x_{K}) \right\| + \frac{\nu ^2}{\nu - 1} L {{\bar{\varDelta }}} \qquad (\text{ by } \text{(68) }). \end{aligned}$$
(69)

Substituting this inequality into (62), we obtain for any \(k = K+1,\ldots , K+{\hat{l}} + 1\),

$$\begin{aligned} | f(x_{k }) - f(x_{K})|&\le \sum _{i = 0}^{{{\hat{l}}}} \varDelta _{K+i} \left[ \left\| g(x_{K}) \right\| + \frac{\nu ^2}{\nu - 1} L {{\bar{\varDelta }}}\right] \nonumber \\&< \left[ \Vert g(x_{K})\Vert + \frac{\nu ^2}{\nu - 1} L {{\bar{\varDelta }}} \right] \frac{\nu ^2}{\nu - 1} {{\bar{\varDelta }}} \nonumber \\&\le \left[ (r+1) \epsilon _g + \gamma + \frac{\nu ^2}{\nu - 1} L {{\bar{\varDelta }}} \right] \frac{\nu ^2}{\nu - 1} {{\bar{\varDelta }}} \qquad (\hbox { since}\ x_K \in C_1) \nonumber \\&= \left[ (r+1) \epsilon _g + \gamma + \frac{ {\nu ^2} L \gamma }{(\nu - 1 )rM} \right] \frac{ \nu ^2 \gamma }{(\nu - 1)rM} \qquad \text{(by } \text{(56)) } \nonumber \\&= G. \end{aligned}$$
(70)

Therefore, for \(k=K+1, \ldots , K+ {\hat{l}} + 1\),

$$\begin{aligned} f(x_{k}) < f(x_{K}) + G\le \sup _{y\in C_1} f(y) + G. \end{aligned}$$
(71)

We now consider two possibilities.

Case 2a: Suppose \( K+1+l > {\hat{K}}\). Then, \({\hat{K}} - K -1 \le l-1\) and by (61) we have that \( {\hat{l}} ={\hat{K}} - K -1 \). Condition (71) thus reads

$$\begin{aligned} f(x_{k}) < f(x_{K}) + G\le \sup _{y\in C_1} f(y) + G, \qquad k=K+1, \ldots ,{\hat{K}}, \end{aligned}$$
(72)

and thus the inequality in (51) is satisfied for \(k=K+1, \ldots ,{\hat{K}} -1 \).

Case 2b: Suppose \(K+1+l \le {\hat{K}}\). Then, by (61) we have that \( {\hat{l}} = l - 1\), and (71) reads

$$\begin{aligned} f(x_{k}) < f(x_{K}) + G\le \sup _{y\in C_1} f(y) + G \qquad k=K+1, \ldots , K + l. \end{aligned}$$
(73)

Let us now consider the iterates following \(K+l\) that are outside \(C_1\), i.e., those indexed by \( k = K+l+1, \ldots , {\hat{K}} -1\). Letting \(i = {{\hat{l}}}= l-1\) in (66) and recalling the first inequality in (63),

$$\begin{aligned} \varDelta _{K+l} = \nu ^{l-1} \varDelta _{K+1} \ge \frac{{{\bar{\varDelta }}}}{\nu }. \end{aligned}$$
(74)

We can therefore apply Corollary 2 to iterates indexed by \( k = K+l+1,\ldots , {\hat{K}} -1\) and deduce that

$$\begin{aligned} \varDelta _k \ge \frac{{{\bar{\varDelta }}}}{\nu }, \quad k= K+l+1,\ldots , {\hat{K}} -1. \end{aligned}$$

This fact, together with (54), allows us to invoke Lemma 3, for \( k = K+l,\ldots , {\hat{K}} -1\), to yield

$$\begin{aligned} {{\tilde{f}}}(x_{K+l}) \ge {{\tilde{f}}}(x_{K+1+l}) \ge {{\tilde{f}}}(x_{K+2+l}) \ge \ldots \ge {{\tilde{f}}}(x_{{\hat{K}}}). \end{aligned}$$
(75)

Recalling (73) with \(k = K+l\) and using (3) we obtain

$$\begin{aligned} {{\tilde{f}}}(x_{K+l}) < \sup _{y\in C_1} f(y) +G + \epsilon _f. \end{aligned}$$
(76)

This condition together with (75) yields

$$\begin{aligned} f(x_{k})\le {{\tilde{f}}}(x_{K+l}) + \epsilon _f< \sup _{y\in C_1} f(y) +G + 2\epsilon _f \qquad k = K+l,\ldots , {\hat{K}}-1 .\nonumber \\ \end{aligned}$$
(77)

Combining this bound with (73) we conclude

$$\begin{aligned} f(x_{k})< \sup _{y\in C_1} f(y) +G + 2\epsilon _f, \qquad k = K+1,\ldots , {\hat{K}}-1, \end{aligned}$$
(78)

and thus the inequality in (51) is satisfied. \(\square \)

The constant G defined in (52) is proportional to \(\epsilon _g^2,\epsilon _g\sqrt{\epsilon _f}, \epsilon _f\). Since G characterizes the bound on the function values, the dependence on \(\epsilon _f\) is expected; the dependence on \(\epsilon _g\) and \(\epsilon _g\sqrt{\epsilon _f}\) arises from the combined effect of the trust region radius and gradient norm.

We should point out that one can find examples for which our convergence results are not informative because the critical region \(C_1\) can be all of \({\mathbb {R}}^n\), as pointed out by a referee. Such an example is given by a smoothed absolute value function, with appropriate parameters and a given noise level. However, such examples are not typical.

4 Numerical experiments

To illustrate the performance of the proposed Algorithm 1, we coded it in Matlab and applied it to a small selection of unconstrained optimization problems. We injected uniformly distributed noise into the evaluations of the function and gradient. Specifically, we let (cf. (2))

$$\begin{aligned} \delta _f = X_f\in {\mathbb {R}}, \quad X_f \sim U(-\epsilon _f,\epsilon _f);\quad \ \delta _g = X_g\in {\mathbb {R}}^{n}, \quad X_g\sim {\mathbb {B}}_n(0,\epsilon _g), \end{aligned}$$
(79)

where \(U(-a,a)\) denotes the uniform distribution from \(-a\) to a, and \({\mathbb {B}}_n(0,a)\) denotes the n-dimensional ball centered at 0 with radius a. By generating noise in this way we satisfy Assumption 2.
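For illustration, the noise model (79) can be sketched in Python as follows (the ball sampler below is one standard recipe; the paper does not prescribe a particular one, and the function names are ours):

```python
import numpy as np

def sample_function_noise(eps_f, rng):
    """Scalar noise delta_f ~ U(-eps_f, eps_f), cf. (79)."""
    return rng.uniform(-eps_f, eps_f)

def sample_gradient_noise(eps_g, n, rng):
    """Vector noise delta_g drawn uniformly from the ball B_n(0, eps_g).

    Standard recipe (an assumption; any uniform-in-ball sampler works):
    a uniformly random direction, scaled by a radius eps_g * U**(1/n),
    so that the density is uniform over the ball's volume.
    """
    direction = rng.standard_normal(n)
    direction /= np.linalg.norm(direction)
    radius = eps_g * rng.uniform() ** (1.0 / n)
    return radius * direction
```

Both samplers produce noise bounded by \(\epsilon _f\) and \(\epsilon _g\) respectively, as required by Assumption 2.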

We set the parameters in Algorithm 1 as follows: \(c_0 = 0.1, c_1 = 1/4, c_2 = 1/2\) and \(\nu = 2\). The solution of the trust region subproblem (Step 3 of Algorithm 1) was computed using the standard Newton-CG method described e.g. in [27], with termination accuracy \(10^{-8}\). In order to better illustrate the performance of the algorithm in the presence of noise, we did not include a stop test and simply ran it for 200 iterations, which was sufficient to observe its asymptotic behavior.
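The heart of the modification, the relaxed ratio test of (7) used in Step 12, can be sketched as follows (a minimal sketch with our own helper names; the full radius-update logic of Algorithm 1 is omitted):

```python
def relaxed_ratio(f_curr, f_trial, model_decrease, r, eps_f):
    """Noise-tolerant actual/predicted reduction ratio, cf. (7) and (43):
    both numerator and denominator are relaxed by the term r * eps_f."""
    return (f_curr - f_trial + r * eps_f) / (model_decrease + r * eps_f)

def accept_step(f_curr, f_trial, model_decrease, r, eps_f, c0=0.1):
    # Step acceptance: take the step whenever rho_k > c0 (cf. Step 12).
    return relaxed_ratio(f_curr, f_trial, model_decrease, r, eps_f) > c0
```

Because of the relaxation, a step that slightly increases the noisy function value can still be accepted; Proposition 1 shows that any such increase is bounded by \(r(1-c_0)\epsilon _f\).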

4.1 Failure of the classical trust region algorithm

We present two examples showing failure of the classical trust region algorithm, in contrast with Algorithm 1. First, we consider the simple quadratic function

$$\begin{aligned} f = x^T D x, \end{aligned}$$
(80)

where \(x\in {\mathbb {R}}^8\) and D is the scaled identity matrix \( D = 10^{-5}\, {\mathbb {I}}_8. \) We set \(\epsilon _f = 10\) and \(\epsilon _g = 0.01\) in (79). The Hessian of the quadratic model (4) was defined as \(B_k = \nabla ^2 f(x_k)\); i.e., we did not inject noise into it in this experiment. We started both algorithms from \(x_0 = (10^5,10^5,10^5,0,0,\ldots ,0)\), with an initial trust region radius \(\varDelta _0=1\). The results are displayed in Fig. 1.
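The setup of this experiment is easily reproduced; a sketch (in Python rather than our Matlab code, with variable names chosen here):

```python
import numpy as np

n = 8
D = 1e-5 * np.eye(n)          # scaled identity, cf. (80)

def f(x):
    return x @ D @ x          # f(x) = x^T D x

def grad(x):
    return 2.0 * (D @ x)      # (D + D^T) x = 2 D x, since D is symmetric

x0 = np.zeros(n)
x0[:3] = 1e5                  # x0 = (1e5, 1e5, 1e5, 0, ..., 0)
Delta0 = 1.0                  # initial trust region radius
```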

Fig. 1

New and classical trust region algorithms applied to a simple quadratic problem

The four panels in Fig. 1 compare the performance of the classical algorithm (red dashed line) and Algorithm 1 (blue solid line). The horizontal axis in each panel records the iteration number. In the upper left panel (a) we report the norm of the (noiseless) gradient \(\Vert \nabla f(x_k)\Vert \), along with the injected noise level \(\epsilon _g\) (dashed black line); the light blue dashed line plots the lowest value generated by Algorithm 1 in the past 25 iterations. We also plot in purple the size of the critical region \(C_1\), i.e. the value of the right-hand side in (37). For this particular problem, the \(C_1\) level norm of the gradient is roughly 0.8714. In the upper right panel (b) we report the trust region radius; in the lower-left panel (c) the distance to solution; and in the lower right panel (d), the computed actual-to-predicted reduction ratio \(\rho _k\); for graphical clarity, ratios greater than 5 or less than \(-5\) were plotted as \(\pm 5\) in panel (d).

We observe that the classical algorithm exhibits large oscillations in \(\rho _k\), which cause the trust region radius to shrink so much that significant progress cannot be made. In contrast, \(\rho _k\) is controlled well in Algorithm 1. In this test, the initial trust region radius \(\varDelta _0\) is not small.

In the next experiment, we illustrate the damaging effect that a very small \(\varDelta _0\) can have on the classical algorithm, but not on the proposed algorithm. We applied the two algorithms to the tri-diagonal function

$$\begin{aligned} f(x)= \frac{1}{2}\left( x^{(1)}-1\right) ^{2}+\frac{1}{2} \sum _{i=1}^{N-1}\left( x^{(i)}-2 x^{(i+1)}\right) ^{4}, \quad N=200. \end{aligned}$$
(81)
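For reference, the tridiagonal function (81) and its exact gradient can be implemented as follows (a sketch in Python rather than our Matlab code):

```python
import numpy as np

N = 200

def f(x):
    # Tridiagonal test function (81)
    d = x[:-1] - 2.0 * x[1:]
    return 0.5 * (x[0] - 1.0) ** 2 + 0.5 * np.sum(d ** 4)

def grad(x):
    # Exact gradient: each quartic term contributes to x^{(i)} and x^{(i+1)}
    d = x[:-1] - 2.0 * x[1:]
    g = np.zeros_like(x)
    g[0] = x[0] - 1.0
    g[:-1] += 2.0 * d ** 3    # derivative with respect to x^{(i)}
    g[1:] -= 4.0 * d ** 3     # derivative with respect to x^{(i+1)}
    return g
```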

The results are reported in Fig. 2. In the upper left panel, we again plot in purple the size of the critical region \(C_1\), i.e. the value of the right-hand side in (37). (The latter requires knowledge of the constant M, which we approximate by the norm of the Hessian at the solution.) This panel shows that the theoretical prediction given in Theorem 6 is pessimistic when compared to the final achieved accuracy in the gradient, as is to be expected of convergence results that assume that the largest possible error occurs at every iteration. The upper right-hand panel illustrates that Algorithm 1 is able to quickly increase the trust region radius and allow progress, unlike the classical algorithm.

Fig. 2

New and classical trust region algorithms initialized with small trust region radius

4.2 General performance of the proposed algorithm

We also tested the two algorithms on a subset of problems from [33]; the results are presented in the supplementary material. As a representative of these runs, we report the results for the tri-diagonal objective function (81). This time, the Hessian \( B_k\) of the quadratic model (4) is obtained by injecting noise in the true Hessian matrix. We define

$$\begin{aligned} B_k = \nabla ^2 f(x_k) + \delta _B, \end{aligned}$$
(82)
$$\begin{aligned} \delta _B = \frac{A^T \varLambda A}{\Vert A\Vert ^2}, \quad A_{ij}\sim U(0,1), \quad (\varLambda )_{ii} \sim U(-\epsilon _B,\epsilon _B), \end{aligned}$$
(83)

where \(\varLambda \) is a diagonal matrix. Thus, the matrices \(B_k\) are symmetric but not necessarily positive definite. We employed larger noise levels than in the previous experiments: \(\epsilon _f = 10\), \(\epsilon _g = 100\), and \(\epsilon _B = 1000\). This simulates the situation that may occur when employing finite difference approximations, where the error increases with the order of differentiation. Both algorithms were initialized from the same starting point \(x_0\), which was generated such that each entry in \(x_0\) is sampled uniformly from \(-50\) to 50. To ensure a fair comparison, at each iterate we inject exactly the same noise into both algorithms.
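The construction (82)–(83) can be sketched as follows (since the matrix norm in (83) is not specified here, the spectral norm is assumed; function names are ours):

```python
import numpy as np

def hessian_noise(n, eps_B, rng):
    """Symmetric perturbation delta_B from (83): A has U(0,1) entries,
    Lambda is diagonal with U(-eps_B, eps_B) entries, and the product is
    normalized by ||A||^2 (spectral norm assumed here), which guarantees
    ||delta_B||_2 <= eps_B."""
    A = rng.uniform(0.0, 1.0, size=(n, n))
    lam = rng.uniform(-eps_B, eps_B, size=n)
    return A.T @ np.diag(lam) @ A / np.linalg.norm(A, 2) ** 2

def noisy_hessian(H, eps_B, rng):
    # B_k = exact Hessian plus symmetric, possibly indefinite noise, cf. (82)
    return H + hessian_noise(H.shape[0], eps_B, rng)
```

Since \(\varLambda \) may have negative diagonal entries, the resulting \(B_k\) is symmetric but not necessarily positive definite, matching the text.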

We report the results in Fig. 3, which displays the same information as in Fig. 2. We observe that both algorithms perform similarly before entering the noisy regime. Algorithm 1 exhibits larger oscillations in the gradient norm due to the larger trust region radius, but achieves a lower objective function value. Although the sharp reduction in the trust region radius led to failures of the classical algorithm in the examples reported above, in many test runs, such as the one shown in Fig. 3, it can be beneficial, producing increasingly smaller steps that yield milder oscillations in the gradient norm than Algorithm 1. We cannot, however, recommend this type of trust region reduction as a general procedure for handling noise, since failures can occur unexpectedly.

Fig. 3

Comparison of the new and classical trust region algorithms when solving problem (81) with uniform noise given by (79) and (82)–(83)

4.3 Tests with unbounded noise

We performed tests to observe the behavior of Algorithm 1 in the presence of unbounded noise. We added either normally distributed noise or sub-exponentially distributed noise, and defined \(\epsilon _f, \epsilon _g\) as the standard deviation or the half maximum density level (in the case where the standard deviation is not defined).

For the case of normally distributed noise, the objective function is given by (81), and we defined

$$\begin{aligned} \delta _f = X_f\in {\mathbb {R}}, \ X_f \sim {\mathcal {N}}(0,\epsilon _f),\ \text{ and }\ \delta _g = X_g\in {\mathbb {R}}^{n}, \ X_g(i)\sim {\mathcal {N}}\left( 0,\frac{\epsilon _g}{\sqrt{N}}\right) .\nonumber \\ \end{aligned}$$
(84)

The results for a typical run are reported in Fig. 4; they show the robustness of Algorithm 1.

Fig. 4

Comparison of the new and classical trust region algorithms when solving problem (81) with normal noise given by (84)

We also performed tests where noise has a heavy tail distribution. We let

$$\begin{aligned} \delta _f = X_f\in {\mathbb {R}}, \ X_f \sim \textrm{Cauchy}(0,\epsilon _f),\ \ \delta _g = X_g\in {\mathbb {R}}^{n}, \ X_g(i)\sim \textrm{Cauchy}\left( 0,\frac{\epsilon _g}{\sqrt{N}}\right) .\nonumber \\ \end{aligned}$$
(85)

where \(\textrm{Cauchy}(\mu , \sigma )\) denotes the Cauchy distribution with statistical median \(\mu \) and half maximum density level \(\sigma \). The result of a typical run is reported in Fig. 5.
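The two unbounded noise models (84) and (85) can be generated as follows (a sketch with our own function names; per the text, \(\epsilon _f\) and \(\epsilon _g/\sqrt{N}\) are taken as standard deviations in (84) and as Cauchy scale parameters in (85)):

```python
import numpy as np

def normal_noise(eps_f, eps_g, n, rng):
    # (84): per-component standard deviation eps_g / sqrt(N) for the
    # gradient noise, so that E[||delta_g||^2] = eps_g^2.
    df = rng.normal(0.0, eps_f)
    dg = rng.normal(0.0, eps_g / np.sqrt(n), size=n)
    return df, dg

def cauchy_noise(eps_f, eps_g, n, rng):
    # (85): Cauchy(0, sigma) equals sigma times a standard Cauchy variable;
    # sigma is the half-maximum-density level (the scale parameter).
    df = eps_f * rng.standard_cauchy()
    dg = (eps_g / np.sqrt(n)) * rng.standard_cauchy(size=n)
    return df, dg
```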

Fig. 5

Comparison of the new and classical trust region algorithms when solving problem (81) with subexponential noise given by (85)

We repeated the experiments (for both types of noise) with larger and smaller values of \(\epsilon _f, \epsilon _g\) and found the performance of Algorithm 1 to be robust across all runs.

4.4 Evaluating the theoretical results

We have seen that the critical region \(C_1\) gives a pessimistic estimate of the achievable accuracy in the gradient because the analysis assumes worst-case behavior at each iteration, rather than providing estimates in high probability. Nevertheless, Theorem 6 identifies the functional relationship between the achievable accuracy and the noise level: the right hand side in (37) scales as a function of \(\epsilon _g\) and \(\sqrt{\epsilon _f}\). We performed numerical tests to measure if the accuracy achieved in practice scales in that manner.

We employed the tridiagonal function (81), for which we can estimate the constant M, as mentioned above. For given \(\epsilon _f\) and \(\epsilon _g\), we computed the right hand side in (37), which we denote as \(C(\epsilon _f,\epsilon _g)\), and ran Algorithm 1 as in the previous test. We repeated the run 10 times using different seeds, \(s=1,\ldots ,10\), to generate noise. For each run, we tracked the smallest value of \(\Vert {{\tilde{g}}}_k\Vert \) over the most recent 25 iterations and recorded the smallest such value observed during the run, which we denote as \(\Vert \tilde{g}^*({\epsilon _g,\epsilon _f,s})\Vert \), where s denotes the seed. In Fig. 6, we report the quantity

$$\begin{aligned} R(\epsilon _f,\epsilon _g) =\log _{10} \frac{C(\epsilon _f,\epsilon _g)}{\sum _{s = 1}^{10} \Vert \tilde{g}^*({\epsilon _g,\epsilon _f,s})\Vert } \end{aligned}$$
(86)

as we vary \(\epsilon _f\) and \(\epsilon _g\) from \(10^{-2}\) to \(10^2\). The fact that the ratio between the theoretical bound and the smallest gradient norm measured in practice remained roughly constant gives numerical support to the claim that the achievable gradient norm is proportional to \(\epsilon _g\) and \(\sqrt{\epsilon _f}\). We should note that these observations are valid only when averaging multiple runs with different seeds, as one can observe significant variations among individual runs of Algorithm 1.
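The quantity (86) is straightforward to compute once the per-seed minima are collected; a sketch (the values below are purely hypothetical, for illustration):

```python
import numpy as np

def R(C, best_norms):
    """Accuracy ratio (86): log10 of the theoretical bound C(eps_f, eps_g)
    over the sum of the best observed noisy-gradient norms across seeds."""
    return np.log10(C / np.sum(best_norms))

# Hypothetical per-seed minima ||g~*(eps_g, eps_f, s)||, s = 1, ..., 10:
best = [1.0] * 10
print(R(100.0, best))   # -> 1.0
```

An approximately constant \(R\) across a grid of \((\epsilon _f, \epsilon _g)\) values is what supports the scaling predicted by Theorem 6.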

Fig. 6

\(R(\epsilon _f,\epsilon _g)\) given in (86): \(\hbox {Log}_{{10}}\) of the ratio between predicted and actual accuracy in the gradient, as a function of the noise levels \(\epsilon _f, \epsilon _g\). The small variation in these numbers suggests that Theorem 6 gives the correct dependence on the noise levels

5 Final remarks

In this paper, we proposed a noise-tolerant trust region algorithm that avoids the pitfall of the classical algorithm, which can shrink the trust region prematurely, preventing progress toward a stationary point. Robustness is achieved by relaxing the ratio test used in the step acceptance, so as to account for errors in the function.

We showed that when the noise in the function and gradient evaluations is bounded by the constants \(\epsilon _f, \epsilon _g\), an infinite subsequence of iterates satisfies

$$\begin{aligned} \Vert g_k \Vert = O({\sqrt{\epsilon _f}}, \epsilon _g). \end{aligned}$$
(87)

When noise is not present, our results yield the limit \(\{ \Vert g_k\Vert \} \rightarrow 0\) (the sets \(C_1\) and \(C_2\) in Theorem 6 and Theorem 7 coincide in this case).

The technique and analysis presented here are relevant to the case when noise can be diminished as needed, as assumed e.g. in [7, 8, 15]. Algorithm 1 can be run until it ceases to make significant progress, at which point the accuracy in the function and gradient is increased (i.e., \(\epsilon _f, \epsilon _g\) are reduced) and the algorithm is restarted with the new value of \(\epsilon _f\) in (7); this process can then be repeated. This provides a disciplined approach for achieving high accuracy in the solution using a noise-tolerant trust region algorithm.