1 Introduction

Convex optimization is an important and well-studied subject of numerical analysis. The canonical setting for such problems is to find the minimum of a convex function \(E\) over a domain in \({\mathbb {R}}^d\). Various numerical algorithms have been developed for minimization problems, and a priori bounds for their performance have been proven. We refer the reader to [1, 9–11] for the core results in this area.

In this paper, we are concerned with the more general setting where \(E\) is defined on a domain \(D\) in a general Banach space \(X\) with norm \(\Vert \cdot \Vert =\Vert \cdot \Vert _X\). Thus, our main interest is in approximating

$$\begin{aligned} E^{*}:= \inf _{x\in D} E(x). \end{aligned}$$
(1.1)

Problems of this type occur in many important application domains, such as statistical estimation and learning, optimal control, and shape optimization. Another important motivation for studying such general problems, even for finite dimensional spaces \(X\), is that when the dimension \(d\) of \(X\) is large, we would like to obtain bounds on the convergence rate of a proposed algorithm that are independent of this dimension.

Solving (1.1) is an example of a high-dimensional problem and is known to suffer the curse of dimensionality without additional assumptions on \(E\) which serve to reduce its dimensionality. These additional assumptions take the form of smoothness restrictions on \(E\) and assumptions which imply that the minimum in (1.1) is attained on a subset of \(D\) with additional structure. Typical assumptions for the latter involve notions of sparsity or compressibility, which are by now heavily employed concepts for high-dimensional problems. We will always assume that there is a point \(x^*\in D\) where the minimum \(E^*\) is attained, \(E(x^*)=E^*\). We do not assume \(x^*\) is unique. Clearly, the set \(D^*=D^*(E)\subset D\) of all points where the minimum is attained is convex.

The algorithms studied in this paper utilize dictionaries \(\mathcal{D}\) of \(X\). A set of elements \(\mathcal{D}\subset X\) whose closed linear span coincides with \(X\) is called a symmetric dictionary if \(\Vert g\Vert :=\Vert g\Vert _X=1\) for all \(g\in \mathcal{D}\) and, in addition, \(g\in \mathcal{D}\) implies \(-g\in \mathcal{D}\). The simplest example of a dictionary is \(\mathcal{D}=\{\pm \varphi _j\}_{j\in \Gamma }\), where \(\{\varphi _j\}_{j\in \Gamma }\) is a Schauder basis for \(X\). In particular, for \(X={\mathbb {R}}^d\) one can take the canonical basis \(\{e_j\}_{j=1}^d\).

Given such a dictionary \(\mathcal{D}\), there are several types of domains \(D\) that are employed in applications. Sometimes, these domains are the natural domain of the physical problem. Other times, they are constraints imposed on the minimization problem to ameliorate high dimensionality. We mention the following three common settings.

Sparsity constraints The set \(\Sigma _n(\mathcal{D})\) of functions

$$\begin{aligned} f=\sum _{g\in \Lambda }c_gg,\quad \#(\Lambda )=n,\quad \Lambda \subset \mathcal {D}, \end{aligned}$$
(1.2)

is called the set of sparse functions of order \(n\) with respect to the dictionary \(\mathcal{D}\). One common assumption is to minimize \(E\) on the domain \(D=\Sigma _n(\mathcal{D})\), i.e., to look for an \(n\)-sparse minimizer of (1.1).

\(\ell _1\) constraints A more general setting is to minimize \(E\) over the closure \(A_1(\mathcal{D})\) (in \(X\)) of the convex hull of \(\mathcal{D}\). A slightly more general setting is to minimize \(E\) over one of the sets

$$\begin{aligned} \mathcal{L}_M:=\{g\in X:\ g/M\in A_1(\mathcal{D})\}. \end{aligned}$$
(1.3)

Sometimes \(M\) is allowed to vary as in model selection or regularization algorithms from statistics. This is often referred to as \(\ell _1\) minimization.

Unconstrained optimization Imposed constraints, such as sparsity or assuming \(D=A_1(\mathcal{D})\), are sometimes artificial and may not reflect the original optimization problem. We consider therefore the unconstrained minimization where \(D=X\). We always make the assumption that the minimum of \(E\) is actually attained. Therefore, there is a point \(x^*\in X\) where

$$\begin{aligned} E^*=E(x^*). \end{aligned}$$
(1.4)

We do not require that \(x^*\) is unique. Notice that in this case the minimum \(E^*\) is attained on the set

$$\begin{aligned} D_0:=\{x\in X: \ E(x)\le E(0)\}. \end{aligned}$$
(1.5)

In what follows, we refer to minimization over \(D_0\) as the unconstrained minimization problem.

A typical greedy optimization algorithm builds approximations to \(E^*\) of the form \(E(G_m), m=1,2,\ldots \), where the elements \(G_m\) are built recursively using the dictionary \(\mathcal{D}\) and typically are in \(\Sigma _m(\mathcal{D})\). We will always assume that the initial point \(G_0\) is chosen as the 0 element. Given that \(G_{m-1}\) has been defined, one first searches for a direction \(\varphi _m\in \mathcal{D}\) for which \(E(G_{m-1}+\alpha \varphi _m)\) decreases significantly as \(\alpha \) moves away from zero. Once \(\varphi _m\) is chosen, one selects \(G_{m}=G_{m-1}+\alpha _m\varphi _m\) or, more generally, \(G_{m}=\alpha _m'G_{m-1}+\alpha _m\varphi _m\), using some recipe for choosing \(\alpha _m\) or, more generally, \(\alpha _m,\alpha _m'\). Algorithms of this type are referred to as greedy algorithms and will be the object of study in this paper.

There are different strategies for choosing \(\varphi _m\) and \(\alpha _m,\alpha _m'\) (see, for instance, [2–4, 6, 8, 13, 19, 20] and [7]). One possibility to choose \(\varphi _m\) is to use the Fréchet derivative \(E'(G_{m-1})\) of \(E\) to choose a steepest descent direction. This approach has been amply studied, and various convergence results for steepest descent algorithms have been proven, even in the general Banach space setting. We refer the reader to the papers [17, 18, 20], which are representative of the convergence results known in this case. The selection of \(\alpha _m,\alpha _m'\) is commonly referred to as relaxation and is well studied in numerical analysis, although the Banach space setting needs additional attention.

Our interest in the present paper is in greedy algorithms that do not utilize \(E'\). They are preferred since \(E'\) is not given to us and therefore, in numerical implementations, must typically be approximated at any given step of the algorithm. We will analyze several different algorithms of this type, which are distinguished from one another by how \(G_{m}\) is obtained from \(G_{m-1}\), both in the selection of \(\varphi _m\) and in the parameters \(\alpha _m,\alpha _m'\). Our algorithms are built with ideas similar to the analogous, well-studied, greedy algorithms for approximation of a given element \(f\in X\). We refer the reader to [16] for a comprehensive description of greedy approximation algorithms.

In this introduction, we limit ourselves to two of the main algorithms studied in this paper. The first of these, which we call the relaxed \(E\)-Greedy algorithm (REGA(co)), was introduced in [20] under the name sequential greedy approximation.

Relaxed \(E\)-Greedy Algorithm (REGA(co)) We define \(G_0:=0\). For \(m\ge 1\), assuming \(G_{m-1}\) has already been defined, we take \(\varphi _m \in \mathcal{D}\) and \(0\le \lambda _m \le 1\) such that

$$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) = \inf _{0\le \lambda \le 1;g\in \mathcal{D}}E((1-\lambda )G_{m-1} + \lambda g) \end{aligned}$$

and define

$$\begin{aligned} G_m:= (1-\lambda _m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$

We assume that there exist such minimizing \(\varphi _m\) and \(\lambda _m\).
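To fix ideas, the following is a minimal sketch of a REGA(co) iteration in the illustrative finite-dimensional setting \(X={\mathbb {R}}^d\) with the canonical dictionary \(\mathcal{D}=\{\pm e_j\}_{j=1}^d\) mentioned above. The grid search over \(\lambda \) and the particular choice of \(E\) are our own simplifications; they stand in for the exact minimization assumed in the definition.

```python
import numpy as np

def rega_co_step(E, G_prev, dictionary, n_grid=200):
    """One REGA(co) iteration: search over all dictionary elements g and a
    grid of lambda in [0, 1] for the smallest value of
    E((1 - lambda) * G_prev + lambda * g); the grid stands in for the exact
    minimization assumed in the text."""
    best_val, best_G = E(G_prev), G_prev          # lambda = 0 reproduces G_prev
    for g in dictionary:
        for lam in np.linspace(0.0, 1.0, n_grid):
            candidate = (1.0 - lam) * G_prev + lam * g
            val = E(candidate)
            if val < best_val:
                best_val, best_G = val, candidate
    return best_G

# Toy run: E(x) = ||x - x_star||_2^2 on R^3 with D = {+-e_j} and x_star in A_1(D).
d = 3
dictionary = [sign * np.eye(d)[j] for j in range(d) for sign in (+1.0, -1.0)]
x_star = np.array([0.3, -0.2, 0.1])               # ||x_star||_1 <= 1
E = lambda x: float(np.sum((x - x_star) ** 2))

G = np.zeros(d)
for m in range(20):
    G = rega_co_step(E, G, dictionary)
print(E(G))                                       # decreases toward E^* = 0
```

Since \(G_0=0\in A_1(\mathcal{D})\) and each update is a convex combination of \(G_{m-1}\) and a dictionary element, every iterate produced by this sketch stays in \(A_1(\mathcal{D})\), in agreement with the remark below.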

We note that the REGA(co) is a modification of the classical Frank–Wolfe algorithm [5]. For convenience, we have assumed the existence of a minimizing \(\varphi _m\) and \(\lambda _m\). However, we also analyze algorithms with only approximate implementation which avoids this assumption.

Observe that this algorithm is in a sense built for \(A_1(\mathcal{D})\) because each \(G_m\) is obviously in \(A_1(\mathcal{D})\). The next algorithm, called the \(E\)-Greedy algorithm with free relaxation (EGAFR(co)), makes some modifications in the relaxation step that will allow it to be applied to the more general unconstrained minimization problem on \(D_0\).

\(E\)-Greedy Algorithm with free relaxation (EGAFR(co)) We define \(G_0:= 0\). For \(m\ge 1\), assuming \(G_{m-1}\) has already been defined, we take \(\varphi _m \in \mathcal{D}, \alpha _m, \beta _m\in {\mathbb {R}}\) satisfying (assuming existence)

$$\begin{aligned} E(\alpha _mG_{m-1} + \beta _m\varphi _m) = \inf _{ \alpha ,\beta \in {\mathbb {R}};g\in \mathcal{D}}E(\alpha G_{m-1} + \beta g) \end{aligned}$$

and define

$$\begin{aligned} G_m:= \alpha _m G_{m-1} + \beta _m\varphi _m. \end{aligned}$$
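A companion sketch of an EGAFR(co) iteration, under the same illustrative finite-dimensional assumptions as the REGA(co) sketch above; the Nelder–Mead routine is our own stand-in for the exact two-dimensional minimization over \((\alpha ,\beta )\) assumed in the definition.

```python
import numpy as np
from scipy.optimize import minimize

def egafr_co_step(E, G_prev, dictionary):
    """One EGAFR(co) iteration: for each dictionary element g, minimize
    E(alpha * G_prev + beta * g) over (alpha, beta) in R^2 (Nelder-Mead
    stands in for the exact minimization) and keep the best candidate."""
    best_val, best_G = E(G_prev), G_prev      # (alpha, beta) = (1, 0) keeps G_prev
    for g in dictionary:
        obj = lambda ab, g=g: E(ab[0] * G_prev + ab[1] * g)
        res = minimize(obj, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
        if res.fun < best_val:
            best_val = res.fun
            best_G = res.x[0] * G_prev + res.x[1] * g
    return best_G
```

Since the pair \((\alpha ,\beta )=(1,0)\) is always admissible, each iteration can only decrease \(E\), which is the monotonicity property noted next; unlike in the REGA(co), the iterates are no longer confined to \(A_1(\mathcal{D})\).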

It is easy to see that each of these algorithms has the following monotonicity property:

$$\begin{aligned} E(G_0)\ge E(G_1) \ge E(G_2) \ge \cdots . \end{aligned}$$

Our main goal in this paper is to understand what can be said a priori about the convergence rate of a specific greedy optimization algorithm of the above form. Such results are built on two assumptions: (i) the smoothness of \(E\), (ii) assumptions that the minimum is attained at a point \(x^*\) satisfying a constraint such as the sparsity or \(\ell _1\) constraint. In what follows to measure the smoothness of \(E\), we introduce the modulus of smoothness

$$\begin{aligned} \rho (E,u):=\rho (E,S,u):=\frac{1}{2}\sup _{x\in S, \Vert y\Vert =1}|E(x+uy)+E(x-uy)-2E(x)|, \end{aligned}$$
(1.6)

of \(E\) on any given set \(S\). We say that \(E\) is uniformly smooth on \(S\) if \(\rho (E,S,u)/u\rightarrow 0\) as \(u\rightarrow 0\).
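For orientation, we recall a standard example (a Hilbert space computation, not specific to the setting of this paper): if \(X\) is a Hilbert space and \(E(x)=\Vert x\Vert ^2\), then for any \(x\) and any \(y\) with \(\Vert y\Vert =1\)

$$\begin{aligned} E(x+uy)+E(x-uy)-2E(x)=2u^2\Vert y\Vert ^2=2u^2, \end{aligned}$$

so that \(\rho (E,S,u)=u^2\) for every set \(S\); that is, the assumption \(\rho (E,u)\le \gamma u^q\) used below holds with \(q=2\) and \(\gamma =1\).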

The following theorem for REGA(co) is a prototype of the results proved in this paper.

Theorem 1.1

Let \(\displaystyle {E^*:=\inf _{x\in A_1(\mathcal{D})}E(x)}\).

  1. (i)

    If \(E\) is uniformly smooth on \( A_1(\mathcal{D})\), then the REGA(co) converges:

    $$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =E^*. \end{aligned}$$
    (1.7)
  2. (ii)

    If in addition, \(\rho (E,A_1(\mathcal{D}), u) \le \gamma u^q, 1<q\le 2\), then

    $$\begin{aligned} E(G_m)-E^*\le C(q,\gamma )m^{1-q}, \end{aligned}$$
    (1.8)

with a positive constant \(C(q,\gamma )\) which depends only on \(q\) and \(\gamma \).

The case \(q=2\) of this theorem was proved in [20]. We prove this theorem in Sect. 2.

As we have already noted, the EGAFR(co) is designed to solve the unconstrained minimization problem where the domain \(D=X\). The performance of this algorithm will depend not only on the smoothness of \(E\) but also on the compressibility of a point \(x^*\in D^*\) where \(E\) takes its minimum. To quantify this compressibility, we introduce

$$\begin{aligned} A(\epsilon ):=A(E,\epsilon ):= \inf \{M: \exists y\in \mathcal{L}_M \ \mathrm{such\ that\ } \, E(y)-E^*\le \epsilon \}. \end{aligned}$$
(1.9)

An equivalent way to quantify this compressibility is the error

$$\begin{aligned} e(E,M):= \inf _{ y\in \mathcal{L}_M} E(y)-E^*. \end{aligned}$$
(1.10)

Notice that the functions \(A\) and \(e\) are pseudo-inverses of one another.

The following theorem states the convergence properties of the EGAFR(co).

Theorem 1.2

Let \(E\) be uniformly smooth on \(X\) and let \(\displaystyle {E^*:=\inf _{x\in X}E(x)=\inf _{x\in D_0}E(x)}\).

  1. (i)

    The EGAFR(co) converges:

    $$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in X}E(x)=\inf _{x\in D_0} E(x)=E^*. \end{aligned}$$
  2. (ii)

    If the modulus of smoothness of \(E\) satisfies \(\rho (E,u)\le \gamma u^q, 1<q\le 2\), then, the EGAFR(co) satisfies

    $$\begin{aligned} E(G_m)-E^* \le C(E,q,\gamma ) \epsilon _m, \end{aligned}$$
    (1.11)

where

$$\begin{aligned} \epsilon _m:= \inf \{\epsilon :\ A(\varepsilon )^q m^{1-q}\le \epsilon \}. \end{aligned}$$
(1.12)

In particular, if for some \(r>0\), we have \(e(E,M)\le \tilde{\gamma }M^{-r}\), for all \(M\ge 1\), then

$$\begin{aligned} E(G_m)-E^* \le C(E,q,\gamma ,\tilde{\gamma }, r) m^{\frac{1-q}{1+q/r}}. \end{aligned}$$
(1.13)
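For the reader's convenience, here is the elementary computation behind (1.13), ignoring the inessential issue of whether the infima in (1.9) and (1.10) are attained. For \(\epsilon \le \tilde{\gamma }\) (the relevant range as \(m\rightarrow \infty \)), the assumption \(e(E,M)\le \tilde{\gamma }M^{-r}\) gives \(A(\epsilon )\le (\tilde{\gamma }/\epsilon )^{1/r}\), so the condition \(A(\varepsilon )^q m^{1-q}\le \epsilon \) in (1.12) is satisfied as soon as

$$\begin{aligned} \left( \frac{\tilde{\gamma }}{\epsilon }\right) ^{q/r}m^{1-q}\le \epsilon ,\quad \mathrm{i.e.,}\quad \epsilon \ge \tilde{\gamma }^{\frac{q}{q+r}}m^{\frac{1-q}{1+q/r}}. \end{aligned}$$

Therefore, for all sufficiently large \(m\), \(\epsilon _m\le \tilde{\gamma }^{\frac{q}{q+r}}m^{\frac{1-q}{1+q/r}}\), which combined with (1.11) gives (1.13).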

We note that the EGAFR(co) is a modification of the weak greedy algorithm with free relaxation (WGAFR(co)) studied in [17]. In the WGAFR(co), we first choose the dictionary direction and then optimize over a two-dimensional subspace. More precisely, we perform the following two steps at the \(m\)th iteration.

  1. (1)

    Choose \(\varphi _m \in {\mathcal {D}}\) as any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal {D}}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$
  2. (2)

    Find \(w_m\) and \( \lambda _m\) such that

    $$\begin{aligned} E((1-w_m)G_{m-1} + \lambda _m\varphi _m) = \inf _{ \lambda ,w}E((1-w)G_{m-1} + \lambda \varphi _m) \end{aligned}$$

and define

$$\begin{aligned} G_m:= (1-w_m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$

Also note that if \(x^*\in \mathcal{L}_M\) then the estimate in Theorem 1.2 reads

$$\begin{aligned} E(G_m)-E^* \le C(E,q,\gamma ) M^qm^{1-q}. \end{aligned}$$
(1.14)

We show in the following section how Theorems 1.1 and 1.2 are easily proven using existing results for greedy algorithms. We also introduce and analyze another greedy algorithm for convex minimization.

The most important results of the present paper are in Sect. 3 and are motivated by numerical considerations. Very often, we cannot calculate the values of \(E\) exactly. Even if we can evaluate \(E\) exactly, we may not be able to find the exact value of, say, the quantity

$$\begin{aligned} \inf _{0\le \lambda \le 1;g\in \mathcal{D}} E((1-\lambda )G_{m-1}+\lambda g) \end{aligned}$$

in the REGA(co). This motivates us to study in Sect. 3 various modifications of the above algorithms. For example, the following algorithm, which is an approximate variant of the REGA(co), was introduced in [20].

Relaxed \(E\)-Greedy Algorithm with error \(\delta \) (REGA( \(\delta \) )) Let \(\delta \in (0,1]\). We define \(G_0:=0\). Then, for each \(m\ge 1\) we have the following inductive definition: We take any \(\varphi _m \in \mathcal{D}\) and \(0\le \lambda _m \le 1\) satisfying

$$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) \le \inf _{0\le \lambda \le 1;g\in \mathcal{D}}E((1-\lambda )G_{m-1} + \lambda g) +\delta \end{aligned}$$

and define

$$\begin{aligned} G_m:= (1-\lambda _m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$

In Sect. 3, we give modifications of this type to the above algorithms and then prove convergence results for these modifications. For example, the following convergence result is proven for the \(\hbox {REGA}(\delta )\).

Theorem 1.3

Let \(E\) be a uniformly smooth on \(A_1(\mathcal{D})\) convex function with modulus of smoothness \(\rho (E,u) \le \gamma u^q, 1<q\le 2\). Then, for the \(\hbox {REGA}(\delta )\) we have

$$\begin{aligned} E(G_m)-E^* \le C(q,\gamma ,E) m^{1-q}, \quad m\le \delta ^{-1/q}, \end{aligned}$$

where \(\displaystyle {E^*:=\inf _{x\in A_1(\mathcal{D})}E(x)}\).

In the case \(q=2\), Theorem 1.3 was proved in [20]. We note that our analysis is different from that in [20].

In the REGA(co) and the \(\hbox {REGA}(\delta )\), we solve the univariate convex optimization problem with respect to \(\lambda \)

$$\begin{aligned} \inf _{0\le \lambda \le 1}E((1-\lambda )G_{m-1} + \lambda g), \end{aligned}$$
(1.15)

respectively, exactly and with an error \(\delta \). It is well known (see [10]) that there are fast algorithms to solve problem (1.15) approximately. We discuss some of them in Sect. 4.

In the EGAFR(co) and the \(\hbox {EGAFR}(\delta )\) (see Sect. 3 for this algorithm), we solve the convex optimization problem for a function of two variables

$$\begin{aligned} \inf _{\lambda ,w}E((1-w)G_{m-1} + \lambda g), \end{aligned}$$
(1.16)

respectively, exactly and with an error \(\delta \). We describe in Sect. 5 how univariate optimization algorithms can be used for approximate solution of (1.16).

2 Analysis of Greedy Algorithms

We begin this section by showing how to prove the results for REGA(co) and EGAFR(co) stated in the introduction, namely Theorems 1.1 and 1.2. The proof of convergence results for greedy algorithms is typically done by establishing a recursive inequality for the error \(E(G_n)-E^*\). To analyze the decay of this sequence of errors, we will need the following lemma.

Lemma 2.1

If a sequence \(a_m, m\ge 0\), of nonnegative numbers satisfies

$$\begin{aligned} a_m\le a_{m-1}(1-ca_{m-1}^{p}),\quad m\ge 1, \end{aligned}$$
(2.1)

with \(c>0\) and \(p>0\), then

$$\begin{aligned} a_n\le Cn^{-1/p},\quad n\ge 1, \end{aligned}$$
(2.2)

with the constant \(C\) depending only on \(p\) and \(c\). In the case \(p\ge 1\) we have \(C\le c^{-1/p}\).

Proof

In the case \(p\ge 1\), which is the one used in this paper, this follows from Lemma 2.16 of [16]; in this case, Lemma 2.1 has often been used in greedy approximation in Banach spaces (see [16], Chapter 6). For the general case \(p>0\), see Lemma 4.2 of [12]. \(\square \)
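A quick numerical illustration of Lemma 2.1 (our own sanity check, not part of the proof): iterating the extremal case of (2.1) with \(p=1\) and comparing with the bound \(c^{-1/p}n^{-1/p}\) of (2.2).

```python
# Iterate a_m = a_{m-1} * (1 - c * a_{m-1}**p), the extremal case of (2.1),
# and compare the result with the bound C * n**(-1/p), C = c**(-1/p), of (2.2).
c, p, a = 0.5, 1.0, 1.0
n = 10000
for _ in range(n):
    a = a * (1.0 - c * a ** p)
print(a, (1.0 / c) ** (1.0 / p) * n ** (-1.0 / p))   # first value stays below the second
```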

To establish a recursive inequality for the error in REGA(co), we will use the following lemma about REGA(co).

Lemma 2.2

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Then, for any \(f\in A_1(\mathcal{D})\) and the iterations \(G_m\) of the REGA(co), we have

$$\begin{aligned} E(G_m) \le E(G_{m-1} )+ \inf _{0\le \lambda \le 1}(-\lambda (E(G_{m-1} )-E(f))+ 2\rho (E, 2\lambda )), \quad m=1,2,\ldots . \end{aligned}$$
(2.3)

Proof

A similar result was proved in Lemma 3.1 of [17] for a different greedy algorithm, denoted by WRGA(co) in [17]. In order to distinguish the two algorithms, we denote by \(\bar{G}_m\) the output of WRGA(co). The relaxation step in WRGA(co) is exactly the same as in our REGA(co). However, the choice of direction \(\bar{\varphi }_m\) in WRGA(co) was based on a maximal gradient descent. This means that at each step the iterate \(\bar{G}_{m-1}\) is also possibly different from our \(G_{m-1}\) of REGA(co). However, an examination of the proof of Lemma 3.1 shows that it does not matter what \(\bar{G}_{m-1}\) is, as long as it is in \(\Sigma _{m-1}(\mathcal{D})\). So Lemma 3.1 holds for our \(G_{m-1}\), and if we let \(\tilde{G}_m\) denote the result of applying WRGA(co) to our \(G_{m-1}\), then we have

$$\begin{aligned} E(G_m)\le E(\tilde{G}_m)\le E(G_{m-1} )+ \inf _{0\le \lambda \le 1}(-\lambda (E(G_{m-1} )-E(f))+ 2\rho (E, 2\lambda )). \end{aligned}$$
(2.4)

Here, the first inequality holds because REGA(co) minimizes the error over all choices of directions \(\varphi \) from the dictionary and all choices of the relaxation parameter and thereby is at least as good as the choice made by WRGA(co). The second inequality is from Lemma 3.1 of [17]. Thus, we have proven the lemma. \(\square \)

Proof of Theorem 1.1

The proof of this theorem is similar to the proofs of Theorem 3.1 and Theorem 3.2 in [17]. We illustrate the proof of (1.8). Setting \(a_m:=E(G_m)-E^*\) and subtracting \(E^*\) from both sides of (2.3) gives the recursive inequality

$$\begin{aligned} a_m\le a_{m-1} + \inf _{0\le \lambda \le 1} \{-\lambda a_{m-1}+2\gamma (2\lambda )^q\}. \end{aligned}$$
(2.5)

If we choose \(\lambda \) to satisfy

$$\begin{aligned} \lambda a_{m-1}=4\gamma (2\lambda )^q \end{aligned}$$
(2.6)

provided this value is not \({>}1\), and choose \(\lambda =1\) otherwise. Using this value in (2.5), we obtain, in the case \(\lambda \le 1\),

$$\begin{aligned} a_m\le a_{m-1}\left( 1-ca_{m-1}^{\frac{1}{q-1}}\right) , \end{aligned}$$
(2.7)

with \(c>0\) a constant depending only on \(\gamma \) and \(q\). This recursive inequality then gives the decay announced in Theorem 1.1 because of Lemma 2.1. The case \(\lambda =1\) can be treated as in the proof of Theorem 3.2 from [17]. \(\square \)

Proof of Theorem 1.2

This proof is derived from results in [17] in a way similar to how we proved Theorem 1.1 for REGA(co). An algorithm, called WGAFR(co), was introduced in [17] which differs from EGAFR(co) only in how each \(\varphi _m\) is chosen. One then uses the analysis of the WGAFR(co). Also, part (ii) of Theorem 1.2 follows from Theorem 3.8 with \(\delta =0\). \(\square \)

The above-discussed algorithms REGA(co) and EGAFR(co) provide sparse approximate solutions to the corresponding optimization problems. These approximate solutions are sparse with respect to the given dictionary \(\mathcal{D}\), but they are not obtained as an expansion with respect to \(\mathcal{D}\). This means that at each iteration of these algorithms we update all the coefficients of the sparse approximants. Sometimes it is important to build an approximant in the form of an expansion with respect to \(\mathcal{D}\). The reader can find a discussion of greedy expansions in [16, Section 6.7]. For comparison with the algorithms we have already introduced, we recall a greedy-type algorithm for unconstrained optimization which uses only function values and builds sparse approximants in the form of an expansion; it was introduced and analyzed in [18]. Let \(\mathcal{C}:=\{c_m\}_{m=1}^\infty \) be a fixed sequence of positive numbers.

\(E\)-Greedy Algorithm with coefficients \(\mathcal{C}(\mathbf {EGA}(\mathcal{C}))\) We define \(G_0:=0\). Then, for each \(m\ge 1\) we have the following inductive definition:

  1. (i)

    Let \(\varphi _m\in \mathcal{D}\) be such that (assuming existence)

    $$\begin{aligned} E(G_{m-1}+c_m\varphi _m)=\inf _{g\in \mathcal{D}}E(G_{m-1}+c_m g). \end{aligned}$$
  2. (ii)

    Then define

    $$\begin{aligned} G_m:=G_{m-1}+c_m\varphi _m. \end{aligned}$$

In the above definition, we can restrict ourselves to positive numbers because of the symmetry of the dictionary \(\mathcal{D}\).
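A minimal sketch of an \(\hbox {EGA}(\mathcal{C})\) iteration in the same illustrative finite-dimensional setting used earlier, with coefficients \(c_k=ck^{-s}\) as in Theorem 2.4 below; the particular values of \(c\), \(s\), and \(E\) are our own choices for the toy run. Note that, in contrast with the REGA(co) and EGAFR(co) sketches, each iteration here costs exactly \(N=\#\mathcal{D}\) function evaluations and involves no inner optimization.

```python
import numpy as np

def ega_step(E, G_prev, dictionary, c_m):
    """One EGA(C) iteration: with the relaxation coefficient c_m fixed in
    advance, pick the dictionary element g minimizing E(G_prev + c_m * g)
    (function evaluations only) and update G."""
    values = [E(G_prev + c_m * g) for g in dictionary]
    g_best = dictionary[int(np.argmin(values))]
    return G_prev + c_m * g_best

# Toy run with c_k = c * k**(-s), s = 2 / (1 + q); here we take q = 2, so s = 2/3.
d = 3
dictionary = [sign * np.eye(d)[j] for j in range(d) for sign in (+1.0, -1.0)]
x_star = np.array([0.3, -0.2, 0.1])
E = lambda x: float(np.sum((x - x_star) ** 2))

G, c, s = np.zeros(d), 0.2, 2.0 / 3.0
for k in range(1, 201):
    G = ega_step(E, G, dictionary, c * k ** (-s))
print(E(G))              # approaches E^* = 0, but more slowly than REGA(co)
```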

For the analysis of this algorithm, we will assume that the sets

$$\begin{aligned} D_C:=\{x:E(x)\le E(0)+C\} \end{aligned}$$

are bounded for all finite \(C\). We recall two results for the \(\hbox {EGA}(\mathcal{C})\) that were proved in [18].

Theorem 2.3

Let \(\mu (u)=o(u)\) as \(u\rightarrow 0\) and let \(E\) be a uniformly smooth convex function satisfying

$$\begin{aligned} E(x+uy)-E(x)-u\langle E'(x),y\rangle \le 2\mu (u), \end{aligned}$$
(2.8)

for \( x\in D_2, \Vert y\Vert =1, |u|\le 1\). Assume that the coefficients sequence \(\mathcal{C}:=\{c_j\}, c_j\in [0,1]\) satisfies the conditions

$$\begin{aligned}&\sum _{k=1}^\infty c_k =\infty ,\end{aligned}$$
(2.9)
$$\begin{aligned}&\sum _{k=1}^\infty \mu (c_k)\le 1. \end{aligned}$$
(2.10)

Then, for each dictionary \(\mathcal{D}\), the \(\hbox {EGA}(\mathcal{C})\) satisfies

$$\begin{aligned} \lim _{m\rightarrow \infty }E(G_m)=\inf _{x\in X}E(x)=:E^*. \end{aligned}$$

Theorem 2.4

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\le \gamma u^q, q\in (1,2]\) on \(D_2\). We set \(s:=\frac{2}{1+q}\) and \(\mathcal{C}_s:=\{ck^{-s}\}_{k=1}^\infty \) with \(c\) chosen in such a way that \(\gamma c^q \sum \nolimits _{k=1}^\infty k^{-sq} \le 1\). Then, the \(\hbox {EGA}(\mathcal{C}_s)\) converges with the following rate: for any \(r\in (0,1-s)\)

$$\begin{aligned} E(G_m)-\inf _{x\in A_1(\mathcal{D})}E(x) \le C(r,q,\gamma )m^{-r}. \end{aligned}$$

Let us now turn to a brief comparison of the above algorithms and their known convergence rates. The REGA(co) is designed for solving optimization problems on domains \(D\subset A_1(\mathcal{D})\) and requires that \(D^*\cap A_1(\mathcal{D})\ne \emptyset \). The EGAFR(co) is not limited to \(A_1(\mathcal{D})\) but applies to any optimization domain as long as \(E\) achieves its minimum on a bounded domain. As we have noted earlier, if \(D^*\cap A_1(\mathcal{D})\ne \emptyset \), then EGAFR(co) provides the same convergence rate \((O(m^{1-q}))\) as REGA(co). Thus, EGAFR(co) is more robust and requires the solution of only a slightly more involved minimization problem at each iteration.

The advantage of \(\hbox {EGA}(\mathcal{C})\) is that it solves a simpler minimization problem at each iteration since the relaxation parameters are set in advance. However, it requires knowledge of the smoothness order \(q\) of \(E\) and also gives a poorer rate of convergence than REGA(co) and the EGAFR(co).

To continue this discussion, let us consider the very special case where \(X=\ell ^d_p\) and the dictionary \(\mathcal{D}\) is finite, say \(\mathcal{D}=\{g_j\}_{j=1}^N\). In such a case, the existence of \(\varphi _m\) in all the above algorithms is easily proven. The \(\hbox {EGA}(\mathcal{C})\) simply uses \(Nm\) function evaluations to make \(m\) iterations. The REGA(co) solves a one-dimensional optimization problem at each iteration for each dictionary element, thus \(N\) such problems. We discuss this problem in Sect. 4 and show that each such problem can be solved with exponential accuracy with respect to the number of evaluations needed from \(E\).

3 Approximate Greedy Algorithms for Convex Optimization

We turn now to the main topic of this paper, which is modifications of the above greedy algorithms to allow imprecise calculations or less strenuous choices for descent directions and relaxation parameters. We begin with a discussion of the weak relaxed greedy algorithm WRGA(co), which was introduced and analyzed in [17] and which we already referred to in Sect. 2. The WRGA(co) uses the gradient to choose a steepest descent direction at each iteration. The interesting aspect of WRGA(co), relative to imprecise calculations, is that it uses a weakness parameter \(t_m<1\) to allow some relative error in estimating \(\sup _{g\in \mathcal{D}} \langle -E'(G_{m-1}),g - G_{m-1}\rangle \). Here and below we use a convenient bracket notation: for a functional \(F\in X^*\) and an element \(f\in X\), we write \(F(f)= \langle F,f\rangle \). We concentrate on a modification of the second step of WRGA(co). Very often, we cannot calculate values of \(E\) exactly. Even in the case when we can evaluate \(E\) exactly, we may not be able to find the exact value of \(\inf _{0\le \lambda \le 1}E((1-\lambda )G_{m-1} + \lambda \varphi _m)\). This motivates us to study the following modification of the WRGA(co). Let \(\tau :=\{t_k\}_{k=1}^\infty , t_k\in [0,1], k=1,2,\ldots \), be a weakness sequence.

Weak relaxed Greedy Algorithm with error \(\delta (\mathbf {WRGA}(\delta ))\). Let \(\delta \in (0,1]\). We define \(G_0:= 0\). Then, for each \(m\ge 1\), we have the following inductive definition.

  1. (1)

\(\varphi _m:= \varphi ^{\delta ,\tau }_m \in \mathcal{D}\) is taken to be any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m - G_{m-1}\rangle \ge t_m \sup _{g\in \mathcal{D}} \langle -E'(G_{m-1}),g - G_{m-1}\rangle . \end{aligned}$$
  2. (2)

    Then \(0\le \lambda _m \le 1\) is chosen as any number such that

    $$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) \le \inf _{0\le \lambda \le 1}E((1-\lambda )G_{m-1} + \lambda \varphi _m)+\delta . \end{aligned}$$

With these choices, we define

$$\begin{aligned} G_m:= (1-\lambda _m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$
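A sketch of a \(\hbox {WRGA}(\delta )\) iteration in the finite-dimensional setting; the callable grad_E returning \(E'(x)\) as a vector is an assumption of this sketch (unlike the \(\hbox {REGA}(\delta )\), this algorithm does use the gradient in step (1)). The exact argmax below satisfies step (1) for any weakness parameter \(t_m\le 1\), and the grid search realizes step (2) with some \(\delta >0\) controlled by the grid spacing.

```python
import numpy as np

def wrga_delta_step(E, grad_E, G_prev, dictionary, n_grid=200):
    """One WRGA(delta) iteration in R^d.  Step (1): pick the dictionary
    element maximizing <-E'(G_{m-1}), g - G_{m-1}>.  Step (2): choose
    lambda in [0, 1] by a grid search, which gives a delta-approximate
    minimizer for some delta depending on the grid spacing."""
    minus_grad = -grad_E(G_prev)
    scores = [float(np.dot(minus_grad, g - G_prev)) for g in dictionary]
    phi = dictionary[int(np.argmax(scores))]
    lambdas = np.linspace(0.0, 1.0, n_grid)
    values = [E((1.0 - lam) * G_prev + lam * phi) for lam in lambdas]
    lam = lambdas[int(np.argmin(values))]
    return (1.0 - lam) * G_prev + lam * phi
```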

Thus, this algorithm differs from the \(\hbox {REGA}(\delta )\) given in the introduction, only in the choice of the direction \(\varphi _m\) at each step. Both of these algorithms are directed at solving the minimization of \(E\) over \(A_1(\mathcal{D})\). The following theorem analyzes the \(\hbox {WRGA}(\delta )\).

Theorem 3.1

Let \(E\) be uniformly smooth on \(A_1(\mathcal{D})\) with modulus of smoothness \(\rho (E,u)\) satisfying

$$\begin{aligned} \rho (E,u) \le \gamma u^q, \quad 1<q\le 2. \end{aligned}$$
(3.1)

If the weakness sequence \(\tau :=\{t_k\}_{k=1}^\infty \) is such that \(t_k =t, k=1,2,\ldots ,\) then the \(\hbox {WRGA}(\delta )\) satisfies

$$\begin{aligned} E(G_m)-E^* \le C(q,\gamma ,t,E) m^{1-q}, \quad m\le \delta ^{-1/q}, \end{aligned}$$
(3.2)

where \(\displaystyle {E^*:=\inf _{x\in A_1(\mathcal{D})}E(x)}\).

We develop next some results which will be used to prove this theorem. Let us first note that when \(E\) is Fréchet differentiable, the convexity of \(E\) implies that for any \(x,y\)

$$\begin{aligned} E(y)\ge E(x)+\langle E'(x),y-x\rangle \end{aligned}$$
(3.3)

or, in other words,

$$\begin{aligned} E(x)-E(y) \le \langle E'(x),x-y\rangle = \langle -E'(x),y-x\rangle . \end{aligned}$$
(3.4)

The following simple lemma holds.

Lemma 3.2

Let \(E\) be a Fréchet differentiable convex function. Then, the following inequality holds for \(x\in S\):

$$\begin{aligned} 0\le E(x+uy)-E(x)-u\langle E'(x),y\rangle \le 2\rho (E,u\Vert y\Vert ). \end{aligned}$$
(3.5)

We use these remarks to prove the following.

Lemma 3.3

Let \(E\) be uniformly smooth on \(A_1(\mathcal{D})\) with modulus of smoothness \(\rho (E,u)\). Then, for any \(f\in A_1(\mathcal{D})\), we have that the \(\hbox {WRGA}(\delta )\) satisfies

$$\begin{aligned} E(G_m)&\le E(G_{m-1} )+ \inf _{0\le \lambda \le 1}(-\lambda t_m (E(G_{m-1} )-E(f))\\&\quad +\, 2\rho (E, 2\lambda ))+\delta , \quad m=1,2,\ldots \end{aligned}$$

and therefore

$$\begin{aligned} E(G_m)-E^*&\le E(G_{m-1} )-E^* + \inf _{0\le \lambda \le 1}(-\lambda t_m (E(G_{m-1} )-E^*)+ 2\rho (E, 2\lambda ))\nonumber \\&\quad +\,\delta , \quad m=1,2,\ldots \end{aligned}$$
(3.6)

where \(E^*:=\inf _{x\in A_1(\mathcal{D})}E(x)\).

Proof

We have

$$\begin{aligned} G_m := (1-\lambda _m)G_{m-1}+\lambda _m\varphi _m = G_{m-1}+\lambda _m(\varphi _m-G_{m-1}) \end{aligned}$$

and from the definition of \(\lambda _m\),

$$\begin{aligned} E(G_m) \le \inf _{0\le \lambda \le 1}E(G_{m-1}+\lambda (\varphi _m-G_{m-1}))+\delta . \end{aligned}$$

By Lemma 3.2 we have for any \(\lambda \)

$$\begin{aligned} E(G_{m-1}+\lambda (\varphi _m-G_{m-1}))&\le E(G_{m-1})- \lambda \langle -E'(G_{m-1}),\varphi _m-G_{m-1}\rangle \nonumber \\&\quad +\, 2 \rho (E,2\lambda ) \end{aligned}$$
(3.7)

and by step (1) in the definition of the \(\hbox {WRGA}(\delta )\) and Lemma 2.2 from [17] (see also Lemma 6.10, p. 343 of [16]) we get

$$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m-G_{m-1}\rangle&\ge t_m \sup _{g\in \mathcal{D}} \langle -E'(G_{m-1}),g-G_{m-1}\rangle \nonumber \\&= t_m\sup _{\phi \in A_1(\mathcal{D})} \langle -E'(G_{m-1}),\phi -G_{m-1}\rangle \nonumber \\&\ge t_m \langle -E'(G_{m-1}),f-G_{m-1}\rangle . \end{aligned}$$

From (3.4), we obtain

$$\begin{aligned} \langle -E'(G_{m-1}),f-G_{m-1}\rangle \ge E(G_{m-1})-E(f). \end{aligned}$$

Thus,

$$\begin{aligned} E(G_m)&\le \inf _{0\le \lambda \le 1} E(G_{m-1}+ \lambda (\varphi _m-G_{m-1})) +\delta \nonumber \\&\le E(G_{m-1}) + \inf _{0\le \lambda \le 1}(-\lambda t_m (E(G_{m-1})-E(f)) + 2\rho (E,2\lambda ))+\delta ,\nonumber \\ \end{aligned}$$
(3.8)

which proves the lemma. \(\square \)

Finally, for the proof of Theorem 3.1, we will need the following result about sequences.

Lemma 3.4

If a nonnegative sequence \(a_0,a_1,\ldots ,a_N\) satisfies

$$\begin{aligned} a_m\le a_{m-1} +\inf _{0\le \lambda \le 1}(-\lambda va_{m-1}+B\lambda ^q)+\delta , \quad B>0,\quad \delta \in (0,1],\quad 0<v\le 1,\nonumber \\ \end{aligned}$$
(3.9)

for \(m\le N:=[\delta ^{-1/q}], q\in (1,2]\), then

$$\begin{aligned} a_m\le C(q,v,B,a_0) m^{1-q},\quad m\le N, \end{aligned}$$
(3.10)

with \(C(q,v,B,a_0) \le C'(q,B,a_0)v^{-q}.\)

Proof

Taking \(\lambda =0\) in (3.9), we see that

$$\begin{aligned} a_m\le a_{m-1}+\delta ,\quad m\le N. \end{aligned}$$
(3.11)

Therefore, for all \(m\le N\) we have

$$\begin{aligned} a_m\le a_0+N\delta \le a_0+1, \quad 0\le m\le N. \end{aligned}$$

Now fix any value of \(m\in [0,N]\) and define \(\lambda _1:= \left( \frac{va_{m-1}}{2B}\right) ^{\frac{1}{q-1}}\), so that

$$\begin{aligned} \lambda _1 v a_{m-1}= 2B\lambda _1^q. \end{aligned}$$
(3.12)

If \(\lambda _1\le 1\) then

$$\begin{aligned} \inf _{0\le \lambda \le 1}(-\lambda va_{m-1}+ B\lambda ^q)&\le -\lambda _1 va_{m-1}+ B\lambda _1^q \nonumber \\&= -\frac{1}{2}\lambda _1va_{m-1} = -C_1(q,B)v^pa_{m-1}^p,\quad p:=\frac{q}{q-1}. \end{aligned}$$

If \(\lambda _1> 1\) then for all \(\lambda \le \lambda _1\) we have \(\lambda v a_{m-1}> 2B\lambda ^q\) and specifying \(\lambda =1\) we get

$$\begin{aligned} \inf _{0\le \lambda \le 1}(-\lambda va_{m-1}+B\lambda ^q)&\le -\frac{1}{2}va_{m-1}\nonumber \\&\le -\frac{1}{2}va_{m-1}^p(a_0+1)^{1-p} = -C_1(q,a_0)va_{m-1}^p. \end{aligned}$$

Thus, in any case, setting \(C_2:=C_2(q,v,B,a_0):=\min (C_1(q,B)v^p,C_1(q,a_0)v)\) we obtain from (3.9)

$$\begin{aligned} a_m\le a_{m-1}- C_2a_{m-1}^p +\delta ,\quad C_2 \ge C_2'(q,B,a_0)v^p, \end{aligned}$$
(3.13)

holds for all \(0\le m\le N\).

Now, to establish (3.10), we let \(n\in [0,N]\) be the smallest integer such that

$$\begin{aligned} C_2a_{n-1}^p \le 2\delta . \end{aligned}$$
(3.14)

If there is no such \(n\), we set \(n=N\). In view of (3.13), we have

$$\begin{aligned} a_m\le a_{m-1}-(C_2/2)a_{m-1}^p, \quad 1\le m\le n. \end{aligned}$$
(3.15)

If we modify the sequence \(a_m\) by defining it to be zero if \(m>n\), then this modified sequence satisfies (3.15) for all \(m\) and Lemma 2.1 gives

$$\begin{aligned} a_m\le C_3 m^{1-q},\quad 1\le m\le n, \quad C_3\le C_3'(q,B,a_0)v^{-q}. \end{aligned}$$
(3.16)

If \(n=N\), we have finished the proof. If \(n<N\), then, by (3.11), we obtain for \(m\in [n,N]\)

$$\begin{aligned} a_m\le a_{n-1} +(m-n+1)\delta \le a_{n-1}+N\delta \le a_{n-1}+ NN^{-q}\le \left[ \frac{2\delta }{C_2}\right] ^{1/p}+C_4N^{1-q}, \end{aligned}$$

where we have used the definition of \(N\). Since \(\delta ^{1/p}\le N^{-q/p}=N^{-q+1}\), we have

$$\begin{aligned} a_m \le C_5N^{1-q}\le C_5m^{1-q},\quad n\le m\le N,\quad C_5\le C_5' v^{-1}, \end{aligned}$$

where \(C_5'\) depends only on \(q,B,a_0\). This completes the proof of the lemma. \(\square \)

Proof of Theorem 3.1

We take

$$\begin{aligned} a_n:=E(G_n)-E^* \ge 0. \end{aligned}$$

Then, taking into account that \(\rho (E,u)\le \gamma u^q\), we get from Lemma 3.3

$$\begin{aligned} a_m\le a_{m-1} +\inf _{0\le \lambda \le 1}(-\lambda ta_{m-1}+2\gamma (2\lambda )^q)+\delta . \end{aligned}$$
(3.17)

Applying Lemma 3.4 with \(v=t, B=2^{1+q}\gamma \) we complete the proof of Theorem 3.1. \(\square \)

We can establish a similar convergence result for the \(\hbox {REGA}(\delta )\).

Theorem 3.5

Let \(E\) be a uniformly smooth on \(A_1(\mathcal{D})\) convex function with modulus of smoothness \(\rho (E,u) \le \gamma u^q, 1<q\le 2\). Then, for the \(\hbox {REGA}(\delta )\) we have

$$\begin{aligned} E(G_m)-E^* \le C(q,\gamma ,E) m^{1-q}, \quad m\le \delta ^{-1/q}, \end{aligned}$$

where \(\displaystyle {E^*:=\inf _{x\in A_1(\mathcal{D})}E(x)}\).

Proof

From the definition of the \(\hbox {REGA}(\delta )\), we have

$$\begin{aligned} E(G_m) \le \inf _{0\le \lambda \le 1;g\in \mathcal{D}}E((1-\lambda )G_{m-1} + \lambda g) +\delta . \end{aligned}$$

In the same way that we have proved (2.3), we obtain

$$\begin{aligned} E(G_m) \le E(G_{m-1} )+ \inf _{0\le \lambda \le 1}(-\lambda (E(G_{m-1} )-E^*)+ 2\rho (E, 2\lambda )) +\delta . \end{aligned}$$
(3.18)

Inequality (3.18) is of the same form as inequality (3.6) from Lemma 3.3. Thus, repeating the above proof of Theorem 3.1, we complete the proof of Theorem 3.5. \(\square \)

We now introduce and analyze an approximate version of the WGAFR(co).

Weak Greedy algorithm with free relaxation and error \(\delta (\mathbf {WGAFR}(\delta ))\). Let \(\tau :=\{t_m\}_{m=1}^\infty , t_m\in [0,1]\), be a weakness sequence. We define \(G_0 := 0\). Then, for each \(m\ge 1\), we have the following inductive definition.

  1. (1)

    \(\varphi _m \in \mathcal{D}\) is any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in \mathcal{D}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$
    (3.19)
  2. (2)

    Find \(w_m\) and \( \lambda _m\) such that

    $$\begin{aligned} E((1-w_m)G_{m-1} + \lambda _m\varphi _m) \le \inf _{ \lambda ,w}E((1-w)G_{m-1} + \lambda \varphi _m) +\delta \end{aligned}$$

    and define

    $$\begin{aligned} G_m:= (1-w_m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$

Theorem 3.6

Let \(E\) be a uniformly smooth convex function on \(X\) with modulus of smoothness \(\rho (E,D_1,u)\le \gamma u^q, 1<q\le 2\) and let \(\displaystyle {E^*:=\inf _{x\in X}E(x)=\inf _{x\in D_0}E(x)}\). Then, for the \(\hbox {WGAFR}(\delta )\), we have

$$\begin{aligned} E(G_m)-E^* \le C(E,q,\gamma ) \epsilon _m,\quad m\le \delta ^{-1/q} \end{aligned}$$
(3.20)

where

$$\begin{aligned} \epsilon _m:= \inf \{\epsilon :\ A(\varepsilon )^q m^{1-q}\le \epsilon \} \end{aligned}$$
(3.21)

and \(A(\epsilon )\) is defined by (1.9).

Proof

We begin with a lemma.

Lemma 3.7

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\) on \(D\subset D_1\). Take a number \(\varepsilon \ge 0\) and an element \(f^\varepsilon \) from \(D \) such that

$$\begin{aligned} E(f^\varepsilon ) \le \inf _{x\in D}E(x)+ \varepsilon ,\quad f^\varepsilon /B \in A_1({\mathcal {D}}), \end{aligned}$$

with some number \(B\ge 1\). Suppose that \(G_{m-1}\in D\subset D_1\) and \(\varphi _m\in {\mathcal {D}}\) is any element satisfying

$$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m\sup _{g\in {\mathcal {D}}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$

Then, we have

$$\begin{aligned}&\inf _{\lambda \ge 0,w}E(G_{m-1}-wG_{m-1}+\lambda \varphi _m)-E(f^\varepsilon ) \le E(G_{m-1})-E(f^\varepsilon )\nonumber \\&\quad +\,\inf _{\lambda \ge 0}(-\lambda t_mB^{-1}(E(G_{m-1})-E(f^\varepsilon )) + 2\rho (E,C_0\lambda )), \end{aligned}$$
(3.22)

for \( m=1,2,\ldots \).

Proof

We use Lemma 3.2

$$\begin{aligned}&E(G_{m-1}+\lambda \varphi _m -wG_{m-1}) \le E(G_{m-1}) - \lambda \langle -E'(G_{m-1}),\varphi _m\rangle \nonumber \\&\quad -\,w\langle E'(G_{m-1}),G_{m-1}\rangle + 2 \rho (E,\Vert \lambda \varphi _m -wG_{m-1}\Vert ) \end{aligned}$$
(3.23)

and estimate

$$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle&\ge t_m \sup _{g\in {\mathcal {D}}} \langle -E'(G_{m-1}),g\rangle \\&= t_m\sup _{\phi \in A_1({\mathcal {D}})} \langle -E'(G_{m-1}),\phi \rangle \ge t_m B^{-1} \langle -E'(G_{m-1}),f^\varepsilon \rangle . \end{aligned}$$

We set \(w^*:=\lambda t_mB^{-1}\) and obtain

$$\begin{aligned}&E(G_{m-1}-w^*G_{m-1}+\lambda \varphi _m)\nonumber \\&\quad \le E(G_{m-1})-\lambda t_mB^{-1}\langle -E'(G_{m-1}),f^\varepsilon -G_{m-1}\rangle + 2 \rho (E,\Vert \lambda \varphi _m -w^*G_{m-1}\Vert ). \end{aligned}$$
(3.24)

By (3.4), we obtain

$$\begin{aligned} \langle -E'(G_{m-1}),f^\varepsilon -G_{m-1}\rangle \ge E(G_{m-1})-E(f^\varepsilon ). \end{aligned}$$

Thus,

$$\begin{aligned}&\inf _{\lambda \ge 0,w}E(G_{m-1}-wG_{m-1}+\lambda \varphi _m)\le E(G_{m-1}) + \inf _{\lambda \ge 0}(-\lambda t_mB^{-1}(E(G_{m-1})\nonumber \\&\quad -\,E(f^\varepsilon )) + 2\rho (E,\Vert \lambda \varphi _m -w^*G_{m-1}\Vert )). \end{aligned}$$
(3.25)

We now estimate

$$\begin{aligned} \Vert w^*G_{m-1}-\lambda \varphi _m\Vert \le w^*\Vert G_{m-1}\Vert +\lambda . \end{aligned}$$

Next, \(G_{m-1}\in D\subset D_1\). Our assumption on the boundedness of \(D_1\), together with the fact that \(0\in D_1\), implies that \(\Vert G_{m-1}\Vert \le C_1:=\mathrm{diam}(D_1)\). Thus, under the assumption \(B\ge 1\), we get

$$\begin{aligned} w^*\Vert G_{m-1}\Vert \le C_1\lambda t_m \le C_1\lambda . \end{aligned}$$

Finally,

$$\begin{aligned} \Vert w^*G_{m-1}-\lambda \varphi _m\Vert \le C_0\lambda ,\quad C_0:=C_1+1. \end{aligned}$$

This completes the proof of Lemma 3.7. \(\square \)

By the definition of \(G_m\)

$$\begin{aligned} E(G_m)\le \inf _{\lambda \ge 0,w}E(G_{m-1}-wG_{m-1}+\lambda \varphi _m)+\delta . \end{aligned}$$

In the case of exact evaluations in the WGAFR(co), we had the monotonicity property \(E(G_0)\ge E(G_1)\ge \cdots \) which implied that \(G_n\in D_0\) for all \(n\). In the case of the \(\hbox {WGAFR}(\delta )\) Lemma 3.7 with \(D=D_{(m-1)\delta }, m\le \delta ^{-1/q}\) implies

$$\begin{aligned} E(G_m)\le E(G_{m-1})+\delta . \end{aligned}$$
(3.26)

Therefore, \(G_m\in D_{m\delta }\) and for all \(m\le N:=[\delta ^{-1/q}]\)

$$\begin{aligned} E(G_m)\le E(0)+1, \end{aligned}$$

which implies \(G_n\in D_1\) for all \(n\le N\).

Denote

$$\begin{aligned} a_n:=E(G_n)-E(f^\varepsilon ). \end{aligned}$$

The number \(B\) in Lemma 3.7 can be taken arbitrarily close to \(A(\varepsilon )\). Therefore, inequality (3.22) implies

$$\begin{aligned} a_m\le a_{m-1} +\inf _{\lambda \ge 0}(-\lambda t A(\varepsilon )^{-1}a_{m-1} +2\gamma (C_0\lambda )^q) +\delta . \end{aligned}$$

This is similar to (3.17), the only difference being that we now cannot guarantee that \(a_{m-1}\ge 0\). However, if \(n\) is the smallest number in \([1,N]\) such that \(a_n<0\), then for \(m\in [n,N]\) inequality (3.26) easily implies \(a_m\le Cm^{1-q}\). Thus, it is sufficient to treat the case \(a_m\ge 0\). We apply Lemma 3.4 with \(v=t A(\varepsilon )^{-1}, B=2\gamma C_0^q\) and complete the proof. \(\square \)

We have discussed above two algorithms, the \(\hbox {WRGA}(\delta )\) and the \(\hbox {REGA}(\delta )\). Results for the \(\hbox {REGA}(\delta )\) (see Theorem 3.5) were derived from the proof of the corresponding results for the \(\hbox {WRGA}(\delta )\) (see Theorem 3.1). We now discuss a companion algorithm for the \(\hbox {WGAFR}(\delta )\) that uses only function evaluations.

\(E\)-Greedy algorithm with free relaxation and error \(\delta \mathbf {(EGAFR(}\delta \mathbf{))}\). We define \(G_0:=0\). For \(m\ge 1\), assuming \(G_{m-1}\) has already been defined, we take \(\varphi _{m} \in \mathcal {D}\) and \(\alpha _m\), \(\beta _{m} \in \mathbb {R}\) satisfying

$$\begin{aligned} E(\alpha _mG_{m-1} + \beta _m\varphi _m) \le \inf _{ \alpha ,\beta \in {\mathbb {R}};g\in \mathcal{D}}E(\alpha G_{m-1} + \beta g) +\delta \end{aligned}$$

and define

$$\begin{aligned} G_m:= \alpha _m G_{m-1} + \beta _m\varphi _m. \end{aligned}$$

In the same way as Theorem 3.5 was derived from the proof of Theorem 3.1, one can derive the following theorem from the proof of Theorem 3.6.

Theorem 3.8

Let \(E\) be a uniformly smooth convex function on \(X\) with modulus of smoothness \(\rho (E,D_1,u)\le \gamma u^q, 1<q\le 2\) and let \(\displaystyle {E^*:=\inf _{x\in X}E(x)=\inf _{x\in D_0}E(x)}\). Then, for the \(\hbox {EGAFR}(\delta )\), we have

$$\begin{aligned} E(G_m)-E^* \le C(E,q,\gamma ) \epsilon _m,\quad m\le \delta ^{-1/q} \end{aligned}$$
(3.27)

where

$$\begin{aligned} \epsilon _m:= \inf \{\epsilon :\ A(\varepsilon )^q m^{1-q}\le \epsilon \} \end{aligned}$$
(3.28)

and \(A(\epsilon )\) is defined by (1.9).

Theorem 2.4 provides the rate of convergence of the \(\hbox {EGA}(\mathcal{C})\) under the assumption that function evaluations are exact and that we can find \(\inf _{g\in \mathcal{D}}\) exactly. However, in practice, we very often cannot evaluate functions exactly and/or cannot find the exact value of the \(\inf _{g\in \mathcal{D}}\). In order to address this issue, we modify the \(\hbox {EGA}(\mathcal{C})\) into the following algorithm \(\hbox {EGA}(\mathcal{C},\delta )\).

\(E\)-Greedy algorithm with coefficients \(\mathcal{C}\) and error \(\delta (\hbox {EGA}(\mathcal{C},\delta ))\). Let \(\delta \in (0,1]\). We define \(G_0:=0\). Then, for each \(m\ge 1\), we have the following inductive definition.

  1. (1)

    \(\varphi _m^\delta \in \mathcal{D}\) is such that

    $$\begin{aligned} E(G_{m-1}+c_m\varphi _m^\delta )\le \inf _{g\in \mathcal{D}}E(G_{m-1}+c_m g)+\delta . \end{aligned}$$
  2. (2)

    Let

    $$\begin{aligned} G_m:=G_{m-1}+c_m\varphi _m^\delta . \end{aligned}$$

We prove an analog of Theorem 2.4 for the \(\hbox {EGA}(\mathcal{C},\delta )\).

Theorem 3.9

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\le \gamma u^q,\, q\in (1,2]\) on \(D_3\). We set \(s:=\frac{2}{1+q}\) and \(\mathcal{C}_s:=\{ck^{-s}\}_{k=1}^\infty \) with \(c\le 1\) chosen in such a way that \(\gamma c^q \sum \nolimits _{k=1}^\infty k^{-sq} \le 1\). Then, the \(\hbox {EGA}(\mathcal{C}_s,\delta )\) provides the following rate: for any \(r\in (0,1-s)\)

$$\begin{aligned} E(G_m)-E^* \le C(r,q,\gamma )m^{-r},\qquad m\le \delta ^{-\frac{1}{1+r}}, \end{aligned}$$

where \(\displaystyle {E^*:=\inf _{x\in A_1(\mathcal{D})}E(x)}\).

We first accumulate some results that we will use in the proof of this theorem. Let \(N:=[\delta ^{-\frac{1}{1+r}}]\), where \([a]\) is the integer part of \(a\) and let \(G_m, m\ge 0\) be the sequence generated by the \(\hbox {EGA}(\mathcal{C}_s,\delta )\).

Claim 1

\(G_m\in D_3\), i.e., \(E(G_m)\le E(0)+3\), for all \(0\le m\le N\).

To see this, let \(t\in (0,1)\) and \(\varphi _m\) be such that

$$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge tE_\mathcal{D}(G_{m-1}),\quad E_\mathcal{D}(G):=\sup _{g\in \mathcal{D}}\langle -E'(G),g\rangle . \end{aligned}$$
(3.29)

Then

$$\begin{aligned} \inf _{g\in \mathcal{D}}E(G_{m-1}+c_mg)\le E(G_{m-1}+c_m\varphi _m). \end{aligned}$$

Thus, it is sufficient to estimate \(E(G_{m-1}+c_m\varphi _m)\) with \(\varphi _m\) satisfying (3.29). By (3.5), under the assumption that \(G_{m-1}\in D_3\), we get, with \(\mu (u):=\gamma u^q\),

$$\begin{aligned} E(G_{m-1}+c_m\varphi _m)\le E(G_{m-1}) +c_m\langle E'(G_{m-1}),\varphi _m\rangle +2\mu (c_m). \end{aligned}$$

Using the definition of \(\varphi _m\), we obtain

$$\begin{aligned} E(G_{m-1}+c_m\varphi _m)\le E(G_{m-1})-c_mtE_\mathcal{D}(G_{m-1})+2\mu (c_m). \end{aligned}$$
(3.30)

We now prove by induction that \(G_m\in D_3\) for all \(m\le N\). Indeed, clearly \(G_0\in D_3\). Suppose that \(G_k\in D_3\), \(k=0,1,\ldots , m-1\); then (3.30) holds with \(k\) in place of \(m\) for all \(k=1,\ldots ,m\) and, therefore,

$$\begin{aligned} E(G_m)\le E(0)+2\sum _{k=1}^m\mu (c_k)+m\delta \le E(0)+3, \end{aligned}$$

proving the claim.

We also need the following lemma from [18].

Lemma 3.10

If \(f\in \mathcal{L}_A\), then for

$$\begin{aligned} G_k:=\sum _{j=1}^kc_j\varphi _j,\quad \varphi _j\in \mathcal{D},\quad j=1,\ldots ,k, \end{aligned}$$

we have

$$\begin{aligned} E_\mathcal{D}(G_k)\ge (E(G_k)-E(f))/(A+A_k),\quad A_k:=\sum _{j=1}^k|c_j|. \end{aligned}$$

Proof of Theorem 3.9

\(E\) attains \(E^*\) at a point \(x^*\in A_1(\mathcal{D})\), so that \(x^*\in \mathcal{L}_1\). If we start with (3.30), use the above lemma with \(f=x^*\) and \(A=1\), and take into account the definition of the \(\hbox {EGA}(\mathcal{C}_s,\delta )\), we obtain

$$\begin{aligned} E(G_{m} )\le E(G_{m-1})-\frac{tc_m(E(G_{m-1})-E^*)}{1+A_{m-1}}+2\gamma c_m^q+\delta . \end{aligned}$$
(3.31)

The left-hand side of (3.31) does not depend on \(t\); therefore, the inequality also holds with \(t=1\):

$$\begin{aligned} E(G_{m} )\le E(G_{m-1})-\frac{c_m(E(G_{m-1})-E^*)}{1+A_{m-1}}+2\gamma c_m^q+\delta . \end{aligned}$$
(3.32)

We have

$$\begin{aligned} A_{m-1}=c\sum _{k=1}^{m-1}k^{-s} \le c\left( 1+\int _1^mx^{-s}\mathrm{d}x\right) =c\left( 1+(1-s)^{-1}(m^{1-s}-1)\right) \end{aligned}$$

and

$$\begin{aligned} 1+A_{m-1}\le 1+ c(1-s)^{-1}m^{1-s}. \end{aligned}$$

Therefore, for \(m\ge C_1\), we have with \(v:=(r+1-s)/2\)

$$\begin{aligned} \frac{c_m}{1+A_{m-1}}\ge \frac{v+1-s}{2(m-1)}. \end{aligned}$$
(3.33)

To conclude the proof, we need the following technical lemma. This lemma is a more general version of Lemma 2.1 from [14] (see also Remark 5.1 in [15] and Lemma 2.37 on p. 106 of [16]).

Lemma 3.11

Let four positive numbers \(\alpha < \beta \le 1, A, U\in {\mathbb {N}}\) be given and let a sequence \(\{a_n\}_{n=1}^\infty \) have the following properties: \( a_1<A\) and we have for all \(n\ge 2\)

$$\begin{aligned} a_n\le a_{n-1}+A(n-1)^{-\alpha }; \end{aligned}$$
(3.34)

if for some \(\nu \ge U\) we have

$$\begin{aligned} a_\nu \ge A\nu ^{-\alpha } \end{aligned}$$

then

$$\begin{aligned} a_{\nu + 1} \le a_\nu (1- \beta /\nu ). \end{aligned}$$
(3.35)

Then, there exists a constant \(C=C(\alpha , \beta ,A,U )\) such that for all \(n=1,2,\ldots \) we have

$$\begin{aligned} a_n \le C n^{-\alpha }. \end{aligned}$$

We apply this lemma with \(a_n:= E(G_n)-E^*, n\le N, a_n:=0, n>N, \alpha :=r, \beta :=v:=(r+1-s)/2, U=C_1\) and \(A\) specified later. Let us check the conditions (3.34) and (3.35) of Lemma 3.11. It is sufficient to check these conditions for \(m<N\). By the inequality

$$\begin{aligned} E(G_m)\le E(G_{m-1})+2\rho (E,c_m) +\delta \le E(G_{m-1})+2\gamma c^q m^{-sq} +\delta \end{aligned}$$

the condition (3.34) holds for \(A\ge 2\gamma c^q+1\). Using \(sq\ge 1+r\) we get

$$\begin{aligned} c_m^q = c^q m^{-sq}\le c^q m^{-1-r},\qquad \delta \le m^{-1-r}. \end{aligned}$$
(3.36)

Assume that \(a_m\ge Am^{-r}\). Setting \(A\) to be big enough to satisfy

$$\begin{aligned} \delta +2\gamma c_m^q \le \frac{A(1-s-\beta )}{2m^{1+r}} \end{aligned}$$

we obtain from (3.32), (3.33), and (3.36)

$$\begin{aligned} a_{m+1}\le a_m(1-\beta /m) \end{aligned}$$

provided \(a_m\ge Am^{-r}\). Thus, (3.35) holds. Applying Lemma 3.11, we get

$$\begin{aligned} a_m\le C(r,q,\gamma )m^{-r}. \end{aligned}$$

This completes the proof of Theorem 3.9. \(\square \)

4 Univariate Convex Optimization

The relaxation step in each of the above algorithms involves either a univariate or a bivariate optimization of a convex function. The univariate optimization problem, called line search, is well studied in optimization theory (see [10]). The purpose of the remaining two sections of this paper is to show that such problems can be solved efficiently. The results of these two sections are known. We present them here for completeness.

In this section, we consider the class \(F\) of functions that are convex on \([0,1]\) and belong to the class \(\mathrm{Lip}_11\), i.e., are Lipschitz with constant \(1\). We are interested in how many function evaluations are needed in order to find, for a given \(\epsilon >0\) and a given \(f\in F\), a point \(x^\epsilon \in [0,1]\) such that

$$\begin{aligned} f(x^\epsilon ) \le \min _{x\in [0,1]} f(x) +\epsilon ? \end{aligned}$$

We begin with a known upper bound.

Proposition 4.1

If the algorithm, described below in the proof of Proposition 4.2, with \(\delta =0\) is applied to any \(f\in F\) and \(m\in {\mathbb {N}}\), then after \(3+2m\) function evaluations, it produces a point \(x_m\in [0,1]\) such that

$$\begin{aligned} f(x_m) \le \min _{x\in [0,1]} f(x) + 2^{-m}. \end{aligned}$$
(4.1)

We next analyze what happens if we do not receive the exact values of \(f\) when we query in the above algorithm. We assume that when we query \(f\) at a point \(x\), we receive the corrupted value \(y(x)\) where \(|f(x)-y(x)|\le \delta \) for each \(x\in [0,1]\). We assume that we know \(\delta \).

Proposition 4.2

Suppose we make function evaluations with an error \(\delta \). The algorithm described below applied to \(f\in F\) and \(m\in {\mathbb {N}}\) takes \(3+2m\) function evaluations and produces a point \(x_m\in [0,1]\) such that

$$\begin{aligned} f(x_m) \le \min _{x\in [0,1]} f(x) + 2^{-m} + (4m+1)\delta . \end{aligned}$$
(4.2)

Proof

In the argument that follows, we use the following property of convex functions. For any \(0\le a<b\le c<d\le 1\) we have

$$\begin{aligned} \frac{f(b)-f(a)}{b-a} \le \frac{f(d)-f(c)}{d-c}. \end{aligned}$$
(4.3)

The proof of Proposition 4.2 proceeds by cases. At the first iteration, we evaluate our function at \(0,1/2,1\). Without loss of generality, we assume that \(y(0)\le y(1)\).

A. Suppose \(y(0)\le y(1/2)\). Then \(f(0) \le f(1/2) +2\delta \) and by (4.3) with \(a=0, b=1/2, c=1/2, d=x, x\in (1/2,1]\) we obtain

$$\begin{aligned} f(x)\ge f(1/2)-2\delta , \quad x\in [1/2,1]. \end{aligned}$$

Therefore, restricting our search for a minimum to \([0,1/2]\), we make an error of at most \(2\delta \).

B. Suppose \(y(1/2)<y(0)\). In this case, we make an additional evaluation of the function at \(1/4\).

Ba. Suppose \(y(1/4)< y(1/2)-2\delta \). Then \(f(1/4)<f(1/2)\) and by (4.3), we obtain that

$$\begin{aligned} \min _{x\in [1/2,1]} f(x) \ge \min _{x\in [0,1/2]} f(x). \end{aligned}$$

Therefore, we can again restrict our search to the interval \([0,1/2]\).

Bb. Suppose \(y(1/4)\ge y(1/2)-2\delta \). In this case, we make an additional evaluation of the function at \(3/4\). If \(y(3/4)< y(1/2)-2\delta \) then as in Ba we can restrict our search to the interval \([1/2,1]\). If \(y(3/4)\ge y(1/2)-2\delta \) we argue as in the case A and obtain

$$\begin{aligned} \min _{x\in [0,1/4]} f(x) \ge \min _{x\in [1/4,1/2]} f(x) - 4\delta ,\quad \min _{x\in [3/4,1]} f(x) \ge \min _{x\in [1/2,3/4]} f(x)-4\delta . \end{aligned}$$

Therefore, we restrict our search to the interval \([1/4,3/4]\) with an error at most \(4\delta \).

At each iteration, we add two evaluations and find that we can restrict our search to an interval of half the size of the previous one while incurring an additional error of at most \(4\delta \). Finally, since we observe the values of \(y\) rather than those of \(f\), we incur one more error of at most \(\delta \). \(\square \)
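The case analysis above translates into the following sketch of the search procedure; the caching of queried values and the choice of returning the midpoint of the final interval are our own reading of the proof, and the toy corruption of \(f\) is purely illustrative.

```python
import math

def noisy_line_search(y, m_iters, delta):
    """Interval-halving search from the proof of Proposition 4.2 for a convex
    Lip-1 function f on [0, 1], given corrupted values y(x) with
    |y(x) - f(x)| <= delta.  Each point is queried at most once (cached)."""
    cache = {}
    def q(x):
        if x not in cache:
            cache[x] = y(x)
        return cache[x]

    a, b = 0.0, 1.0
    for _ in range(m_iters):
        mid = 0.5 * (a + b)
        ya, ym, yb = q(a), q(mid), q(b)
        if ya <= yb:                              # the "WLOG" case of the proof
            if ya <= ym:                          # case A
                b = mid
            else:                                 # case B
                q1 = 0.5 * (a + mid)
                if q(q1) < ym - 2.0 * delta:      # case Ba
                    b = mid
                else:                             # case Bb
                    q3 = 0.5 * (mid + b)
                    if q(q3) < ym - 2.0 * delta:
                        a = mid
                    else:
                        a, b = q1, q3
        else:                                     # mirror image of the above
            if yb <= ym:
                a = mid
            else:
                q3 = 0.5 * (mid + b)
                if q(q3) < ym - 2.0 * delta:
                    a = mid
                else:
                    q1 = 0.5 * (a + mid)
                    if q(q1) < ym - 2.0 * delta:
                        b = mid
                    else:
                        a, b = q1, q3
    return 0.5 * (a + b)

# Illustration: f(x) = |x - 0.3| queried through a corruption of size <= delta.
f = lambda x: abs(x - 0.3)                        # convex, Lipschitz constant 1
delta = 1e-4
y = lambda x: f(x) + delta * math.sin(100.0 * x)  # |y(x) - f(x)| <= delta
x_out = noisy_line_search(y, m_iters=20, delta=delta)
print(x_out, f(x_out))                # f(x_out) is within roughly 2**(-20) + 81*delta of 0
```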

We note that the convexity of the functions from \(F\) plays a dominating role in obtaining the exponential decay of error in Proposition 4.1. For instance, the following simple known statement holds for the \(\mathrm{Lip}_11\) class.

Proposition 4.3

Let \({\mathcal {A}} (m)\) denote the class of (adaptive) algorithms which use at most \(m\) function evaluations and provide an approximation to the minimum value of a function. Then,

$$\begin{aligned} \inf _{A\in {\mathcal {A}}(m)}\sup _{f\in \mathrm{Lip}_11} |\min _{x\in [0,1]}f(x)-A(f)| = \frac{1}{4m}. \end{aligned}$$

Proof

The upper bound follows from evaluating \(f\) at the midpoints \(x_j\) of the intervals \([(j-1)/m,j/m], j=1,\ldots ,m\) and giving the approximate value \(\min \nolimits _{j}f(x_j)-\frac{1}{4m}\). The lower bound follows from the following observation. For any \(m\) points \(0\le \xi _1<\xi _2<\cdots <\xi _m\le 1\) there are two functions \(f_1,f_2\in \mathrm{Lip}_11\) such that \(f_1(\xi _j)=f_2(\xi _j) =0\) for all \(j\) and \(\min \nolimits _{x}f_1(x) -\min \nolimits _xf_2(x) \ge \frac{1}{2m}\). \(\square \)

5 Multivariate Convex Optimization

In this section, we discuss an analog of Proposition 4.1 for \(d\)-variate convex functions on \([0,1]^d\). The \(d\)-variate algorithm is a coordinatewise application of the algorithm from Proposition 4.1 with an appropriate \(\delta \). We begin with a simple lemma.

Lemma 5.1

Let \(f(x), x=(x_1,\ldots ,x_d)\in [0,1]^d\), be a convex function on \([0,1]^d\). Define \(x^d:=(x_1,\ldots ,x_{d-1})\in [0,1]^{d-1}\) and

$$\begin{aligned} f_d(x^d):=\min _{x_d}f(x). \end{aligned}$$

Then \(f_d(x^d)\) is a convex function on \([0,1]^{d-1}\).

Proof

Let \(u,v \in [0,1]^{d-1}\). Then, there are two points \(w,z \in [0,1]^d\) such that

$$\begin{aligned} f_d(u)=f(w),\qquad f_d(v)=f(z) \end{aligned}$$

and \(u=w^d, v=z^d\). From the convexity of \(f(x)\), we have

$$\begin{aligned} f(tw+(1-t)z)\le tf(w)+(1-t)f(z)= tf_d(u)+(1-t)f_d(v),\quad t\in [0,1].\nonumber \\ \end{aligned}$$
(5.1)

Clearly,

$$\begin{aligned} f_d((tw+(1-t)z)^d)\le f(tw+(1-t)z),\quad t\in [0,1]. \end{aligned}$$
(5.2)

Inequalities (5.1) and (5.2) imply that \(f_d(u)\) is convex. \(\square \)

Proposition 5.1

The \(d\)-variate minimization algorithm given below takes as input any \(f\in F\) and \(m\in {\mathbb {N}}\) and produces after \((3+2m)^d\) function evaluations a point \(x_m\in [0,1]^d\) such that

$$\begin{aligned} f(x_m) \le \min _{x\in [0,1]^d} f(x) + 2^{-m}(4m+2)^d. \end{aligned}$$
(5.3)

Proof

We construct the algorithm by induction. In the case \(d=1\), we use the univariate algorithm from Proposition 4.2. Suppose we have already constructed an algorithm for which the proposition holds for \(d-1\) variables. Then, we write

$$\begin{aligned} \min _xf(x) = \min _{x_d} \min _{x^d} f(x) \end{aligned}$$

and observe that by Lemma 5.1 the function \(\displaystyle {g(x_d):=\min \nolimits _{x^d} f(x)}\) is a convex function. Next, we apply the algorithm from Proposition 4.2 with \(\delta = 2^{-m}(4m+2)^{d-1}\) to the function \(g\). By our induction assumption, we evaluate \(g\) with an error at most \(\delta \). Thus, by Proposition 4.2, we get an error at most

$$\begin{aligned} 2^{-m}+(4m+1)\delta \le 2^{-m}(4m+2)^d. \end{aligned}$$

The total number of evaluations is \((3+2m)^d\). This completes the proof. \(\square \)
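A recursive sketch of the \(d\)-variate procedure; the crude univariate routine below works with exact convex values and plays the role of Proposition 4.2 (the bookkeeping of \(\delta \) is omitted), while the recursion mirrors the reduction \(\min _x f=\min _{x_d}\min _{x^d}f\) used in the proof. The choice of test function is our own.

```python
def minimize_1d(h, m_iters=15):
    """Crude minimization of a convex function h on [0, 1]: compare h at the
    two quartile points and keep the three quarters of the interval that must
    contain a minimizer.  (In the recursion below the values of h are only
    approximate; Proposition 4.2 quantifies the robustness to such errors.)"""
    a, b = 0.0, 1.0
    for _ in range(m_iters):
        q1, q3 = a + 0.25 * (b - a), a + 0.75 * (b - a)
        if h(q1) <= h(q3):
            b = q3
        else:
            a = q1
    x = 0.5 * (a + b)
    return x, h(x)

def minimize_recursive(f, d, m_iters=15):
    """Coordinatewise minimization of a convex f on [0, 1]^d (f takes a list
    of d coordinates): g(x_d) := min over the remaining coordinates is convex
    by Lemma 5.1 and is minimized by the same univariate routine, its values
    being computed approximately through recursion."""
    if d == 1:
        x, val = minimize_1d(lambda t: f([t]), m_iters)
        return [x], val
    def g(t):
        _, val = minimize_recursive(lambda rest: f(rest + [t]), d - 1, m_iters)
        return val
    t_best, _ = minimize_1d(g, m_iters)
    rest_best, val = minimize_recursive(lambda rest: f(rest + [t_best]), d - 1, m_iters)
    return rest_best + [t_best], val

# Example: f(x) = max_j |x_j - 0.3| on [0, 1]^3 (convex, Lipschitz constant 1).
f = lambda x: max(abs(t - 0.3) for t in x)
print(minimize_recursive(f, d=3))      # point near (0.3, 0.3, 0.3), value near 0
```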

6 Conclusion

We continue a study initiated by Zhang [20]. The main contribution of this paper is a dimension-independent analysis of unconstrained convex optimization. For that purpose we use algorithms with free relaxation, the \(\hbox {WGAFR}(\delta )\) and the \(\hbox {EGAFR}(\delta )\). An important difference between these algorithms and the one introduced and studied in [20], the \(\hbox {REGA}(\delta )\), is that the \(\hbox {REGA}(\delta )\) is limited to convex combinations \((1-\lambda _m)G_{m-1} +\lambda _m\varphi _m\) and, therefore, it is only applicable to minimization over \(A_1({\mathcal {D}})\). Also, we point out that our analysis is different from that in [20]. In both approaches, the reduction \(E(G_{m-1})-E(G_m)\) at one iteration is analyzed. We analyze it using \(\varphi _m\) satisfying the greedy condition:

$$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal {D}}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$

In [20], an averaging technique is used instead. Our technique works for the WRGA, REGA, WGAFR, and EGAFR.

One more important feature of this paper is that we use function evaluations and do not utilize the gradient \(E'\). Clearly, in the setting of an infinite-dimensional Banach space, the proposed algorithms are not algorithms in a strict sense. However, when \(X\) is finite dimensional and \({\mathcal {D}}\) is finite, they are algorithms in a strict sense. In such a situation, we can compare the complexities of, say, the \(\hbox {WGAFR}(\delta )\), which utilizes \(E'\), and the \(\hbox {EGAFR}(\delta )\), which does not. At the greedy step (1) of the \(\hbox {WGAFR}(\delta )\), generally speaking, we need to evaluate all \(\langle -E'(G_{m-1}),g\rangle , g\in {\mathcal {D}}\), in order to choose \(\varphi _m\). At the greedy step of the \(\hbox {EGAFR}(\delta )\), we need to solve \(N:=|{\mathcal {D}}|\) two-dimensional convex optimization problems. Proposition 5.1 shows that this extra work of optimization requires about \((\log 1/\delta )^2\) function evaluations per dictionary element. Therefore, roughly, for the \(\hbox {WGAFR}(\delta )\), we need to evaluate \(N\) inner products, and for the \(\hbox {EGAFR}(\delta )\), we need to make \(N(\log 1/\delta )^2\) function evaluations. This comparison is under the assumption that, in the case of the \(\hbox {WGAFR}(\delta )\), the gradient \(E'(G_{m-1})\) is known.

The most important results of the paper are in Sect. 3, where we allow approximate evaluations. Theorems 3.1, 3.5, 3.6, and 3.8, proved in that section, demonstrate that for a number of iterations \(m\le \delta ^{-1/q}\) the error \(\delta \) in the approximate evaluations does not affect the upper bound on the error of the optimization algorithm. We do not know if the restriction \(m\le \delta ^{-1/q}\) is the best possible in these theorems. It is known and easy to check on examples from approximation (see, for instance, [16, p. 346]) that the error rate \(m^{1-q}\) for optimization over \(A_1({\mathcal {D}})\) is the best possible.