1 Introduction

To construct sparse approximate solutions to convex optimization problems, we apply the technique known in nonlinear approximation as greedy approximation. Greedy approximation has important applications in signal processing, optimization, statistics, and stochastic PDEs. Greedy algorithms are thoroughly studied in approximation theory, functional analysis, learning theory, signal processing, and optimization. Very often, researchers working in one area are not aware of parallel techniques developed in other areas of research. The main goal of this paper is to demonstrate how typical methods developed for greedy approximation in Banach spaces can be used for studying convex optimization problems.

A typical problem of sparse approximation is the following [15, 40]. Let \(X\) be a Banach space with norm \(\Vert \cdot \Vert \) and \({\mathcal D}\) be a set of elements of \(X\). For a given \({\mathcal D}\), consider the set of all \(m\)-term linear combinations with respect to \({\mathcal D}\) (elements that are \(m\)-sparse with respect to \({\mathcal D}\)):

$$\begin{aligned} \Sigma _m({\mathcal D}):=\left\{ x\in X: x=\sum _{i=1}^m c_ig_i,\quad g_i\in {\mathcal D}\right\} . \end{aligned}$$

We are interested in approximation of a given \(f\in X\) by elements of \(\Sigma _m({\mathcal D})\). The best we can do is

$$\begin{aligned} \sigma _m(f,{\mathcal D}) := \inf _{x\in \Sigma _m({\mathcal D})}\Vert f-x\Vert . \end{aligned}$$
(1)

Greedy algorithms in approximation theory are designed to provide a simple way to build good approximants of \(f\) from \(\Sigma _m({\mathcal D})\). Clearly, problem (1) is an optimization problem of \(E_f(x):=\Vert f-x\Vert \) over the manifold \(\Sigma _m({\mathcal D})\).

A typical problem of convex optimization is to find an approximate solution to the problem

$$\begin{aligned} \inf _x E(x) \end{aligned}$$
(2)

under the assumption that \(E\) is a convex function. In the case that we are optimizing over the whole space \(X\), it is called an unconstrained optimization problem. In many cases, we are interested either in optimizing over \(x\) of special structure (for instance, \(x\in \Sigma _m({\mathcal D})\), as above) or in optimizing over \(x\) from a given domain \(D\) (constrained optimization problem). Greedy algorithms are used for finding an approximate solution of special structure for problem (2).

Usually in convex optimization, the function \(E\) is defined on a finite dimensional space \({\mathbb R}^n\) [9, 31]. Recent needs of numerical analysis call for consideration of the above optimization problem on an infinite dimensional space, for instance, a space of continuous functions. One more important motivating argument for studying this problem in the infinite dimensional setting is that in many contemporary numerical applications, the ambient space \({\mathbb R}^n\) has a large dimension \(n\), and we would like to obtain bounds on the convergence rate independent of the dimension \(n\). Our results for infinite dimensional spaces provide such bounds. Thus, we consider a convex function \(E\) defined on a Banach space \(X\). It is pointed out in [46] that in many engineering applications, researchers are interested in an approximate solution of problem (2) as a linear combination of a few elements from a given system \({\mathcal D}\) of elements. There is an increasing interest in building such sparse approximate solutions using different greedy-type algorithms (see, for instance, [1, 3–5, 11, 12, 17, 19, 22–24, 34, 45, 46]). We refer the reader to the papers [1] and [23] for concise surveys of recent results on greedy algorithms from the point of view of convex optimization and signal processing.

The fundamental question is how to construct good methods (algorithms) of approximation. Recent results have established that greedy-type algorithms are suitable methods of nonlinear approximation in sparse approximation both with regard to bases and with regard to redundant systems. In fact one fundamental principle allows us to build good algorithms both for arbitrary redundant systems and for very simple, well structured bases such as the Haar basis: the use of a greedy step in searching for a new element to be added to a given sparse approximant. By a greedy step, we mean one which maximizes a certain functional determined by information from the previous steps of the algorithm. Varying that functional and the ways of constructing (choosing coefficients of the linear combination) the \(m\)-term approximant from the already found \(m\) elements of the dictionary yields different types of greedy algorithms. For instance, if the corresponding linear combination is a convex combination, then it is the Relaxed Greedy Algorithm (Frank-Wolfe-type algorithm) studied in Sect. 2. In Sect. 3, we study the Chebyshev-type greedy algorithm, which is known in signal processing under the name Fully Corrective Forward Greedy Selection. We use the name Chebyshev in this algorithm because at the approximation step of the algorithm, we use a best approximation operator which bears the name of the Chebyshev projection or the Chebyshev operator. In the case of Hilbert space, the Chebyshev projection is the orthogonal projection, and it is reflected in the name of the algorithm. In this paper, we discuss the weak version of the greedy algorithms. The term weak in the definition of these algorithms means that at the greedy step of selection of a new element of the dictionary, we do not shoot for the optimal element of the dictionary which realizes the corresponding supremum, but are satisfied with a weaker property than being optimal. 
The obvious reason for this is that we do not know in general that the optimal element exists. Another, practical reason is that the weaker the assumption, the easier it is to satisfy it and, therefore, the easier it is to realize in practice. We note that results of this paper provide the same upper bounds for the rate of convergence for the weak versions of the algorithms (in the case \(t_k=t\)) as for the strong versions (\(t=1\)) of the algorithms.

At first glance, the settings of approximation and optimization problems are very different. In the approximation problem, an element \(f\in X\) is given, and our task is to find a sparse approximation of it. In optimization theory, an energy function (loss function) \(E(x)\) is given, and we should find an approximate sparse solution to the minimization problem. It turns out that the same technique can be used for solving both problems.

We show how the technique developed in nonlinear approximation theory, in particular the greedy approximation technique, can be adjusted to find a sparse, with respect to \({\mathcal D}\), solution of the problem (2). We consider three greedy algorithms here: the Weak Chebyshev Greedy Algorithm (WCGA), the Weak Relaxed Greedy Algorithm (WRGA), and the Weak Greedy Algorithm with Free Relaxation (WGAFR). The names of these algorithms used above are from approximation theory. The WCGA is a generalization to a Banach space setting [36] of the Weak Orthogonal Greedy Algorithm (WOGA), which is very important in signal processing and compressed sensing. The WOGA is known in signal processing under the name Weak Orthogonal Matching Pursuit (WOMP). It is used for exact recovery of sparse signals and for approximation of signals by sparse ones. An analog of the WCGA in convex optimization was introduced in [34] under the name Fully Corrective Forward Greedy Selection. The WRGA is the approximation theory analog of the classical Frank-Wolfe algorithm, introduced in [21] and studied in many papers (see, for instance, [12, 14, 17, 18, 22, 23]). This algorithm was rediscovered in statistics and approximation theory in [2] and [25] (see [40] for further discussion).

Two different ideas have been used in the WRGA and WCGA. The first idea was that of relaxation (we use here terminology from approximation theory). The corresponding algorithms (including the WRGA) were designed for approximation of functions from the convex hull conv(\({\mathcal D}\)) of the given system (dictionary) \({\mathcal D}\). The second idea, used in the WCGA, is to build the best approximant from \(\mathrm{span }(\varphi _1,\ldots ,\varphi _m)\), the span of the already chosen elements \(\varphi _j\in {\mathcal D}\), instead of using only one element \(\varphi _m\) to update the approximant, as is done in the WRGA.

The realization of both ideas resulted in the construction of algorithms (WRGA and WCGA) that are good for approximation of functions from conv(\({\mathcal D}\)). The advantage of WCGA over WRGA is that WCGA (under some assumptions on the weakness sequence \(\tau \)) converges for each \(f\in X\) in any uniformly smooth Banach space. The WRGA is simpler than the WCGA in the sense of computational complexity. However, the WRGA has limited applicability. It converges only for elements of the closure of the convex hull of a dictionary.

The WGAFR [38] combines good features of both the WRGA and the WCGA algorithms. The WGAFR performs in the same way as the WCGA from the point of view of convergence and rate of convergence, and outperforms the WCGA in terms of computational complexity. In the WGAFR, we are optimizing over two parameters at each step of the algorithm. In other words, we are looking for the best approximation from a \(2\)-dimensional linear subspace at each step. As far as we know, an analog of the WGAFR has not been studied in optimization.

A number of greedy algorithms have been successfully used in compressed sensing. The early theoretical results [27] on the widths did not consider the question of practical recovery methods. The celebrated contribution of the work by Candes-Tao and Donoho was to show that the recovery can be done by \(\ell _1\) minimization. While the \(\ell _1\) minimization technique plays an important role in designing computationally tractable recovery methods, its complexity is still impractical for many applications. An attractive alternative to \(\ell _1\) minimization is a family of greedy algorithms. They include the Orthogonal Greedy Algorithm [called the Orthogonal Matching Pursuit (OMP) in signal processing], the Regularized Orthogonal Matching Pursuit [30], Compressive Sampling Matching Pursuit (CoSaMP) [29], and the Subspace Pursuit (SP) [13]. The OMP is simpler than CoSaMP and SP; however, at the time of the invention of CoSaMP and SP, those algorithms provided exact recovery of sparse signals and Lebesgue-type inequalities for dictionaries satisfying the Restricted Isometry Property (RIP) [13, 29], while the corresponding results for the OMP were not known. Later, Zhang [48] proved exact recovery of sparse signals and the Lebesgue-type inequalities for the OMP under an RIP condition on the dictionary. The reader can find an extension of Zhang’s results to the Chebyshev Greedy Algorithm in [28] and [43].

The idea of extending greedy-type algorithms designed for approximation to solving optimization problems with sparsity constraints has recently been used in a number of papers. Blumensath [4, 5] extends the Iterative Hard Thresholding (IHT) algorithm [8] to the optimization setting. Bahmani et al. [1] extend the CoSaMP to the optimization setting. Variants of coordinate descent algorithms were considered in [3] and [19]. This confirms that the step from greedy approximation to greedy optimization, to which this paper is devoted, is a timely and important one. We only provide a theoretical analysis of the algorithms here. An earlier version of this paper was published in [41].

Before we proceed to a detailed discussion, we formulate some novel contributions of the paper.

Novelty We provide a unified program to study convergence and rate of convergence of greedy algorithms. This approach works in both the approximation theory setting and the convex optimization setting. It works for an arbitrary dictionary \({\mathcal D}\). We show that this technique works both for the classical Frank-Wolfe algorithm and for the new algorithms in convex optimization (WCGA(co) and WGAFR(co)). We introduce and study a new algorithm, the Weak Greedy Algorithm with Free Relaxation, designed for convex optimization. It combines good features of known algorithms—the simplicity of the Frank-Wolfe-type algorithms and the power of fully corrective algorithms. All the theorems in Sects. 2–4 are new results. In the setting with an arbitrary dictionary \({\mathcal D}\), these results are sharp. This follows from sharpness of the approximation theory analogs of these results (see [40], Ch. 6). Better rate of convergence results can be obtained under stronger assumptions on \(E\) for dictionaries satisfying certain conditions [32, 34, 44].

We note that in the study of greedy-type algorithms in approximation theory [40], emphasis is put on the theory of approximation with respect to an arbitrary dictionary \({\mathcal D}\). The reader can find a detailed discussion of applications of greedy-type algorithms in approximation, classification, and boosting in [46]. The reader can find examples of specific dictionaries of interest in [17, 22, 24, 40, 45], and [23]. We present some results on sparse solutions for convex optimization problems in the setting with an arbitrary dictionary \({\mathcal D}\). In this paper, we analyze algorithms with exact evaluations. There are known results on algorithms with approximate evaluation: for the Frank-Wolfe algorithm, see [18]; for other convex optimization algorithms, see [16] and [46]; for approximation algorithms, see [37]. Clearly, the corresponding dictionary element \(\varphi _m\) that we choose at the greedy step may not be unique. Our results apply for any realization (any choice of \(\varphi _m\)) of the algorithms.

Greedy algorithms for convex optimization Let \(X\) be a Banach space with norm \(\Vert \cdot \Vert \). We say that a set of elements (functions) \({\mathcal D}\) from \(X\) is a dictionary, respectively, symmetric dictionary, if each \(g\in {\mathcal D}\) has norm bounded by one (\(\Vert g\Vert \le 1\)),

$$\begin{aligned} g\in {\mathcal D}\quad \text {implies} \quad -g \in {\mathcal D}, \end{aligned}$$

and the closure of \(\mathrm{span }{\mathcal D}\) is \(X\). For notational convenience, we consider symmetric dictionaries in this paper. The results of the paper also hold for non-symmetric dictionaries with straightforward modifications. For instance, no modifications are needed in the case of the WRGA(co); in the case of the WCGA(co), we need to work at the greedy step (C1) with absolute values of the quantities \(\langle -E'(G_{m-1}),\varphi _m\rangle \) and \(\langle -E'(G_{m-1}),g\rangle \). We denote the closure (in \(X\)) of the convex hull of \({\mathcal D}\) by \(A_1({\mathcal D})\). In other words, \(A_1({\mathcal D})\) is the closure of conv(\({\mathcal D}\)). We use this notation because it has become standard in the relevant greedy approximation literature.

It is pointed out in [20] that there has been considerable interest in solving the convex unconstrained optimization problem

$$\begin{aligned} \min _{x}\frac{1}{2}\Vert y-\Phi x\Vert _2^2 +\lambda \Vert x\Vert _1, \end{aligned}$$
(3)

where \(x\in {\mathbb R}^n\), \(y\in {\mathbb R}^k\), \(\Phi \) is a \(k\times n\) matrix, \(\lambda \) is a nonnegative parameter, \(\Vert v\Vert _2\) denotes the Euclidean norm of \(v\), and \(\Vert v\Vert _1\) is the \(\ell _1\) norm of \(v\). Problems of the form (3) have become familiar over the past three decades, particularly in statistical and signal processing contexts. Problem (3) is closely related to the following convex constrained optimization problem:

$$\begin{aligned} \min _{x}\frac{1}{2}\Vert y-\Phi x\Vert _2^2 \quad \text {subject to}\quad \Vert x\Vert _1\le A. \end{aligned}$$
(4)

The above convex optimization problem can be recast as an approximation problem of \(y\) with respect to a dictionary \({\mathcal D}:=\{\pm \varphi _i\}_{i=1}^n\) which is associated with a \(k\times n\) matrix \(\Phi =[\varphi _1\dots \varphi _n]\) with \(\varphi _j\in {\mathbb R}^k\) being the column vectors of \(\Phi \). The condition \(y\in A_1({\mathcal D})\) is equivalent to the existence of \(x\in {\mathbb R}^n\) such that \(y=\Phi x\) and

$$\begin{aligned} \Vert x\Vert _{1}:=|x_1|+\cdots +|x_n| \le 1. \end{aligned}$$
(5)

As a direct corollary of Theorems 6.8 and 6.23 from [40], we get for any \(y\in A_1({\mathcal D})\) that the WCGA and the WGAFR with \(\tau =\{t\}\) guarantee the following upper bound for the error:

$$\begin{aligned} \Vert y_k\Vert _2\le Ck^{-1/2}, \end{aligned}$$
(6)

where \(y_k\) is the residual after \(k\) iterations. The bound (6) holds for any \({\mathcal D}\) (any \(\Phi \)).
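In this Hilbert-space setting, the WCGA is exactly the Orthogonal Matching Pursuit on the signed columns of \(\Phi \): the greedy step picks the column with the largest inner product with the current residual, and the Chebyshev step is the orthogonal projection onto the span of the chosen columns. A minimal numerical sketch (the random matrix, sparsity level, and iteration count are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 20, 50
Phi = rng.standard_normal((k, n))
Phi /= np.linalg.norm(Phi, axis=0)           # normalize columns so each +/- column lies in D
x_true = np.zeros(n)
x_true[:5] = 0.2                             # ||x_true||_1 = 1, hence y = Phi x_true is in A_1(D)
y = Phi @ x_true

residual, support = y.copy(), []
errors = [np.linalg.norm(residual)]
for _ in range(10):
    # greedy step (t = 1): signed column with the largest inner product with the residual
    i = int(np.argmax(np.abs(Phi.T @ residual)))
    if i not in support:
        support.append(i)
    # Chebyshev step: orthogonal projection of y onto the span of the chosen columns
    coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
    residual = y - Phi[:, support] @ coef
    errors.append(np.linalg.norm(residual))

# residual norms are nonincreasing, since the projection subspace only grows
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errors, errors[1:]))
```

Since each step projects onto a growing subspace, the residual norms never increase; for \(y\in A_1({\mathcal D})\), the bound (6) additionally guarantees decay of order \(k^{-1/2}\).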

For further discussion of an optimization problem more general than (3) and its application in machine learning, we refer the reader to [47].

We assume that the set

$$\begin{aligned} D:=\{x:E(x)\le E(0)\} \end{aligned}$$

is bounded. For a bounded set \(D\), define the modulus of smoothness of \(E\) on \(D\) as follows:

$$\begin{aligned} \rho (E,u):=\frac{1}{2}\sup _{x\in D, \Vert y\Vert =1}|E(x+uy)+E(x-uy)-2E(x)|. \end{aligned}$$
(7)

A typical assumption in convex optimization is of the form (\(\Vert y\Vert =1\)):

$$\begin{aligned} |E(x+uy)-E(x)-\langle E'(x),uy\rangle | \le Cu^2, \end{aligned}$$

which corresponds to the case where \(\rho (E,u)\) is of order \(u^2\). We assume that \(E\) is Fréchet differentiable. Then convexity of \(E\) implies that for any \(x,y\),

$$\begin{aligned} E(y)\ge E(x)+\langle E'(x),y-x\rangle , \end{aligned}$$
(8)

or, in other words,

$$\begin{aligned} E(x)-E(y) \le \langle E'(x),x-y\rangle = \langle -E'(x),y-x\rangle . \end{aligned}$$
(9)

The above assumptions that \(E\) is a Fréchet differentiable convex function and \(D\) is a bounded domain guarantee that \(\inf _xE(x) >-\infty \).
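The quantities above can be checked numerically for the simplest smooth convex function, \(E(x)=\Vert x\Vert _2^2\) on \({\mathbb R}^5\) (an illustrative choice; for this \(E\), the symmetric difference inside the supremum in (7) equals \(u^2\) exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
E = lambda x: float(x @ x)          # E(x) = ||x||_2^2, an illustrative smooth convex function
grad = lambda x: 2 * x              # its gradient E'(x)

x, y = rng.standard_normal(5), rng.standard_normal(5)

# Convexity inequality (8): E(y) >= E(x) + <E'(x), y - x>
assert E(y) >= E(x) + grad(x) @ (y - x) - 1e-12

# The quantity inside the supremum in (7), for a unit vector y and small u
u = 0.01
y_unit = y / np.linalg.norm(y)
half_second_diff = 0.5 * abs(E(x + u * y_unit) + E(x - u * y_unit) - 2 * E(x))
assert abs(half_second_diff - u**2) < 1e-10   # equals u^2 exactly for this E
```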

We note that in all algorithms studied in this paper, the sequence \(\{G_m\}_{m=0}^\infty \) of approximants satisfies the conditions

$$\begin{aligned} G_0=0,\quad E(G_0)\ge E(G_1) \ge E(G_2) \ge \ldots . \end{aligned}$$

This guarantees that \(G_m\in D\) for all \(m\).

We prove convergence and rate of convergence results here. Our setting in an infinite dimensional Banach space makes the convergence results nontrivial. The rate of convergence results are of interest in both finite dimensional and infinite dimensional settings. In these results, we make assumptions on the element minimizing \(E(x)\) (in other words we look for \(\inf _{x\in S}E(x)\) for a special domain \(S\)). A typical assumption in this regard is formulated in terms of the convex hull \(A_1({\mathcal D})\) of the dictionary \({\mathcal D}\).

We have already mentioned above (see (4) and below) an example which is of interest in applications in compressed sensing. We mention another example that attracted a lot of attention in the recent literature (see, for instance, [17, 23, 24, 45]). In this example, \(X\) is a Hilbert space of all real matrices of size \(n\times n\) equipped with the Frobenius norm \(\Vert \cdot \Vert _F\). A dictionary \({\mathcal D}\) is the set of all matrices of rank one normalized in the Frobenius norm. In this case, \(A_1({\mathcal D})\) is the set of matrices with nuclear norm not exceeding \(1\). We are interested in sparse minimization of \(E(x):=\Vert f-x\Vert _F^2\) (sparse approximation of \(f\)) with respect to \({\mathcal D}\).
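For this matrix example, the description of \(A_1({\mathcal D})\) can be verified numerically: any convex combination of Frobenius-normalized rank-one matrices has nuclear norm at most \(1\), since each such atom has nuclear norm exactly \(1\) and the nuclear norm obeys the triangle inequality. A small sketch (sizes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4
# m random rank-one n x n matrices, normalized in the Frobenius norm
atoms = []
for _ in range(m):
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    g = np.outer(u, v)
    atoms.append(g / np.linalg.norm(g))          # ||g||_F = 1, single singular value = 1

weights = rng.random(m)
weights /= weights.sum()                         # a convex combination
G = sum(w * g for w, g in zip(weights, atoms))

nuclear = np.linalg.norm(G, 'nuc')               # nuclear norm = sum of singular values
assert nuclear <= 1 + 1e-10                      # consistent with A_1(D) = nuclear-norm ball
```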

2 The Frank-Wolfe-Type Algorithm

In this section, we discuss an algorithm for finding a sparse approximate solution for a constrained optimization problem

$$\begin{aligned} \inf _{x\in A_1({\mathcal D})} E(x). \end{aligned}$$

The Frank-Wolfe algorithm was introduced in [21] for solving a constrained optimization problem. Later, similar algorithms were used in statistics and approximation theory. A characteristic feature of a Frank-Wolfe-type algorithm is that it is a greedy-type algorithm which builds at each iteration a new approximant as a convex combination of the previous approximant and a new element chosen from the dictionary. There are several versions of this algorithm. The reader can find a corresponding discussion in [23] and [40]. In this section, we study a generalization for optimization problems of the relaxed greedy algorithms in Banach spaces considered in [36].

Let \(\tau := \{t_k\}_{k=1}^\infty \) be a given weakness sequence of numbers \(t_k \in [0,1]\), \(k=1,\dots \).

Weak Relaxed Greedy Algorithm (WRGA(co)) We define \(G_0:=G^{r,\tau }_0 := 0\). Then, for each \(m\ge 1\), we have the following inductive definition:

  1. (R1)

    \(\varphi _m := \varphi ^{r,\tau }_m \in {\mathcal D}\) is any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m - G_{m-1}\rangle \ge t_m \sup _{g\in {\mathcal D}} \langle -E'(G_{m-1}),g - G_{m-1}\rangle . \end{aligned}$$
  2. (R2)

    Find \(0\le \lambda _m \le 1\) such that

    $$\begin{aligned} E((1-\lambda _m)G_{m-1} + \lambda _m\varphi _m) = \inf _{0\le \lambda \le 1}E((1-\lambda )G_{m-1} + \lambda \varphi _m), \end{aligned}$$

    and define

    $$\begin{aligned} G_m:= G^{r,\tau }_m := (1-\lambda _m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$
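To make the two steps concrete, here is a minimal sketch of the WRGA(co) for the illustrative choice \(E(x)=\Vert f-x\Vert _2^2\) with the symmetric dictionary \(\{\pm e_i\}\) of coordinate vectors in \({\mathbb R}^8\) (so \(A_1({\mathcal D})\) is the \(\ell _1\) ball); for this quadratic \(E\), the line search in (R2) has a closed form. All parameter values are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
D = np.vstack([np.eye(d), -np.eye(d)])        # symmetric dictionary {+/- e_i}
f = rng.standard_normal(d)
f *= 0.9 / np.abs(f).sum()                    # ||f||_1 < 1, so f lies in A_1(D)

E = lambda x: float((f - x) @ (f - x))        # E(x) = ||f - x||_2^2, so -E'(x) = 2(f - x)
t = 0.8                                       # weakness parameter t_m = t

G = np.zeros(d)
values = [E(G)]
for _ in range(200):
    r = f - G
    scores = 2 * (D @ r - r @ G)              # <-E'(G), g - G> for each g in D
    m = int(np.argmax(scores >= t * scores.max()))   # (R1): any weakly maximizing element
    phi = D[m]
    # (R2): exact line search; E((1 - lam) G + lam phi) is quadratic in lam
    h = phi - G
    lam = float(np.clip((r @ h) / (h @ h + 1e-15), 0.0, 1.0))
    G = (1 - lam) * G + lam * phi
    values.append(E(G))

# E(G_m) is nonincreasing (cf. Remark 2.1), since lam = 0 is always admissible
assert all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
assert values[-1] < values[0]
```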

Remark 2.1

It follows from the definition of the WRGA that the sequence \(\{E(G_m)\}\) is a nonincreasing sequence.

We call the WRGA(co) relaxed because at the \(m\)th step of the algorithm we use a linear combination (convex combination) of the previous approximant \(G_{m-1}\) and a new element \(\varphi _m\). The relaxation parameter \(\lambda _m\) in the WRGA(co) is chosen at the \(m\)th step depending on \(E\). In algorithms for convex optimization, we use the notation \(G_m\) (\(G\) comes from Greedy), borrowed from greedy approximation algorithms, to stress that the \(m\)th approximant \(G_m\) is obtained by a greedy algorithm. The standard notation for it in optimization theory is \(x^{m}\). In the case \(E(x)=E(f,x,q):= \Vert f-x\Vert ^q\), \(f\in X\), \(q\ge 1\), the WRGA(co) coincides with the Weak Relaxed Greedy Algorithm from approximation theory (see [40], S. 6.3).

We proceed to a theorem on convergence of the WRGA(co). In the formulation of this theorem, we need a special sequence which is defined for a given modulus of smoothness \(\rho (E,u)\) and a given \(\tau = \{t_k\}_{k=1}^\infty \).

Definition 2.2

Let \(\rho (E,u)\) be an even convex function on \((-\infty ,\infty )\) with the property

$$\begin{aligned} \lim _{u\rightarrow 0}\rho (E,u)/u =0. \end{aligned}$$

For any \(\tau = \{t_k\}_{k=1}^\infty \), \(0<t_k\le 1\), and \(\theta > 0\), we define \(\xi _m := \xi _m(\rho ,\tau ,\theta )\) as a number \(u\) satisfying the equation

$$\begin{aligned} \rho (E,u) = \theta t_m u. \end{aligned}$$
(10)

Remark 2.3

Assumptions on \(\rho (E,u)\) imply that the function

$$\begin{aligned} s(u) := \rho (E,u)/u, \quad u\ne 0,\quad s(0) =0, \end{aligned}$$

is a continuous increasing function on \([0,\infty )\). Thus (10) has a unique solution \(\xi _m=s^{-1}(\theta t_m)\) such that \(\xi _m>0\) for \(\theta \le \theta _0:=s(2)\). In this case, we have \( \xi _m(\rho ,\tau ,\theta )\le 2\).

Theorem 2.4

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Assume that a sequence \(\tau :=\{t_k\}_{k=1}^\infty \) satisfies the condition that for any \(\theta \in (0,\theta _0]\), we have

$$\begin{aligned} \sum _{m=1}^\infty t_m \xi _m(\rho ,\tau ,\theta ) =\infty . \end{aligned}$$

Then, for the WRGA(co), we have

$$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in A_1({\mathcal D})}E(x). \end{aligned}$$

The following theorem gives the upper bound on the rate of convergence. In the case of the strong version of the algorithm (\(t_k=1\)), the corresponding rates of convergence were obtained in [14].

Theorem 2.5

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u) \le \gamma u^q\), \(1<q\le 2\). Then, for a sequence \(\tau := \{t_k\}_{k=1}^\infty \), \(t_k \le 1\), \(k=1,2,\dots ,\) we have for any \(f\in A_1({\mathcal D})\) that

$$\begin{aligned} E(G_m)-E(f) \le \left( C_1(q,\gamma )+C_2(q,\gamma )\sum _{k=1}^m t_k^p\right) ^{1-q},\quad p:= \frac{q}{q-1}, \end{aligned}$$

with positive constants \(C_1(q,\gamma )\), \(C_2(q,\gamma )\) which may depend only on \(q\) and \(\gamma \).

3 The Weak Chebyshev Greedy Algorithm

In this section, we study the WCGA designed for convex optimization. Here we are interested in finding a solution to the unconstrained convex optimization problem (2) that is sparse with respect to a given dictionary \({\mathcal D}\). This algorithm provides a sequence \(\{G_m\}\) of sparse approximants such that under mild conditions on \(E\), the sequence \(\{E(G_m)\}\) converges to the minimal value of \(E\). Moreover, if we know that the point of minimum of \(E\) satisfies some conditions, then we can guarantee the rate of convergence of \(\{E(G_m)\}\). We prove the results for a general infinite dimensional Banach space \(X\). In the case that \(X\) is finite dimensional, say, of dimension \(N\), the complexity of the \(m\)th iteration reaches the complexity of the original optimization problem when \(m=N\). This means that it makes sense to use the WCGA(co) with the number of iterations much smaller than \(N\).

We define the following generalization of the WCGA for convex optimization.

Weak Chebyshev Greedy Algorithm (WCGA(co)) We define \(G_0 := 0\). Then for each \(m\ge 1\), we have the following inductive definition:

  1. (C1)

    \(\varphi _m :=\varphi ^{c,\tau }_m \in {\mathcal D}\) is any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal D}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$
  2. (C2)

    Define

    $$\begin{aligned} \Phi _m := \Phi ^\tau _m := \mathrm{span }\{\varphi _j\}_{j=1}^m, \end{aligned}$$

    and define \(G_m := G_m^{c,\tau }\) to be the point from \(\Phi _m\) at which \(E\) attains the minimum:

    $$\begin{aligned} E(G_m)=\inf _{x\in \Phi _m}E(x). \end{aligned}$$
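A minimal sketch of the WCGA(co) for the illustrative choice \(E(x)=\Vert f-x\Vert _2^2\) and a random dictionary of unit vectors in \({\mathbb R}^{15}\); for this quadratic \(E\), the minimization over \(\Phi _m\) in (C2) is a least-squares solve. The dictionary, target, and weakness parameter are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 15, 60
D = rng.standard_normal((N, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm atoms; -g enters via coefficients
f = rng.standard_normal(d)

E = lambda x: float((f - x) @ (f - x))          # E(x) = ||f - x||_2^2, so -E'(x) = 2(f - x)
t = 0.7
support, G = [], np.zeros(d)
values = [E(G)]
for _ in range(25):
    # (C1): weak greedy step; |<-E'(G), g>| covers both g and -g of the symmetric dictionary
    inner = np.abs(D @ (2 * (f - G)))
    m = int(np.argmax(inner >= t * inner.max()))
    if m not in support:
        support.append(m)
    # (C2): for this quadratic E, minimizing over span{phi_1, ..., phi_m} is least squares
    A = D[support].T                            # d x |support| matrix of chosen atoms
    coef, *_ = np.linalg.lstsq(A, f, rcond=None)
    G = A @ coef
    values.append(E(G))

assert all(b <= a + 1e-10 for a, b in zip(values, values[1:]))
assert values[-1] < 1e-10    # once d independent atoms are chosen, the span is all of R^d
```

Since the residual \(f-G_m\) is orthogonal to the chosen atoms after each projection, every greedy step selects a new atom while the residual is nonzero, so at most \(d\) effective steps are needed here.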

In the case \(E(x)=E(f,x,q):= \Vert f-x\Vert ^q\), \(f\in X\), \(q\ge 1\), the WCGA(co) coincides with the Weak Chebyshev Greedy Algorithm from approximation theory (see [40], S. 6.2).

Theorem 3.1

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Assume that a sequence \(\tau :=\{t_k\}_{k=1}^\infty \) satisfies the condition that for any \(\theta \in (0,\theta _0]\), we have

$$\begin{aligned} \sum _{m=1}^\infty t_m \xi _m(\rho ,\tau ,\theta ) =\infty . \end{aligned}$$

Then

$$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in D}E(x). \end{aligned}$$

Here are two simple corollaries of Theorem 3.1.

Corollary 3.2

Let a convex function \(E\) have modulus of smoothness \(\rho (E,u)\) of power type \(1<q\le 2\), that is, \(\rho (E,u) \le \gamma u^q\). Assume that

$$\begin{aligned} \sum _{m=1}^\infty t_m^p =\infty , \quad p=\frac{q}{q-1}. \end{aligned}$$
(11)

Then

$$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in D}E(x). \end{aligned}$$

Corollary 3.3

Let a convex function \(E\) be uniformly smooth. Assume that \(t_k=t\), \(k=1,2, \ldots .\) Then

$$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in D}E(x). \end{aligned}$$

We now proceed to the rate of convergence results.

Theorem 3.4

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\le \gamma u^q\), \(1<q\le 2\). Take a number \(\epsilon \ge 0\) and an element \(f^\epsilon \) from \(D\) such that

$$\begin{aligned} E(f^\epsilon ) \le \inf _{x\in D}E(x)+ \epsilon ,\quad f^\epsilon /B \in A_1({\mathcal D}), \end{aligned}$$

with some number \(B\ge 1\). Then we have for the WCGA(co) (\(p:=q/(q-1)\)),

$$\begin{aligned} E(G_m)-\inf _{x\in D}E(x) \le \max \left( 2\epsilon , C(q,\gamma )B^q\left( C(E,q,\gamma )+\sum _{k=1}^mt_k^p\right) ^{1-q}\right) . \end{aligned}$$
(12)

4 Free Relaxation

Both of the above algorithms, the WCGA(co) and the WRGA(co), use the functional \(E'(G_{m-1})\) in the search for the \(m\)th element \(\varphi _m\) from the dictionary to be used in optimization. The construction of the approximant in the WRGA(co) is different from the construction in the WCGA(co). In the WCGA(co), we build the approximant \(G_m\) so as to maximally use the minimization power of the elements \(\varphi _1, \ldots , \varphi _m\). The WRGA(co) by its definition is designed for working with functions from \(A_1({\mathcal D})\). In building the approximant in the WRGA(co), we keep the property \(G_m\in A_1({\mathcal D})\). As we mentioned in Sect. 2, the relaxation parameter \(\lambda _m\) in the WRGA(co) is chosen at the \(m\)th step depending on \(E\). In this section, we study the following modification [38] of the above idea of relaxation in greedy approximation.

Weak Greedy Algorithm with Free Relaxation (WGAFR(co)) Let \(\tau :=\{t_m\}_{m=1}^\infty \), \(t_m\in [0,1]\), be a weakness sequence. We define \(G_0 := 0\). Then for each \(m\ge 1\), we have the following inductive definition:

  1. (F1)

    \(\varphi _m \in {\mathcal D}\) is any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal D}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$
  2. (F2)

    Find \(w_m\) and \( \lambda _m\) such that

    $$\begin{aligned} E((1-w_m)G_{m-1} + \lambda _m\varphi _m) = \inf _{ \lambda ,w}E((1-w)G_{m-1} + \lambda \varphi _m), \end{aligned}$$

    and define

    $$\begin{aligned} G_m:= (1-w_m)G_{m-1} + \lambda _m\varphi _m. \end{aligned}$$
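A minimal sketch of the WGAFR(co), again for the illustrative quadratic \(E(x)=\Vert f-x\Vert _2^2\): here the two-parameter minimization in (F2) is exactly the best approximation of \(f\) from the two-dimensional span of \(G_{m-1}\) and \(\varphi _m\), computable by least squares. Data and parameters are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 15, 60
D = rng.standard_normal((N, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm atoms
f = rng.standard_normal(d)
E = lambda x: float((f - x) @ (f - x))          # model objective, -E'(x) = 2(f - x)
t = 0.7

G = np.zeros(d)
values = [E(G)]
for _ in range(50):
    # (F1): weak greedy step over the symmetric dictionary
    inner = np.abs(D @ (2 * (f - G)))
    m = int(np.argmax(inner >= t * inner.max()))
    phi = np.sign(D[m] @ (f - G)) * D[m]        # pick the signed copy of the atom
    # (F2): for quadratic E the search over (w, lambda) is a 2D least-squares problem
    # in the plane spanned by G_{m-1} and phi_m
    B = np.column_stack([G, phi])
    coef, *_ = np.linalg.lstsq(B, f, rcond=None)
    G = B @ coef                                # equals (1 - w_m) G_{m-1} + lambda_m phi_m
    values.append(E(G))

# the plane always contains G_{m-1} itself, so E(G_m) is nonincreasing (cf. Remark 4.1)
assert all(b <= a + 1e-10 for a, b in zip(values, values[1:]))
assert values[-1] < values[0]
```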

Remark 4.1

It follows from the definition of the WGAFR(co) that the sequence \(\{E(G_m)\}\) is a nonincreasing sequence.

In the case \(E(x)=E(f,x,q):= \Vert f-x\Vert ^q\), \(f\in X\), \(q\ge 1\), the WGAFR(co) coincides with the Weak Greedy Algorithm with Free Relaxation from approximation theory (see [40], S. 6.4).

We now formulate a convergence theorem for an arbitrary uniformly smooth convex function. The modulus of smoothness \(\rho (E,u)\) of a uniformly smooth convex function is an even convex function such that \(\rho (E,0)=0\) and

$$\begin{aligned} \lim _{u\rightarrow 0}\rho (E,u)/u=0. \end{aligned}$$

Theorem 4.2

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Assume that a sequence \(\tau :=\{t_k\}_{k=1}^\infty \) satisfies the following condition. For any \(\theta \in (0,\theta _0]\), we have

$$\begin{aligned} \sum _{m=1}^\infty t_m \xi _m(\rho ,\tau ,\theta )=\infty . \end{aligned}$$
(13)

Then, for the WGAFR(co), we have

$$\begin{aligned} \lim _{m\rightarrow \infty } E(G_m) =\inf _{x\in D}E(x). \end{aligned}$$

The following theorem gives the rate of convergence.

Theorem 4.3

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\le \gamma u^q\), \(1<q\le 2\). Take a number \(\epsilon \ge 0\) and an element \(f^\epsilon \) from \(D\) such that

$$\begin{aligned} E(f^\epsilon ) \le \inf _{x\in D}E(x)+ \epsilon ,\quad f^\epsilon /B \in A_1({\mathcal D}), \end{aligned}$$

with some number \(B\ge 1\). Then we have (\(p:=q/(q-1)\)) for the WGAFR(co)

$$\begin{aligned} E(G_m)-\inf _{x\in D}E(x) \le \max \left( 2\epsilon , C_1(E,q,\gamma )B^q\left( C_2(E,q,\gamma )+\sum _{k=1}^mt_k^p\right) ^{1-q}\right) . \end{aligned}$$
(14)

5 Discussion

The technique used in this paper is a modification of the corresponding technique developed in approximation theory (see [36, 39] and the book [40]). We now discuss this in more detail. In nonlinear approximation, we use greedy algorithms, for instance WCGA and WGAFR, for solving a sparse approximation problem. The greedy step is the one where we look for \(\varphi _m \in {\mathcal D}\) satisfying

$$\begin{aligned} F_{f_{m-1}}(\varphi _m ) \ge t_m \sup _{g\in {\mathcal D}}F_{f_{m-1}}(g). \end{aligned}$$

This step is based on the norming functional \(F_{f_{m-1}}\). For a nonzero element \(f\in X\), we let \(F_f\) denote a norming (peak) functional for \(f\), that is, a functional with the following properties:

$$\begin{aligned} \Vert F_f\Vert =1,\qquad F_f(f) =\Vert f\Vert . \end{aligned}$$

The existence of such a functional is guaranteed by the Hahn-Banach theorem. The norming functional \(F_f\) is a linear functional (in other words, an element of the dual space \(X^*\)) which can be written explicitly in some cases. In a Hilbert space, \(F_f\) can be identified with \(f\Vert f\Vert ^{-1}\). In the real \(L_p\), \(1<p<\infty \), it can be identified with \(f|f|^{p-2}\Vert f\Vert _p^{1-p}\). The following proposition is well known (see [40], p. 336).

Proposition 5.1

Let \(X\) be a uniformly smooth Banach space. Then, for any \(x\ne 0\) and \(y\), we have

$$\begin{aligned} F_x(y)=\left( \frac{\hbox {d}}{\hbox {d}u}\Vert x+uy\Vert \right) (0)=\lim _{u\rightarrow 0}(\Vert x+uy\Vert -\Vert x\Vert )/u. \end{aligned}$$
(15)
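As a numerical illustration (ours, not part of the original argument), the explicit \(\ell_p\) formula for \(F_f\) and the derivative formula (15) can be checked in finite dimensions; the vectors and the choice \(p=3\) below are arbitrary:

```python
import math

p = 3.0
q = p / (p - 1)  # dual exponent

def norm_p(x, r):
    return sum(abs(t) ** r for t in x) ** (1.0 / r)

def norming_functional(f, p):
    # In real ell_p, F_f can be identified with f |f|^{p-2} ||f||_p^{1-p}.
    nf = norm_p(f, p)
    return [math.copysign(abs(t) ** (p - 1), t) * nf ** (1 - p) for t in f]

def pair(F, y):
    # Action of the functional: F(y) = sum_i F_i y_i.
    return sum(a * b for a, b in zip(F, y))

f = [1.0, -2.0, 0.5, 3.0]
y = [0.3, 1.0, -0.7, 0.2]
F = norming_functional(f, p)

# Defining properties of a norming (peak) functional.
assert abs(pair(F, f) - norm_p(f, p)) < 1e-12     # F_f(f) = ||f||
assert abs(norm_p(F, q) - 1.0) < 1e-12            # ||F_f|| = 1 (dual norm)

# Proposition 5.1: F_f(y) is the derivative of u -> ||f + u y||_p at u = 0.
u = 1e-6
deriv = (norm_p([a + u * b for a, b in zip(f, y)], p) - norm_p(f, p)) / u
assert abs(deriv - pair(F, y)) < 1e-4
```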

Proposition 5.1 says that the norming functional \(F_{f_{m-1}}\) is the derivative of the norm function \(E(x):=\Vert x\Vert \). Clearly, we can reformulate our problem of approximation of \(f\) as an optimization problem with \(E(x):=\Vert f-x\Vert \). It is a convex function; however, it is not uniformly smooth in the sense of smoothness of convex functions. A way around this problem is to consider \(E(f,x,q):=\Vert f-x\Vert ^q\) with an appropriate \(q\). For instance, it is known [10] that if \(\rho (u)\le \gamma u^q\), \(1<q\le 2\), then \(E(f,x,q)\) is a uniformly smooth convex function with modulus of smoothness of order \(u^q\). Next,

$$\begin{aligned} E'(f,x,q)=-q\Vert f-x\Vert ^{q-1}F_{f-x}. \end{aligned}$$

Therefore, the algorithms WCGA(co), WRGA(co), and WGAFR(co) coincide in this case with the corresponding algorithms WCGA, WRGA, and WGAFR from approximation theory. We note that from the definition of modulus of smoothness, we get the following inequality:

$$\begin{aligned} 0\le \Vert x+uy\Vert -\Vert x\Vert -uF_x(y)\le 2\Vert x\Vert \rho (u\Vert y\Vert /\Vert x\Vert ). \end{aligned}$$
(16)

In the proofs of approximation theory results, we use inequality (16) and the trivial inequality

$$\begin{aligned} \Vert x+uy\Vert \ge F_x(x+uy) = \Vert x\Vert +uF_x(y). \end{aligned}$$
(17)
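Both (16) and (17) can be verified numerically. A sketch in \(\ell_2^n\) (our choice of space), where \(F_x(y)=\langle x,y\rangle /\Vert x\Vert \) and the modulus of smoothness of a Hilbert space is known exactly, \(\rho (u)=\sqrt{1+u^2}-1\):

```python
import math
import random

random.seed(0)

def norm2(x):
    return math.sqrt(sum(t * t for t in x))

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def rho_hilbert(u):
    # Exact modulus of smoothness of a Hilbert space.
    return math.sqrt(1.0 + u * u) - 1.0

for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    u = random.uniform(-2, 2)
    nx, ny = norm2(x), norm2(y)
    Fxy = dot(x, y) / nx                    # F_x(y) in a Hilbert space
    lhs = norm2([a + u * b for a, b in zip(x, y)]) - nx - u * Fxy
    # (17): ||x + uy|| >= ||x|| + u F_x(y), i.e. lhs >= 0.
    assert lhs >= -1e-12
    # (16): lhs <= 2 ||x|| rho(|u| ||y|| / ||x||).
    assert lhs <= 2 * nx * rho_hilbert(abs(u) * ny / nx) + 1e-12
```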

In the proofs of optimization theory results, we use Lemma 6.3 instead of inequality (16) and the convexity inequality (8) instead of (17). The rest of the proofs use the same technique of solving the corresponding recurrent inequalities.

We stress that an important contribution of this paper is the fact that it provides convergence and rate of convergence results for an arbitrary dictionary. The authors of the paper [17] make the following comment on the importance of a step from the standard coordinate system basis to a special redundant dictionary: “In this paper, we overcome this conceptual obstacle by considering all possible (normalized) rank-one matrices as coordinates. This set of matrices forms an overcomplete and uncountable infinite basis of the space of matrices. We show that a simple strategy of performing a coordinate descent on this lifted space actually converges to the right solution.”

Our smoothness assumption on \(E\) was used in the proofs of all theorems from Sects. 2–4 in the form of Lemma 6.3. This means that in all those theorems, the assumption that \(E\) has modulus of smoothness \(\rho (E,u)\) can be replaced by the assumption that \(E\) satisfies the inequality

$$\begin{aligned} E(x+uy)-E(x)-u\langle E'(x),y\rangle \le 2\rho (E,u\Vert y\Vert ), \quad x\in D. \end{aligned}$$
(18)

Moreover, in Sect. 2, where we consider the WRGA(co), the approximants \(G_m\) are forced to stay in the convex hull \(A_1({\mathcal D})\). Therefore, in Theorems 2.4 and 2.5, we can use the following inequality instead of (18):

$$\begin{aligned} E(x+u(y-x))-E(x)-u\langle E'(x),y-x\rangle \le 2\rho (E,u\Vert y-x\Vert ), \end{aligned}$$
(19)

for \(x,y\in A_1({\mathcal D})\) and \(u\in [0,1]\).
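Inequality (18) (and, with the obvious substitution, (19)) can be checked on a toy example. For \(E(x)=\Vert f-x\Vert _2^2\) in \(\ell_2^n\) (our own choice of objective), the left-hand side equals \(u^2\Vert y\Vert ^2\) exactly, so (18) holds with \(\rho (E,u)=u^2/2\):

```python
import random

random.seed(1)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

f = [2.0, -1.0, 0.5]

def E(x):
    # E(x) = ||f - x||_2^2
    return sum((a - b) ** 2 for a, b in zip(f, x))

def E_prime(x):
    # E'(x) = -2 (f - x)
    return [-2.0 * (a - b) for a, b in zip(f, x)]

for _ in range(200):
    x = [random.uniform(-3, 3) for _ in range(3)]
    y = [random.uniform(-3, 3) for _ in range(3)]
    u = random.uniform(-1, 1)
    xu = [a + u * b for a, b in zip(x, y)]
    lhs = E(xu) - E(x) - u * dot(E_prime(x), y)
    # lhs = u^2 ||y||^2 = 2 rho(E, u ||y||) with rho(E, u) = u^2 / 2.
    assert abs(lhs - u * u * dot(y, y)) < 1e-9
```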

We note that smoothness assumptions in the form of (19) with \(\rho (E,u\Vert y-x\Vert )\) replaced by \(C\Vert y-x\Vert ^q\) were used in many papers [12, 14, 18, 22, 45]. For instance, the authors of [45] studied the version of WRGA(co) with weakness sequence \(t_k=1\), \(k=1,2,\ldots \). They proved Theorem 2.5 in this case. Their proof, like our proof in Sect. 2, is very close to the corresponding proof from greedy approximation (see [36, 39] Sect. 3.3 or [40] Sect. 6.3).

We now make some general remarks on the results of this paper. A typical problem of convex optimization is to find an approximate solution to the problem

$$\begin{aligned} w:=\inf _x E(x). \end{aligned}$$
(20)

In this paper, we study sparse (with respect to a given dictionary \({\mathcal D}\)) solutions of (20). This means that we are solving the following problem instead of (20). For a given dictionary \({\mathcal D}\), consider the set of all \(m\)-term linear combinations with respect to \({\mathcal D}\):

$$\begin{aligned} \Sigma _m({\mathcal D}):=\left\{ x\in X: x=\sum _{i=1}^m c_ig_i,\quad g_i\in {\mathcal D}\right\} . \end{aligned}$$

We solve the following sparse optimization problem:

$$\begin{aligned} w_m:=\inf _{x\in \Sigma _m({\mathcal D})} E(x). \end{aligned}$$
(21)

In this paper, we have used greedy-type algorithms to approximately solve problem (21). The results of the paper show that greedy-type algorithms with respect to \({\mathcal D}\) solve problem (20) as well.

We are interested in a solution from \(\Sigma _m({\mathcal D})\). Clearly, when we optimize a linear form \(\langle F,g\rangle \) over the dictionary \({\mathcal D}\), we obtain the same value as optimization over the convex hull \(A_1({\mathcal D})\). We often use this property (see Lemma 6.2). However, at the greedy step of our algorithms, we choose

(1) \(\varphi _m :=\varphi ^{c,\tau }_m \in {\mathcal D}\) is any element satisfying

    $$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal D}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$

Thus if we replace the dictionary \({\mathcal D}\) by its convex hull \(A_1({\mathcal D})\), we may take an element satisfying the above greedy condition which is not from \({\mathcal D}\) and could even be an infinite combination of the dictionary elements.

Next, we begin with a Banach space \(X\) and a convex function \(E(x)\) defined on this space. Properties of this function \(E\) are formulated in terms of the Banach space \(X\). If instead of \(X\) we consider another Banach space, for instance, the one with \(A_1({\mathcal D})\) as its unit ball, then the properties of \(E\) will change. For instance, a typical example of \(E\) could be \(E(x):=\Vert f-x\Vert ^q\), with \(\Vert \cdot \Vert \) being the norm of the Banach space \(X\). Then our assumption that the set \( D:=\{x:E(x)\le E(0)\} \) is bounded is satisfied. However, this set is not necessarily bounded in the norm generated by \(A_1({\mathcal D})\).

All three greedy-type algorithms studied in this paper use the derivative \(E'\) of the objective function \(E\). The derivative \(E'\) is a linear functional on \(X\) or, in other words, an element of the dual space \(X^*\). By analogy with approximation theory, we call these algorithms dual greedy algorithms. An important feature of the dual greedy algorithms is that they can be modified into a weak form. We obtained our results for any weakness sequence \(\tau =\{t_k\}_{k=1}^\infty \), \(t_k\in [0,1]\), \(k=1,2,\,\ldots .\) In particular, some of the \(t_k\) could be equal to zero. In the case \(t_k=0\), we can choose any element \(\varphi _k\in {\mathcal D}\) at the greedy step. This allows us to shape our approximant at the \(k\hbox {th}\) iteration by criteria other than greedy selection. Nevertheless, our results guarantee convergence and rate of convergence for \(\tau \) satisfying the corresponding conditions.

Using the example of three greedy-type algorithms, we have demonstrated how the technique developed in greedy approximation theory can be modified for finding sparse solutions of convex optimization problems. We discussed only three greedy-type algorithms in this paper. There are many other greedy-type algorithms studied in approximation theory. Some of them (the ones providing expansions) are generalized for the convex optimization problem in [42]. There is an important class of greedy-type algorithms, namely the thresholding-type algorithms, that has not yet been generalized to the convex optimization setting.

For the reader’s convenience, we now give a brief general description and classification of greedy-type algorithms for convex optimization.

The most difficult part of an algorithm is finding an element \(\varphi _m\in {\mathcal D}\) to be used in the approximation process. We consider greedy methods for this task. There are two types of greedy steps for finding \(\varphi _m\in {\mathcal D}\).

I. Gradient greedy step. At this step, we look for an element \(\varphi _m\in {\mathcal D}\) such that

$$\begin{aligned} \langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal D}}\langle -E'(G_{m-1}),g\rangle . \end{aligned}$$

Algorithms that use the first derivative of the objective function \(E\) are called first-order optimization algorithms.

II. \(E\)-greedy step. At this step, we look for an element \(\varphi _m\in {\mathcal D}\) which satisfies (we assume existence):

$$\begin{aligned} \inf _{c\in {\mathbb R}}E(G_{m-1}+c\varphi _m) = \inf _{g\in {\mathcal D},c\in {\mathbb R}}E(G_{m-1}+cg) . \end{aligned}$$

Algorithms that only use the values of the objective function \(E\) are called zero-order optimization algorithms.

The above WGAFR(co) uses the greedy step of type I. In this paper, we only discuss algorithms based on the greedy step of type I. These algorithms fall into a category of the first-order methods. The greedy step of type II uses only the function values \(E(x)\). We discussed some of the algorithms of this type in [42] and plan to study them in our future work.
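With a finite dictionary and a simple quadratic objective (our own toy choices), the two greedy steps look as follows; for this \(E\) the one-dimensional line search in step II has a closed form:

```python
# Contrast the gradient greedy step (I) with the E-greedy step (II)
# for E(x) = ||f - x||_2^2 and a small finite dictionary (a toy sketch).

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

f = [3.0, 1.0, -2.0]
G = [0.0, 0.0, 0.0]                      # current approximant G_{m-1}
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def E(x):
    return sum((a - b) ** 2 for a, b in zip(f, x))

def minus_E_prime(x):
    # -E'(x) = 2 (f - x)
    return [2.0 * (a - b) for a, b in zip(f, x)]

def line_search_value(G, g):
    # inf_c E(G + c g); for this quadratic E, the optimal c is
    # <f - G, g> / <g, g>, giving the value E(G) - <f - G, g>^2 / <g, g>.
    r = [a - b for a, b in zip(f, G)]
    return E(G) - dot(r, g) ** 2 / dot(g, g)

# Step I (with t_m = 1): maximize the linear functional <-E'(G), g>.
phi_I = max(dictionary, key=lambda g: dot(minus_E_prime(G), g))
# Step II: minimize the exact line-search value of E.
phi_II = min(dictionary, key=lambda g: line_search_value(G, g))

assert phi_I == [1.0, 0.0, 0.0]
assert phi_II == [1.0, 0.0, 0.0]
```

For this particular \(E\) and \(G_{m-1}=0\), the two steps select the same element; in general they need not.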

After finding \(\varphi _m\in {\mathcal D}\), we can proceed in different ways. We now list some typical steps that are motivated by the corresponding steps in greedy approximation theory [40]. These steps or their variants are used in different optimization algorithms (see, for instance, [6, 7, 21, 26, 31, 33]).

(A) Best step in the direction \(\varphi _m\in {\mathcal D}\). We choose \(c_m\) such that

    $$\begin{aligned} E(G_{m-1}+c_m\varphi _m) = \inf _{c\in {\mathbb R}}E(G_{m-1}+c\varphi _m) \end{aligned}$$

    and define

    $$\begin{aligned} G_m:=G_{m-1}+c_m\varphi _m. \end{aligned}$$
(B) Shortened best step in the direction \(\varphi _m\in {\mathcal D}\). We choose \(c_m\) as in (A) and, for a given parameter \(b>0\), define

    $$\begin{aligned} G_m^b:=G_{m-1}^b+bc_m\varphi _m. \end{aligned}$$

    Usually, \(b\in (0,1)\). This is why we call it shortened.

(C) Chebyshev-type (fully corrective) methods. We choose \(G_m\in \mathrm{span}(\varphi _1,\ldots ,\varphi _m)\) which satisfies

    $$\begin{aligned} E(G_m)=\inf _{c_j,j=1,\ldots ,m}E(c_1\varphi _1+\cdots +c_m\varphi _m). \end{aligned}$$
(D) Fixed relaxation. For a given sequence \(\{r_k\}_{k=1}^\infty \) of relaxation parameters \(r_k\in [0,1)\), we choose \(G_m:=(1-r_m)G_{m-1}+c_m\varphi _m\), with \(c_m\) from

    $$\begin{aligned} E((1-r_m)G_{m-1}+c_m\varphi _m)=\inf _{c\in {\mathbb R}}E((1-r_m)G_{m-1}+c\varphi _m). \end{aligned}$$
(F) Free relaxation. We choose \(G_m\in \mathrm{span}(G_{m-1},\varphi _m)\) which satisfies

    $$\begin{aligned} E(G_m)=\inf _{c_1,c_2}E(c_1G_{m-1}+ c_2\varphi _m). \end{aligned}$$
(G) Prescribed coefficients. For a given sequence \(\{c_k\}_{k=1}^\infty \) of positive coefficients, in the case of greedy step I, we define

    $$\begin{aligned} G_m:=G_{m-1}+c_m\varphi _m. \end{aligned}$$
    (22)

    In the case of greedy step II, we define \(G_m\) by formula (22) with the greedy step II modified as follows: \(\varphi _m\in {\mathcal D}\) is an element satisfying

    $$\begin{aligned} E(G_{m-1}+c_m\varphi _m)=\inf _{g\in {\mathcal D}}E(G_{m-1}+c_mg). \end{aligned}$$

All algorithms studied in this paper fall into category I, algorithms with a gradient-type greedy step. The reader can find a study of algorithms with the \(E\)-greedy-type step in [16] and [42]. Step (C) corresponds to the Weak Chebyshev Greedy Algorithm from Sect. 3, and step (F) corresponds to the Weak Greedy Algorithm with Free Relaxation from Sect. 4. The reader can find a detailed study of algorithms with step (G) in [42].
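Steps (A) and (F) can be made concrete in code. A sketch (our own toy setup) for the quadratic \(E(x)=\Vert f-x\Vert _2^2\), where both minimizations have closed forms; since \(\mathrm{span}(G_{m-1},\varphi _m)\) contains every \(G_{m-1}+c\varphi _m\), free relaxation can only improve on the best step:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

f = [2.0, -1.0, 3.0]

def E(x):
    return sum((a - b) ** 2 for a, b in zip(f, x))

def best_step(G, phi):
    # (A): G_m = G + c phi with c minimizing E; here c = <f - G, phi> / <phi, phi>.
    r = [a - b for a, b in zip(f, G)]
    c = dot(r, phi) / dot(phi, phi)
    return [a + c * p for a, p in zip(G, phi)]

def free_relaxation(G, phi):
    # (F): minimize E over span(G, phi), a 2-variable least-squares problem.
    # Solve the 2x2 normal equations for the coefficients (c1, c2) of (G, phi).
    a11, a12, a22 = dot(G, G), dot(G, phi), dot(phi, phi)
    b1, b2 = dot(f, G), dot(f, phi)
    det = a11 * a22 - a12 * a12
    c1 = (b1 * a22 - b2 * a12) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return [c1 * g + c2 * p for g, p in zip(G, phi)]

G = [1.0, 1.0, 0.0]          # some current approximant
phi = [0.0, 1.0, 1.0]        # the element chosen at the greedy step
GA = best_step(G, phi)
GF = free_relaxation(G, phi)
assert E(GA) <= E(G) + 1e-12      # (A) never increases E
assert E(GF) <= E(GA) + 1e-12     # (F) searches a larger set than (A)
```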

6 Proofs

We begin with the following two simple and well-known lemmas.

Lemma 6.1

Let \(E\) be a uniformly smooth convex function on a Banach space \(X\) and \(L\) be a finite-dimensional subspace of \(X\). Let \(x_L\) denote the point from \(L\) at which \(E\) attains the minimum:

$$\begin{aligned} E(x_L)=\inf _{x\in L}E(x). \end{aligned}$$

Then we have

$$\begin{aligned} \langle E'(x_L),\phi \rangle =0 \end{aligned}$$

for any \(\phi \in L\).
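Lemma 6.1 is the first-order optimality condition on a subspace. A quick numerical sketch (with our toy quadratic \(E(x)=\Vert f-x\Vert _2^2\), whose minimizer over \(L\) is the orthogonal projection of \(f\) onto \(L\)):

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

f = [1.0, 2.0, 3.0, 4.0]
# L = span(v1, v2); E(x) = ||f - x||_2^2, so E'(x) = -2 (f - x).
v1 = [1.0, 0.0, 1.0, 0.0]
v2 = [0.0, 1.0, 0.0, 1.0]

# Minimize E over L: solve the 2x2 normal equations (orthogonal projection).
a11, a12, a22 = dot(v1, v1), dot(v1, v2), dot(v2, v2)
b1, b2 = dot(f, v1), dot(f, v2)
det = a11 * a22 - a12 * a12
c1 = (b1 * a22 - b2 * a12) / det
c2 = (a11 * b2 - a12 * b1) / det
xL = [c1 * a + c2 * b for a, b in zip(v1, v2)]

E_prime = [-2.0 * (a - b) for a, b in zip(f, xL)]
# Lemma 6.1: <E'(x_L), phi> = 0 for every phi in L.
assert abs(dot(E_prime, v1)) < 1e-12
assert abs(dot(E_prime, v2)) < 1e-12
```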

Lemma 6.2

For any bounded linear functional \(F\) and any dictionary \({\mathcal D}\), we have

$$\begin{aligned} \sup _{g\in {\mathcal D}}\langle F,g\rangle = \sup _{f\in A_1({\mathcal D})} \langle F,f\rangle . \end{aligned}$$

See [40], p. 343 for the proof.
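Numerically, Lemma 6.2 reflects the fact that a linear functional attains its supremum over \(A_1({\mathcal D})\) (taken here, as an assumption on our part, to be the closure of the convex hull of \(\pm {\mathcal D}\), the usual convention) at a dictionary element. A sketch with a finite dictionary, where random convex combinations never beat the best \(\pm g\):

```python
import random

random.seed(2)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

F = [0.5, -1.5, 2.0]
dictionary = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]

# sup of <F, .> over the symmetrized dictionary {+-g}.
sup_dict = max(abs(dot(F, g)) for g in dictionary)

# Sample elements of A_1(D): convex combinations sum_i a_i (eps_i g_i).
for _ in range(500):
    w = [random.random() for _ in dictionary]
    s = sum(w)
    a = [t / s for t in w]
    eps = [random.choice([-1.0, 1.0]) for _ in dictionary]
    x = [sum(a[i] * eps[i] * g[j] for i, g in enumerate(dictionary))
         for j in range(3)]
    assert dot(F, x) <= sup_dict + 1e-12
```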

We will often use the following simple lemma.

Lemma 6.3

Let \(E\) be a Fréchet differentiable convex function. Then the following inequality holds for \(x\in D\):

$$\begin{aligned} 0\le E(x+uy)-E(x)-u\langle E'(x),y\rangle \le 2\rho (E,u\Vert y\Vert ). \end{aligned}$$
(23)

Proof

The left inequality follows directly from (8). Next, from the definition of modulus of smoothness, it follows that

$$\begin{aligned} E(x+uy)+E(x-uy)\le 2(E(x)+\rho (E,u\Vert y\Vert )). \end{aligned}$$
(24)

Inequality (8) gives

$$\begin{aligned} E(x-uy)\ge E(x) + \langle E'(x),-uy\rangle =E(x)-u\langle E'(x),y\rangle . \end{aligned}$$
(25)

Combining (24) and (25), we obtain

$$\begin{aligned} E(x+uy)\le E(x)+u\langle E'(x),y\rangle +2\rho (E,u\Vert y\Vert ). \end{aligned}$$

This proves the second inequality.\(\square \)

We begin with the proofs of Theorems 3.1 and 3.4 for the WCGA(co) because these proofs are the simplest ones. Then we present proofs for results on the WRGA(co) and the WGAFR(co). Some parts of these proofs are similar to the corresponding parts of proofs of Theorems 3.1 and 3.4. We will not duplicate these parts and refer the reader to the proofs of Theorems 3.1 and 3.4.

The following lemma is a key lemma in studying convergence and rate of convergence of WCGA(co).

Lemma 6.4

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Take a number \(\epsilon \ge 0\) and an element \(f^\epsilon \) from \(D\) such that

$$\begin{aligned} E(f^\epsilon ) \le \inf _{x\in X}E(x)+ \epsilon ,\quad f^\epsilon /B \in A_1({\mathcal D}), \end{aligned}$$

with some number \(B\ge 1\). Then we have for the WCGA(co),

$$\begin{aligned}&E(G_m)-E(f^\epsilon ) \le E(G_{m-1})-E(f^\epsilon )\\&\quad + \inf _{\lambda \ge 0}(-\lambda t_mB^{-1}(E(G_{m-1})-E(f^\epsilon )) + 2\rho (E,\lambda )) \end{aligned}$$

for \( m=1,2,\,\ldots .\)

Proof

The main idea of the proof is the same as in the proof of the corresponding one-step improvement inequality for the WCGA (see, for instance, [40], pp. 343–344). It follows from the definition of the WCGA(co) that \(E(0)\ge E(G_1)\ge E(G_2)\ge \cdots \). Therefore, if \(E(G_{m-1})-E(f^\epsilon )\le 0\), then the claim of Lemma 6.4 is trivial. Assume \(E(G_{m-1})-E(f^\epsilon )>0\). By Lemma 6.3, we have for any \(\lambda \),

$$\begin{aligned} E(G_{m-1}+\lambda \varphi _m) \le E(G_{m-1}) - \lambda \langle -E'(G_{m-1}),\varphi _m\rangle + 2 \rho (E,\lambda ), \end{aligned}$$
(26)

and by (C1) from the definition of the WCGA(co) and Lemma 6.2, we get

$$\begin{aligned}&\langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal D}} \langle -E'(G_{m-1}),g\rangle \\&\quad =t_m\sup _{\phi \in A_1({\mathcal D})} \langle -E'(G_{m-1}),\phi \rangle \ge t_m B^{-1} \langle -E'(G_{m-1}),f^\epsilon \rangle . \end{aligned}$$

By Lemma 6.1 with \(x_L=G_{m-1}\) and by convexity (9), we obtain

$$\begin{aligned} \langle -E'(G_{m-1}),f^\epsilon \rangle = \langle -E'(G_{m-1}),f^\epsilon -G_{m-1}\rangle \ge E(G_{m-1})-E(f^\epsilon ). \end{aligned}$$

Thus,

$$\begin{aligned} E(G_m)&\le \inf _{\lambda \ge 0} E(G_{m-1}+ \lambda \varphi _m)\nonumber \\&\le E(G_{m-1}) + \inf _{\lambda \ge 0}(-\lambda t_mB^{-1}(E(G_{m-1})-E(f^\epsilon )) + 2\rho (E,\lambda )), \end{aligned}$$
(27)

which proves the lemma.\(\square \)

Proof of Theorem 3.1

The definition of the WCGA(co) implies that \(\{E(G_m)\}\) is a nonincreasing sequence. Therefore we have

$$\begin{aligned} \lim _{m\rightarrow \infty }E(G_m) =a. \end{aligned}$$

Define

$$\begin{aligned} b:=\inf _{x\in D}E(x),\quad \alpha := a-b. \end{aligned}$$

We prove that \(\alpha =0\) by contradiction. Assume to the contrary that \(\alpha >0\). Then, for any \(m\), we have

$$\begin{aligned} E(G_m)-b \ge \alpha . \end{aligned}$$

We set \(\epsilon =\alpha /2\) and find \(f^\epsilon \) such that

$$\begin{aligned} E(f^\epsilon ) \le b+\epsilon \quad \text {and}\quad f^\epsilon /B \in A_1({\mathcal D}), \end{aligned}$$

with some \(B\ge 1\). Then, by Lemma 6.4, we get

$$\begin{aligned} E(G_m)-E(f^\epsilon ) \le E(G_{m-1}) -E(f^\epsilon ) + \inf _{\lambda \ge 0} (-\lambda t_mB^{-1}\alpha /2 +2\rho (E,\lambda )). \end{aligned}$$

Let us specify \(\theta :=\min \left( \theta _0,\frac{\alpha }{8B}\right) \) and take \(\lambda = \xi _m(\rho ,\tau ,\theta )\). Then we obtain

$$\begin{aligned} E(G_m) \le E(G_{m-1}) -2\theta t_m\xi _m. \end{aligned}$$

The assumption

$$\begin{aligned} \sum _{m=1}^\infty t_m\xi _m =\infty \end{aligned}$$

brings a contradiction, which proves the theorem.\(\square \)

Proof of Theorem 3.4

Define

$$\begin{aligned} a_n:=E(G_n)-E(f^\epsilon ). \end{aligned}$$

The sequence \(\{a_n\}\) is nonincreasing. If \(a_n\le 0\) for some \(n\le m\), then \(E(G_m)-E(f^\epsilon )\le 0\) and \(E(G_m)-\inf _{x\in D}E(x) \le \epsilon \), which implies (12). Thus we assume that \(a_n>0\) for \(n\le m\).

By Lemma 6.4, we have

$$\begin{aligned} a_m \le a_{m-1}+\inf _{\lambda \ge 0} \left( -\frac{\lambda t_m a_{m-1}}{B} + 2\gamma \lambda ^q\right) . \end{aligned}$$
(28)

Choose \(\lambda \) from the equation

$$\begin{aligned} \frac{\lambda t_m a_{m-1}}{B} = 4\gamma \lambda ^q, \end{aligned}$$

which implies that

$$\begin{aligned} \lambda =\left( \frac{ t_m a_{m-1}}{4\gamma B}\right) ^{\frac{1}{q-1}} . \end{aligned}$$

Let

$$\begin{aligned} A_q := 2(4\gamma )^{\frac{1}{q-1}}. \end{aligned}$$

Using the notation \(p:= \frac{q}{q-1}\), we get from (28),

$$\begin{aligned} a_m \le a_{m-1}\left( 1-\frac{\lambda t_m}{2B} \right) = a_{m-1}\left( 1-t_m^pa_{m-1}^{\frac{1}{q-1}}/(A_qB^{p})\right) . \end{aligned}$$

Raising both sides of this inequality to the power \(\frac{1}{q-1}\) and taking into account the inequality \(x^r\le x\) for \(r\ge 1\), \(0\le x\le 1\), we obtain

$$\begin{aligned} a_m^{\frac{1}{q-1}} \le a_{m-1}^{\frac{1}{q-1}} \left( 1-t^p_ma_{m-1}^{\frac{1}{q-1}}/(A_qB^{p})\right) . \end{aligned}$$


We now need a simple known lemma [35].

Lemma 6.5

Suppose that a sequence \(y_1\ge y_2\ge \cdots \ge 0\) satisfies inequalities

$$\begin{aligned} y_k\le y_{k-1}(1-w_ky_{k-1}),\quad w_k\ge 0, \end{aligned}$$

for \(k>n\). Then for \(m>n\), we have

$$\begin{aligned} \frac{1}{y_m}\ge \frac{1}{y_n}+\sum _{k=n+1}^m w_k. \end{aligned}$$

Proof

It follows from the chain of inequalities

$$\begin{aligned} \frac{1}{y_k}\ge \frac{1}{y_{k-1}}(1-w_ky_{k-1})^{-1} \ge \frac{1}{y_{k-1}}(1+w_ky_{k-1})=\frac{1}{y_{k-1}}+w_k. \end{aligned}$$

\(\square \)

By Lemma 6.5 with \(y_k:=a_k^{\frac{1}{q-1}}\), \(n=0\), \(w_k=t^p_k/(A_qB^{p})\), we get

$$\begin{aligned} a_m^{\frac{1}{q-1}} \le C_1(q,\gamma ) B^{p}\left( C(E,q,\gamma )+\sum _{n=1}^m t_n^p\right) ^{-1}, \end{aligned}$$

which implies

$$\begin{aligned} a_m\le C(q,\gamma )B^q\left( C(E,q,\gamma )+\sum _{n=1}^m t_n^p\right) ^{1-q}. \end{aligned}$$

Theorem 3.4 is now proved.
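Lemma 6.5, used in the proof above, can be confirmed numerically by generating a sequence that satisfies the recurrence with equality (a sketch; the initial value and the random weights are arbitrary choices of ours):

```python
import random

random.seed(3)

y = [0.9]                      # y_0
w = [0.0]                      # w_k for k >= 1 (w[0] is unused)
for k in range(1, 40):
    wk = random.uniform(0.0, 0.5)
    w.append(wk)
    # Equality case of the recurrence y_k <= y_{k-1} (1 - w_k y_{k-1});
    # with these ranges, w_k y_{k-1} < 1, so the sequence stays positive.
    y.append(y[-1] * (1.0 - wk * y[-1]))

# Lemma 6.5 with n = 0: 1/y_m >= 1/y_0 + sum_{k=1}^m w_k.
for m in range(1, 40):
    assert 1.0 / y[m] >= 1.0 / y[0] + sum(w[1:m + 1]) - 1e-12
```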

Proof of Theorems 2.4 and 2.5

This proof is similar to the Proof of Theorems 3.1 and 3.4. Instead of Lemma 6.4, we use the following lemma.

Lemma 6.6

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Then, for any \(f\in A_1({\mathcal D})\), we have for the WRGA(co),

$$\begin{aligned} E(G_m)&\le E(G_{m-1} )+ \inf _{0\le \lambda \le 1}(-\lambda t_m (E(G_{m-1} )\nonumber \\&-E(f))+ 2\rho (E, 2\lambda )), \quad m=1,2,\ldots . \end{aligned}$$

Proof

We have

$$\begin{aligned} G_m := (1-\lambda _m)G_{m-1}+\lambda _m\varphi _m = G_{m-1}+\lambda _m(\varphi _m-G_{m-1}) \end{aligned}$$

and

$$\begin{aligned} E(G_m) = \inf _{0\le \lambda \le 1}E(G_{m-1}+\lambda (\varphi _m-G_{m-1})). \end{aligned}$$

As for (26), we have for any \(\lambda \),

$$\begin{aligned}&E(G_{m-1}+\lambda (\varphi _m-G_{m-1}))\nonumber \\&\quad \le E(G_{m-1}) - \lambda \langle -E'(G_{m-1}),\varphi _m-G_{m-1}\rangle + 2 \rho (E,2\lambda ), \end{aligned}$$
(29)

and by (R1) from the definition of the WRGA(co) and Lemma 6.2, we get

$$\begin{aligned}&\langle -E'(G_{m-1}),\varphi _m-G_{m-1}\rangle \ge t_m \sup _{g\in {\mathcal D}} \langle -E'(G_{m-1}),g-G_{m-1}\rangle \\&\quad = t_m\sup _{\phi \in A_1({\mathcal D})} \langle -E'(G_{m-1}),\phi -G_{m-1}\rangle \ge t_m \langle -E'(G_{m-1}),f-G_{m-1}\rangle . \end{aligned}$$

By (9), we obtain

$$\begin{aligned} \langle -E'(G_{m-1}),f-G_{m-1}\rangle \ge E(G_{m-1})-E(f). \end{aligned}$$

Thus,

$$\begin{aligned}&E(G_m) \le \inf _{0\le \lambda \le 1} E(G_{m-1}+ \lambda (\varphi _m-G_{m-1}))\nonumber \\&\quad \le E(G_{m-1}) + \inf _{0\le \lambda \le 1}(-\lambda t_m (E(G_{m-1})-E(f)) + 2\rho (E,2\lambda )), \end{aligned}$$
(30)

which proves the lemma.\(\square \)

The remaining part of the proof uses the inequality (30) in the same way relation (27) was used in the Proof of Theorems 3.1 and 3.4. The only additional difficulty here is that we are optimizing over \(0\le \lambda \le 1\). In the Proof of Theorem 2.4, we choose \(\theta =\alpha /8\), assuming that \(\alpha \) is small enough to guarantee that \(\theta \le \theta _0\), and take \(\lambda = \xi _m(\rho ,\tau ,\theta )/2\).

We proceed to the Proof of Theorem 2.5. Define

$$\begin{aligned} a_n:=E(G_n)-E(f). \end{aligned}$$

The sequence \(\{a_n\}\) is nonincreasing. If \(a_n\le 0\) for some \(n\le m\), then \(E(G_m)-E(f)\le 0\), which implies Theorem 2.5. Thus we assume that \(a_n>0\) for \(n\le m\). We obtain from Lemma 6.6,

$$\begin{aligned} a_m\le a_{m-1} +\inf _{0\le \lambda \le 1}(-\lambda t_ma_{m-1}+2\gamma (2\lambda )^q). \end{aligned}$$

We choose \(\lambda \) from the equation

$$\begin{aligned} \lambda t_m a_{m-1}= 4\gamma (2\lambda )^q \end{aligned}$$
(31)

if it is not greater than \(1\) and choose \(\lambda =1\) otherwise. The sequence \(\{a_k\}\) is monotone decreasing, and therefore we may choose \(\lambda =1\) only at the first \(n\) steps and then choose \(\lambda \) from (31). Then we get for \(k\le n\),

$$\begin{aligned} a_k\le a_{k-1}(1-t_k/2) \end{aligned}$$

and

$$\begin{aligned} a_n\le a_{0}\prod _{k=1}^n(1-t_k/2). \end{aligned}$$
(32)

For \(k>n\), we have

$$\begin{aligned} a_k\le a_{k-1}(1-\lambda t_k/2),\quad \lambda = \left( \frac{t_ka_{k-1}}{2^{2+q}\gamma }\right) ^{\frac{1}{q-1}}. \end{aligned}$$
(33)

As in the Proof of Theorem 3.4, we obtain, using Lemma 6.5,

$$\begin{aligned} \frac{1}{y_m}\ge \frac{1}{y_n} +\sum _{k=n+1}^m w_k,\quad y_k:=a_k^{\frac{1}{q-1}},\quad w_k:=\frac{t_k^p}{2(2^{2+q}\gamma )^{\frac{1}{q-1}}}. \end{aligned}$$

By (32), we get

$$\begin{aligned} \frac{1}{y_n} \ge \frac{1}{y_0}\prod _{k=1}^n(1-t_k/2)^{\frac{1}{1-q}}. \end{aligned}$$

Next,

$$\begin{aligned} \prod _{k=1}^n(1-t_k/2)^{\frac{1}{1-q}}&\ge \prod _{k=1}^n(1+t_k/2)^{\frac{1}{q-1}} \ge \prod _{k=1}^n(1+ t_k/2)\\&\ge 1+\frac{1}{2}\sum _{k=1}^n t_k\ge 1+\frac{1}{2}\sum _{k=1}^n t_k^p. \end{aligned}$$

Combining the above inequalities, we complete the proof.
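The \(O(m^{1-q})\) rate of Theorem 2.5 (here \(q=2\), i.e., \(O(1/m)\)) can be observed on a toy instance. Below is a sketch of the WRGA(co) with \(t_k=1\) for \(E(x)=\Vert f-x\Vert _2^2\) over a finite dictionary, with \(f\in A_1({\mathcal D})\); the setup is ours, and the constant in the final assertion is a generous empirical bound, not the constant of the theorem:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Dictionary of coordinate vectors in R^3; f is a convex combination of
# +-dictionary elements (|0.5| + |0.3| + |0.2| = 1), so f is in A_1(D)
# and inf E = 0.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f = [0.5, -0.3, 0.2]

def E(x):
    return sum((a - b) ** 2 for a, b in zip(f, x))

G = [0.0, 0.0, 0.0]
errs = [E(G)]
for m in range(1, 51):
    r = [a - b for a, b in zip(f, G)]      # -E'(G)/2 = f - G
    # Greedy step (t_m = 1) over the symmetrized dictionary {+-g}.
    phi = max((tuple(s * t for t in g)
               for g in dictionary for s in (1.0, -1.0)),
              key=lambda g: dot(r, g))
    # Relaxation step: G_m = (1 - lam) G + lam phi, exact line search on [0, 1].
    d = [p - a for p, a in zip(phi, G)]
    lam = max(0.0, min(1.0, dot(r, d) / dot(d, d)))
    G = [a + lam * b for a, b in zip(G, d)]
    errs.append(E(G))

assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errs, errs[1:]))  # monotone
assert errs[-1] <= 16.0 / 50          # loose empirical O(1/m)-type bound
```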

Proof of Theorems 4.2 and 4.3

We begin with an analog of Lemma 6.4.

Lemma 6.7

Let \(E\) be a uniformly smooth convex function with modulus of smoothness \(\rho (E,u)\). Take a number \(\epsilon \ge 0\) and an element \(f^\epsilon \) from \(D\) such that

$$\begin{aligned} E(f^\epsilon ) \le \inf _{x\in D}E(x)+ \epsilon ,\quad f^\epsilon /B \in A_1({\mathcal D}), \end{aligned}$$

with some number \(B\ge 1\). Then we have for the WGAFR(co),

$$\begin{aligned}&E(G_m)-E(f^\epsilon ) \le E(G_{m-1})-E(f^\epsilon )\\&\quad + \inf _{\lambda \ge 0}(-\lambda t_mB^{-1}(E(G_{m-1})-E(f^\epsilon )) + 2\rho (E,C_0\lambda )) \end{aligned}$$

for \( m=1, 2, \ldots .\)

Proof

By the definition of \(G_m\),

$$\begin{aligned} E(G_m)\le \inf _{\lambda \ge 0,w}E(G_{m-1}-wG_{m-1}+\lambda \varphi _m). \end{aligned}$$

As in the Proof of Lemma 6.4, we use Lemma 6.3:

$$\begin{aligned}&E(G_{m-1}+\lambda \varphi _m -wG_{m-1}) \le E(G_{m-1})\nonumber \\&\quad - \lambda \langle -E'(G_{m-1}),\varphi _m\rangle -w\langle E'(G_{m-1}),G_{m-1}\rangle + 2 \rho (E,\Vert \lambda \varphi _m -wG_{m-1}\Vert )\nonumber \\ \end{aligned}$$
(34)

and estimate

$$\begin{aligned}&\langle -E'(G_{m-1}),\varphi _m\rangle \ge t_m \sup _{g\in {\mathcal D}} \langle -E'(G_{m-1}),g\rangle \\&\quad =t_m\sup _{\phi \in A_1({\mathcal D})} \langle -E'(G_{m-1}),\phi \rangle \ge t_m B^{-1} \langle -E'(G_{m-1}),f^\epsilon \rangle . \end{aligned}$$

We set \(w^*:=\lambda t_mB^{-1}\) and obtain

$$\begin{aligned}&E(G_{m-1}-w^*G_{m-1}+\lambda \varphi _m)\nonumber \\&\quad \le E(G_{m-1})-\lambda t_mB^{-1}\langle -E'(G_{m-1}),f^\epsilon -G_{m-1}\rangle + 2 \rho (E,\Vert \lambda \varphi _m -w^*G_{m-1}\Vert ). \end{aligned}$$
(35)

By (9), we obtain

$$\begin{aligned} \langle -E'(G_{m-1}),f^\epsilon -G_{m-1}\rangle \ge E(G_{m-1})-E(f^\epsilon ). \end{aligned}$$

Thus,

$$\begin{aligned} E(G_m)&\le E(G_{m-1})+ \inf _{\lambda \ge 0}(-\lambda t_mB^{-1}(E(G_{m-1})-E(f^\epsilon )) \nonumber \\&\quad +\,2\rho (E,\Vert \lambda \varphi _m -w^*G_{m-1}\Vert )). \end{aligned}$$
(36)

We now estimate

$$\begin{aligned} \Vert w^*G_{m-1}-\lambda \varphi _m\Vert \le w^*\Vert G_{m-1}\Vert +\lambda . \end{aligned}$$

Next, \(E(G_{m-1})\le E(0)\), and therefore \(G_{m-1}\in D\). Our assumption that \(D\) is bounded implies that \(\Vert G_{m-1}\Vert \le C_1:= \text {diam}(D)\). Thus, under the assumption \(B\ge 1\), we get

$$\begin{aligned} w^*\Vert G_{m-1}\Vert \le C_1\lambda t_m \le C_1\lambda . \end{aligned}$$

Finally,

$$\begin{aligned} \Vert w^*G_{m-1}-\lambda \varphi _m\Vert \le C_0\lambda . \end{aligned}$$

This completes the Proof of Lemma 6.7.\(\square \)

By Remark 4.1, \(\{E(G_m)\}\) is a nonincreasing sequence. Therefore we have

$$\begin{aligned} \lim _{m\rightarrow \infty }E(G_m) =a. \end{aligned}$$

Define

$$\begin{aligned} b:=\inf _{x\in D}E(x),\quad \alpha := a-b. \end{aligned}$$

We prove that \(\alpha =0\) by contradiction. Assume to the contrary that \(\alpha >0\). Then, for any \(m\), we have

$$\begin{aligned} E(G_m)-b \ge \alpha . \end{aligned}$$

We set \(\epsilon =\alpha /2\) and find \(f^\epsilon \) such that

$$\begin{aligned} E(f^\epsilon ) \le b+\epsilon \quad \text {and}\quad f^\epsilon /B \in A_1({\mathcal D}), \end{aligned}$$

with some \(B\ge 1\). Then, by Lemma 6.7, we get

$$\begin{aligned} E(G_m)-E(f^\epsilon ) \le E(G_{m-1}) -E(f^\epsilon ) + \inf _{\lambda \ge 0} (-\lambda t_mB^{-1}\alpha /2 +2\rho (E,C_0\lambda )). \end{aligned}$$

Let us specify \(\theta :=\min \left( \theta _0,\frac{\alpha }{8BC_0}\right) \) and take \(\lambda = \xi _m(\rho ,\tau ,\theta )/C_0\). Then we obtain

$$\begin{aligned} E(G_m) \le E(G_{m-1}) -2\theta t_m\xi _m. \end{aligned}$$

The assumption

$$\begin{aligned} \sum _{m=1}^\infty t_m\xi _m =\infty \end{aligned}$$

brings a contradiction, which proves Theorem 4.2.

We proceed to the Proof of Theorem 4.3. Define

$$\begin{aligned} a_n:=E(G_n)-E(f^\epsilon ). \end{aligned}$$

By Lemma 6.7, we have

$$\begin{aligned} a_m \le a_{m-1}+\inf _{\lambda \ge 0} \left( -\frac{\lambda t_m a_{m-1}}{B} + 2\gamma (C_0\lambda )^q\right) . \end{aligned}$$
(37)

Choose \(\lambda \) from the equation

$$\begin{aligned} \frac{\lambda t_m a_{m-1}}{B} = 4\gamma (C_0\lambda )^q. \end{aligned}$$

The rest of the proof repeats the argument from the Proof of Theorem 3.4.