1 Introduction

Consider the minimization problem

$$\begin{aligned} \min _{\mathbf{x}\in X} \left\{ f(\mathbf{x})\equiv g(\mathbf{E}\mathbf{x})+ \left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle \right\} , \end{aligned}$$
(P)

where \(X \subseteq \mathbb {R}^n\) is a compact polyhedral set, \(\mathbf{E}\in \mathbb {R}^{m \times n}, \mathbf{b}\in \mathbb {R}^n\) and \(g: \mathbb {R}^m \rightarrow \mathbb {R}\) is strongly convex and continuously differentiable over \(\mathbb {R}^m\). Note that for a general matrix \(\mathbf{E}\), the function f is not necessarily strongly convex.

When the problem at hand is large-scale, first-order methods, which have a relatively low computational cost per iteration, are usually utilized. These methods include, for example, the class of projected (proximal) gradient methods. A drawback of these methods is that under general convexity assumptions they possess only a sublinear rate of convergence [2, 19], while a linear rate of convergence can be established only under additional conditions such as strong convexity of the objective function [19]. Luo and Tseng [21] show that the strong convexity assumption can be relaxed and replaced by an assumption on the existence of a local error bound, and that under this assumption certain classes of algorithms, which they refer to as “feasible descent methods”, converge at an asymptotically linear rate. The model (P), with strong convexity of g and compactness and polyhedrality of X, was shown in [21] to satisfy such an error bound. Wang and Lin [23] extend [21] and show that a global error bound exists for problem (P) under the additional assumption of compactness of X, and derive the exact linear rate for this case. Note that the family of “feasible descent methods” includes the block alternating minimization algorithm (under the assumption of block strong convexity), as well as gradient projection methods, and these methods are therefore usually at least as costly per iteration as evaluating the orthogonal projection operator onto the feasible set X.

An alternative to algorithms based on projection (or proximal) operators is given by linear-oracle-based algorithms such as the conditional gradient (CG) method. The CG algorithm was presented by Frank and Wolfe [8] in 1956 for minimizing a convex function over a compact polyhedral set. At each iteration, the algorithm requires the solution of a problem of minimizing a linear objective function over the feasible set. It is assumed that this solution is obtained by a call to a linear-oracle, i.e., a black box which, given a linear function, returns a minimizer of this linear function over the feasible set (see the exact definition in Sect. 2.3). In some instances, and specifically for certain types of polyhedral sets, a call to such a linear-oracle can be executed more efficiently than the computation of the orthogonal projection onto the feasible set (see examples in [10]), and therefore the CG algorithm has an advantage over projection-based algorithms. The original paper of Frank and Wolfe also contains a proof of an O(1/k) rate of convergence of the function values to the optimal value. Levitin and Polyak [18] show that this O(1/k) rate can also be extended to the case where the feasible set is a general compact convex set. Canon and Cullum [5] prove that this rate is in fact tight. However, if in addition to strong convexity of the objective function, the optimal solution lies in the interior of the feasible set, then a linear rate of convergence of the CG method can be established [12]. Epelman and Freund [7], as well as Beck and Teboulle [1], show a linear rate of convergence of the conditional gradient with a special stepsize choice in the context of finding a point in the intersection of an affine space and a closed convex set under a Slater-type assumption. Another setting in which a linear rate of convergence can be derived is when the feasible set is uniformly (strongly) convex and the norm of the gradient of the objective function is bounded away from zero [18].

Another approach for deriving a linear rate of convergence is to modify the algorithm. For example, Hazan and Garber [10] use local linear-oracles in order to show a linear rate of convergence of a “localized” version of the conditional gradient method. A different modification is to use a variation of the conditional gradient method that incorporates away steps. This version of the conditional gradient method, which we refer to as the away steps conditional gradient (ASCG) method, was initially suggested by Wolfe [24] and later studied by Guelat and Marcotte [12], who established a linear rate of convergence under the assumptions that the objective function is strongly convex and the set is a compact polytope, as well as an assumption on the location of the optimal solution. A modified version of the away step method, which calculates the away step on a simplex-type representation of the problem, was studied by Jaggi and Lacoste-Julien [16], who were able to show linear convergence for the more general model (P) in the case where \(\mathbf{b}=\mathbf{0}\), without restrictions on the location of the solution. Jaggi [15] proves that the convergence rate of any method which adds or removes only one atom at a time cannot be better than sublinear during the first n iterations, with n being the dimension of the space. Although the original ASCG suggested by Wolfe and the modified ASCG discussed in [16] are closely related, they have some significant differences. Specifically, as stated in the recent work of Freund, Grigas and Mazumder [9], in order to analyze the convergence of this new modification of the ASCG algorithm, it is required that the linear-oracle produce outputs which reside in a certain finite set of points (e.g., the set of extreme points in the case of a polytope), while the analysis of the original ASCG in [12] does not require such an assumption. In this paper we will refer to the variation presented in [16] as “the ASCG method”, and call its required oracle a vertex linear-oracle (see the discussion in Sect. 3.1).

Contribution. In this work, our starting point and main motivation are the results of Jaggi and Lacoste-Julien [16]. Our contribution is twofold:

  1. (a)

    We extend the results given in [16] and show that the ASCG algorithm converges linearly for the general case of problem (P), that is, for any value of \(\mathbf{E}\) and \(\mathbf{b}\).

    The additional linear term \(\langle \mathbf{b},\mathbf{x}\rangle \) enables us to consider much more general models. For example, consider the \(l_1\)-regularized least squares problem \(\min _{\mathbf{x}\in S} \{\left\| {\mathbf{B}\mathbf{x}-\mathbf{c}}\right\| _2^2+\lambda \Vert \mathbf{x}\Vert _1\},\) where \(S \subseteq \mathbb {R}^n\) is a compact polyhedral set, \(\mathbf{B}\in \mathbb {R}^{k \times n}, \mathbf{c}\in \mathbb {R}^k\) and \(\lambda >0\). Since S is compact, there exists a constant \(M>0\) for which \(\Vert \mathbf{x}\Vert _1 \le M\) for any optimal solution \(\mathbf{x}\). We can now rewrite the model as

    $$\begin{aligned} \min _{\mathbf{x}\in S, \Vert \mathbf{x}\Vert _1 \le y, y \in [0,M]} \left\| {\mathbf{B}\mathbf{x}-\mathbf{c}}\right\| ^2+\lambda y, \end{aligned}$$

    which fits the general model (P) with decision variable \((\mathbf{x},y)\in \mathbb {R}^{n+1}\), \(g(\mathbf{z})=\left\| {\mathbf{z}-\mathbf{c}}\right\| ^2\), \(\mathbf{E}=[\mathbf{B},\mathbf{0}]\) and \(\mathbf{b}=(\mathbf{0}^T,\lambda )^T\).

  2. (b)

    We present an analysis based on simple linear programming duality arguments, which are completely different than the geometric arguments in [16]. Consequently, we obtain a computable constant for the rate of convergence, which is explicitly expressed as a function of the problem’s parameters and the geometry of the feasible set. This constant, which we call “the vertex-facet distance constant”, replaces the so-called pyramidal width constant from [16], which reflects the geometry of the feasible set and is obtained as the optimal value of a very complex mixed integer saddle point optimization problem whose exact value is unknown even for simple polyhedral sets.

Paper outline. The paper is organized as follows. Section 2 presents some preliminary results and definitions needed for the analysis. In particular, it provides a brief introduction to the classical CG algorithm and linear oracles. Section 3 presents the ASCG algorithm and the convergence analysis, and is divided into four subsections. In Sect. 3.1 the concept of vertex linear-oracle, needed for the implementation of ASCG, is presented, and the difficulties of obtaining a vertex linear-oracle on a linear transformation of the feasible set are discussed. In Sect. 3.2 we present the ASCG method with different possible stepsize choices. In Sect. 3.3, we provide the rate of convergence analysis of the ASCG for problem (P), and present the new vertex-facet distance constant used in the analysis. In Sect. 3.4, we demonstrate how to compute this new constant for a few examples of simple polyhedral sets.

Notations. We denote the cardinality of set I by |I|. The difference, union and intersection of two given sets I and J are denoted by \(I{\setminus }J=\left\{ {a\in I: a\notin J}\right\} \), \(I\cup J\) and \(I\cap J\) respectively. Subscript indices represent elements of a vector, while superscript indices represent iterates of the vector, i.e., \(x_i\) is the \(i\hbox {th}\) element of vector \(\mathbf{x}\), \(\mathbf{x}^k\) is a vector at iteration k, and \(x^k_i\) is the \(i\hbox {th}\) element of \(\mathbf{x}^k\). The vector \(\mathbf{e}_i\in \mathbb {R}^n\) is the \(i\hbox {th}\) vector of the standard basis of \(\mathbb {R}^n\), \(\mathbf{0}\in \mathbb {R}^n\) is the all-zeros vector, and \(\mathbf{1}\in \mathbb {R}^n\) is the vector of all ones. Given two vectors \(\mathbf{x},\,\mathbf{y}\in \mathbb {R}^n\), their dot product is denoted by \(\left\langle {\mathbf{x}},{\mathbf{y}}\right\rangle \). Given a matrix \(\mathbf{A}\in \mathbb {R}^{m\times n}\) and vector \(\mathbf{x}\in \mathbb {R}^n\), \(\left\| {\mathbf{A}}\right\| \) denotes the spectral norm of \(\mathbf{A}\), and \(\left\| {\mathbf{x}}\right\| \) denotes the \(\ell _2\)-norm of \(\mathbf{x}\), unless stated otherwise. \(\mathbf{A}^T\), \({{\mathrm{rank}}}(\mathbf{A})\) and \({{\mathrm{Im}}}(\mathbf{A})\) represent the transpose, rank and image of \(\mathbf{A}\) respectively. We denote the \(i\hbox {th}\) row of a given matrix \(\mathbf{A}\) by \(\mathbf{A}_i\), and given a set \(I\subseteq \left\{ {1,\ldots ,m}\right\} \), \(\mathbf{A}_I\in \mathbb {R}^{|I|\times n}\) is the submatrix of \(\mathbf{A}\) such that \((\mathbf{A}_I)_{j}=\mathbf{A}_{I_j}\) for any \(j=1,\ldots , |I|\). If \(\mathbf{A}\) is a symmetric matrix, then \(\lambda _{\min }\left( {\mathbf{A}}\right) \) is its minimal eigenvalue. If a matrix \(\mathbf{A}\) is also invertible, we denote its inverse by \(\mathbf{A}^{-1}\). Given matrices \(\mathbf{A}\in \mathbb {R}^{n\times m}\) and \(\mathbf{B}\in \mathbb {R}^{n\times k}\), the matrix \([\mathbf{A},\mathbf{B}]\in \mathbb {R}^{n\times {(m+k)}}\) is their horizontal concatenation. Given a point \(\mathbf{x}\) and a closed convex set X, the distance between \(\mathbf{x}\) and X is denoted by \(d(\mathbf{x},X)=\min _{\mathbf{y}\in X}\left\| {\mathbf{x}-\mathbf{y}}\right\| \). The standard unit simplex in \(\mathbb {R}^n\) is denoted by \(\Delta _n=\left\{ {\mathbf{x}\in \mathbb {R}_+^n: \left\langle {\mathbf{1}},{\mathbf{x}}\right\rangle =1}\right\} \) and its relative interior by \(\Delta ^+_n=\left\{ {\mathbf{x}\in \mathbb {R}_{++}^n: \left\langle {\mathbf{1}},{\mathbf{x}}\right\rangle =1}\right\} \). Given a set \(X\subseteq \mathbb {R}^n\), its convex hull is denoted by \({{\mathrm{conv}}}(X)\). Given a convex set C, the set of all its extreme points is denoted by \({{\mathrm{ext}}}(C)\).

2 Preliminaries

2.1 Mathematical preliminaries

We start by presenting two technical lemmas. The first lemma is the well known descent lemma which is fundamental in convergence rate analysis of first-order methods. The second lemma is Hoffman’s lemma which is used in various error bound analyses over polyhedral sets.

Lemma 2.1

(The Descent Lemma [3, Proposition A.24]) Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a continuously differentiable function whose gradient is Lipschitz continuous with constant \(\rho \). Then for any \(\mathbf{x},\mathbf{y}\in \mathbb {R}^n\) we have

$$\begin{aligned} f(\mathbf{y})\le f(\mathbf{x})+\left\langle {\nabla f(\mathbf{x})},{\mathbf{y}-\mathbf{x}}\right\rangle +\frac{\rho }{2}\left\| {\mathbf{x}-\mathbf{y}}\right\| ^2. \end{aligned}$$

Lemma 2.2

(Hoffman’s Lemma [14]) Let X be a polyhedron defined by \(X=\left\{ {\mathbf{x}\in \mathbb {R}^n:\mathbf{A}\mathbf{x}\le \mathbf{a}}\right\} \), for some \(\mathbf{A}\in \mathbb {R}^{m\times n}\) and \(\mathbf{a}\in \mathbb {R}^{m}\), and let \(S=\left\{ {\mathbf{x}\in \mathbb {R}^n:\tilde{\mathbf{E}}\mathbf{x}=\tilde{\mathbf{e}}}\right\} \) where \(\tilde{\mathbf{E}}\in \mathbb {R}^{r\times n}\) and \(\tilde{\mathbf{e}}\in \mathbb {R}^r\). Assume that \(X\cap S\ne \emptyset \). Then, there exists a constant \(\theta \), depending only on \(\mathbf{A}\) and \(\tilde{\mathbf{E}}\), such that any \(\mathbf{x}\in X\) satisfies

$$\begin{aligned} d{(\mathbf{x},X\cap S)}\le \theta \left\| {\tilde{\mathbf{E}}\mathbf{x}-\tilde{\mathbf{e}}}\right\| . \end{aligned}$$

A complete and simple proof of this lemma is given in [13, pg. 299–301]. Defining \({\mathcal {B}}\) as the set of all matrices constructed by taking linearly independent rows from the matrix \(\left[ \tilde{\mathbf{E}}^T,\mathbf{A}^T\right] ^T\), we can write \(\theta \) as

$$\begin{aligned} \theta =\max _{\mathbf{B}\in {\mathcal {B}}}\frac{1}{\lambda _{\min }\left( {\mathbf{B}\mathbf{B}^T}\right) }. \end{aligned}$$

We will refer to \(\theta \) as the Hoffman constant associated with matrix \(\left[ \tilde{\mathbf{E}}^T,\mathbf{A}^T\right] ^T\).
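For small systems, \(\theta \) can be evaluated directly from the formula above by enumerating the subsets of linearly independent rows. The following Python sketch is only an illustration (it assumes NumPy is available, the function name is ours, and the enumeration is exponential in the number of rows).

import itertools
import numpy as np

def hoffman_constant(M, tol=1e-10):
    """Brute-force evaluation of the Hoffman constant associated with the stacked
    matrix M (playing the role of [E~^T, A^T]^T), following the formula above:
    enumerate all subsets of linearly independent rows B and take the maximum of
    1 / lambda_min(B B^T). Practical only for very small matrices."""
    m = M.shape[0]
    theta = 0.0
    for r in range(1, m + 1):
        for rows in itertools.combinations(range(m), r):
            B = M[list(rows), :]
            if np.linalg.matrix_rank(B, tol=tol) < r:
                continue  # the selected rows are linearly dependent; skip them
            lam_min = np.linalg.eigvalsh(B @ B.T)[0]  # smallest eigenvalue of BB^T
            theta = max(theta, 1.0 / lam_min)
    return theta

# Tiny example: A describes the unit box in R^2 and E~ is a single equality row.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
E_tilde = np.array([[1.0, 1.0]])
print(hoffman_constant(np.vstack([E_tilde, A])))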

2.2 Problem properties

Throughout this paper we make the following assumption regarding problem (P).

Assumption 1

  1. (a)

    f is continuously differentiable and has a Lipschitz continuous gradient with constant \(\rho \).

  2. (b)

    g is strongly convex with parameter \(\sigma _g\).

  3. (c)

    X is a nonempty compact polyhedral set given by \(X=\{\mathbf{x}\in \mathbb {R}^n: \mathbf{A}\mathbf{x}\le \mathbf{a}\}\) for some \(\mathbf{A}\in \mathbb {R}^{m\times n}\), \(\mathbf{a}\in \mathbb {R}^m\).

We assume without loss of generality that \(\mathbf{A}\) does not contain a row of zeros. We denote the optimal solution set of problem (P) by \(X^*\). The diameter of the compact set X is denoted by D, and the diameter of the set \(\mathbf{E}X\) (the diameter of the image of X under the linear mapping associated with matrix \(\mathbf{E}\)) by \(D_\mathbf{E}\). The two diameters satisfy the following relation:

$$\begin{aligned} D_\mathbf{E}=\max _{\mathbf{x},\mathbf{y}\in X}\left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{y}}\right\| \le \left\| {\mathbf{E}}\right\| \max _{\mathbf{x},\mathbf{y}\in X }\left\| {\mathbf{x}-\mathbf{y}}\right\| =\left\| {\mathbf{E}}\right\| D. \end{aligned}$$

We define \(G\equiv \max _{\mathbf{x}\in X}\left\| {\nabla g(\mathbf{E}\mathbf{x})}\right\| \) to be the maximal norm of the gradient of g over \(\mathbf{E}X\).
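When the vertex set of X is available explicitly, the diameters D and \(D_\mathbf{E}\) can be evaluated by brute force, since the maximum of a convex function over \(X\times X\) is attained at a pair of vertices. The following sketch is an illustration of ours (assuming NumPy; the function name is hypothetical).

import numpy as np

def diameters(vertices, E):
    """Brute-force evaluation of D and D_E for a polytope given by the list of its
    vertices (feasible only when the vertex set is small)."""
    V = np.asarray(vertices, dtype=float)
    diffs = V[:, None, :] - V[None, :, :]
    D = np.max(np.linalg.norm(diffs, axis=2))
    EV = V @ np.asarray(E, dtype=float).T
    D_E = np.max(np.linalg.norm(EV[:, None, :] - EV[None, :, :], axis=2))
    return D, D_E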

Problem (P) possesses some properties, which we present in the following lemmas.

Lemma 2.3

(Lemma 14, [23]) Let \(X^*\) be the optimal set of problem (P). Then, there exists a constant vector \(\mathbf{t}^*\) and a scalar \(s^*\) such that any optimal solution \(\mathbf{x}^*\in X^*\) satisfies \(\mathbf{E}\mathbf{x}^*=\mathbf{t}^*\) and \(\left\langle {\mathbf{b}},{\mathbf{x}^*}\right\rangle =s^*\).

Although the proof of the lemma in the given reference is for polyhedral sets, the extension for any convex set is trivial.

Lemma 2.4

Let \(f^*\) be the optimal value of problem (P). Then, for any \(\mathbf{x}\in X\)

$$\begin{aligned} f(\mathbf{x})-f^*\le {C} \end{aligned}$$

where \(C=GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D\).

Proof

Let \(\mathbf{x}^*\) be some optimal solution of problem (P), so that \(f(\mathbf{x}^*)=f^*\). Then for any \(\mathbf{x}\in X\), it follows from the convexity of f that

$$\begin{aligned} \begin{aligned} f(\mathbf{x})-f(\mathbf{x}^*)&\le \left\langle {\nabla f(\mathbf{x})},{\mathbf{x}-\mathbf{x}^*}\right\rangle \\&=\left\langle {\nabla g(\mathbf{E}\mathbf{x})},{\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\rangle +\left\langle {\mathbf{b}},{\mathbf{x}-\mathbf{x}^*}\right\rangle \\&\le \left\| {\nabla g(\mathbf{E}\mathbf{x})}\right\| \left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\| +\left\| {\mathbf{b}}\right\| \left\| {\mathbf{x}-\mathbf{x}^*}\right\| \\&\le GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D= C \end{aligned} \end{aligned}$$

where the last two inequalities are due to the Cauchy-Schwarz inequality and the definitions of G, D and \(D_\mathbf{E}\). \(\square \)

The following lemma provides an error bound, i.e., a bound on the distance of any feasible solution from the optimal set. This error bound will later be used as an alternative to a strong convexity assumption on f, which is usually needed in order to prove a linear rate of convergence. The bound is different from the one given in [23], since it relies heavily on the compactness of the set X, which makes it possible to circumvent the use of the so-called gradient mapping.

Lemma 2.5

For any \(\mathbf{x}\in X\),

$$\begin{aligned} d(\mathbf{x},X^*)^2\le \kappa (f(\mathbf{x})-f^*), \end{aligned}$$

where \(\kappa =\theta ^2 \left( \left\| {\mathbf{b}}\right\| D+3GD_\mathbf{E}+\frac{2(G^2+1)}{\sigma _g}\right) \), and \(\theta \) is the Hoffman constant associated with the matrix \(\left[ \mathbf{A}^T,\mathbf{E}^T,\mathbf{b}\right] ^T\).

Proof

Lemma 2.3 implies that the optimal solution set \(X^*\) can be defined as \(X^*=X\cap S\) where \(S=\left\{ {\mathbf{x}\in \mathbb {R}^n: \mathbf{E}\mathbf{x}=\mathbf{t}^*,\;\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle =s^*}\right\} \) for some \(\mathbf{t}^* \in \mathbb {R}^m\) and \(s^* \in \mathbb {R}\). For any \(\mathbf{x}\in X\), applying Lemma 2.2 with \(\tilde{\mathbf{E}}=\left[ {\mathbf{E}}^T,\mathbf{b}\right] ^T\), we have that

$$\begin{aligned} d(\mathbf{x},X^*)^2&\le \theta ^2 (\left( \left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*\right) ^2+\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| ^2), \end{aligned}$$
(2.1)

where \(\theta \) is the Hoffman constant associated with matrix \(\left[ \mathbf{A}^T,\mathbf{E}^T,\mathbf{b}\right] ^T\). Now, let \(\mathbf{x}\in X\) and \(\mathbf{x}^* \in X^*\). Utilizing the \(\sigma _g\)-strong convexity of g, it follows that

$$\begin{aligned} \left\langle {\nabla g(\mathbf{E}\mathbf{x}^*)},{\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\rangle +\frac{\sigma _g}{2}\left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\| ^2 \le g(\mathbf{E}\mathbf{x})-g(\mathbf{E}\mathbf{x}^*). \end{aligned}$$
(2.2)

By the first-order optimality conditions for problem (P), we have (recalling that \(\mathbf{x}\in X\) and \(\mathbf{x}^*\in X^*\))

$$\begin{aligned} \left\langle {\nabla f(\mathbf{x}^*)},{\mathbf{x}-\mathbf{x}^*}\right\rangle \ge 0. \end{aligned}$$
(2.3)

Therefore,

$$\begin{aligned} \frac{\sigma _g}{2}\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| ^2&\le \left\langle {\nabla f(\mathbf{x}^*)},{\mathbf{x}-\mathbf{x}^*}\right\rangle +\frac{\sigma _g}{2}\left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\| ^2\nonumber \\&=\left\langle {\nabla g(\mathbf{E}\mathbf{x}^*)},{\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\rangle +\left\langle {\mathbf{b}},{\mathbf{x}-\mathbf{x}^*}\right\rangle +\frac{\sigma _g}{2} \left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\| ^2.\nonumber \\ \end{aligned}$$
(2.4)

Now, using (2.2) we can continue (2.4) to obtain

$$\begin{aligned} \frac{\sigma _g}{2}\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| ^2\le g(\mathbf{E}\mathbf{x})-g(\mathbf{E}\mathbf{x}^*)+\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -\left\langle {\mathbf{b}},{\mathbf{x}^*}\right\rangle =f(\mathbf{x})-f(\mathbf{x}^*). \end{aligned}$$
(2.5)

We are left with the task of upper bounding \((\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*)^2\). By the definitions of \(s^*\) and f we have that

$$\begin{aligned} \left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*= & {} \left\langle {\mathbf{b}},{\mathbf{x}-\mathbf{x}^*}\right\rangle \nonumber \\= & {} \left\langle {\nabla f(\mathbf{x}^*)},{\mathbf{x}-\mathbf{x}^*}\right\rangle -\left\langle {\nabla g(\mathbf{E}\mathbf{x}^*)},{\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\rangle \nonumber \\= & {} \left\langle {\nabla f(\mathbf{x}^*)},{\mathbf{x}-\mathbf{x}^*}\right\rangle - \left\langle {\nabla g(\mathbf{t}^*)},{\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\rangle . \end{aligned}$$
(2.6)

Therefore, using (2.3), (2.6) as well as the Cauchy-Schwarz inequality, we can conclude the following:

$$\begin{aligned} s^*-\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle \le \left\langle {\nabla g(\mathbf{t}^*)},{\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\rangle \le \left\| {\nabla g(\mathbf{t}^*)}\right\| \left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| . \end{aligned}$$
(2.7)

On the other hand, exploiting (2.6), the convexity of f and the Cauchy-Schwarz inequality, we also have that

$$\begin{aligned} \left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*= & {} \left\langle {\nabla f(\mathbf{x}^*)},{\mathbf{x}-\mathbf{x}^*}\right\rangle - \left\langle {\nabla g(\mathbf{t}^*)},{\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\rangle \nonumber \\\le & {} f(\mathbf{x})-f^*-\left\langle {\nabla g(\mathbf{t}^*)},{\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\rangle \nonumber \\\le & {} f(\mathbf{x})-f^*+\left\| {\nabla g(\mathbf{t}^*)}\right\| \left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| . \end{aligned}$$
(2.8)

Combining (2.7), (2.8), and the fact that \(f(\mathbf{x})-f^*\ge 0\), we obtain that

$$\begin{aligned} (\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*)^2\le \left( f(\mathbf{x})-f^*+\left\| {\nabla g(\mathbf{t}^*)}\right\| \left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| \right) ^2. \end{aligned}$$
(2.9)

Moreover, the definitions of G and \(D_\mathbf{E}\) imply \(\left\| {\nabla g(\mathbf{t}^*)}\right\| \le G\), \(\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| \le D_\mathbf{E}\), and since \(\mathbf{x}\in X\), it follows from Lemma 2.4 that \(f(\mathbf{x})-f^*\le C=GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D\). Utilizing these bounds, as well as (2.5) to bound (2.9) results in

$$\begin{aligned} (\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*)^2\le & {} \left( f(\mathbf{x})-f^*+G\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| \right) ^2\nonumber \\= & {} (f(\mathbf{x})-f^*)^2+2G\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| (f(\mathbf{x})-f^*)+G^2\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| ^2\nonumber \\\le & {} (f(\mathbf{x})-f^*)C+2GD_\mathbf{E}(f(\mathbf{x})-f^*)+G^2\frac{2}{\sigma _g}(f(\mathbf{x})-f^*)\nonumber \\= & {} (f(\mathbf{x})-f^*)\left( C+2GD_\mathbf{E}+\frac{2G^2}{\sigma _g}\right) \nonumber \\= & {} (f(\mathbf{x})-f^*)\left( \left\| {\mathbf{b}}\right\| D+3GD_\mathbf{E}+\frac{2G^2}{\sigma _g}\right) . \end{aligned}$$
(2.10)

Plugging (2.5) and (2.10) back into (2.1), we obtain the desired result:

$$\begin{aligned} d(\mathbf{x},X^*)^2&\le \theta ^2 \left( \left\| {\mathbf{b}}\right\| D+3GD_\mathbf{E}+\frac{2(G^2+1)}{\sigma _g}\right) (f(\mathbf{x})-f^*). \end{aligned}$$

\(\square \)

2.3 Conditional gradient and linear oracles

In order to present the CG algorithm, we first define the concept of linear oracles.

Definition 2.1

(Linear Oracle) Given a set X, an operator \({\mathcal {O}}_X:\mathbb {R}^n\rightarrow X\) is called a linear oracle for X, if for each \(\mathbf{c}\in \mathbb {R}^n\) it returns a vector \(\mathbf{p}\in X\) such that \(\left\langle {\mathbf{c}},{\mathbf{p}}\right\rangle \le \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) for any \(\mathbf{x}\in X\), i.e., \(\mathbf{p}\) is a minimizer of the linear function \(\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) over X.

Linear oracles are black-box type functions, where the actual algorithm used in order to obtain the minimizer is unknown. For many feasible sets, such as \(\ell _p\) balls and specific polyhedral sets, the oracle can be represented by a closed form solution or can be computed by an efficient method.
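As an illustration, the following Python sketch (ours; it assumes NumPy, and the function names are hypothetical) gives closed-form linear oracles for the \(\ell _1\) ball and for a box.

import numpy as np

def linear_oracle_l1_ball(c, radius=1.0):
    """Linear oracle for the l1 ball {x : ||x||_1 <= radius}: a minimizer of <c, x>
    can always be taken as a signed, scaled standard basis vector."""
    i = int(np.argmax(np.abs(c)))
    p = np.zeros(len(c))
    p[i] = -radius * np.sign(c[i]) if c[i] != 0 else radius
    return p

def linear_oracle_box(c, lower, upper):
    """Linear oracle for the box {x : lower <= x <= upper}: pick the lower bound
    where c is positive and the upper bound otherwise."""
    return np.where(np.asarray(c) > 0, lower, upper).astype(float)

c = np.array([-1.0, 1.0, 3.0])
print(linear_oracle_l1_ball(c))                       # e.g. [0., 0., -1.]
print(linear_oracle_box(c, -np.ones(3), np.ones(3)))  # [ 1., -1., -1.]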

The CG algorithm and its variants are linear-oracle-based algorithms. The original CG algorithm, presented in [8] and also known as the Frank-Wolfe algorithm, is as follows.

[Algorithm: the conditional gradient (CG) method]
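As a complement to the algorithm box, the following Python sketch (ours, not the original pseudocode; it assumes NumPy and that the gradient of f and a linear oracle are supplied as callables) outlines the classical CG iteration with an adaptive stepsize.

import numpy as np

def conditional_gradient(grad_f, linear_oracle, x0, lipschitz_rho, num_iters=100):
    """Generic sketch of the classical CG (Frank-Wolfe) iteration.
    grad_f:        callable returning the gradient of f at a given point
    linear_oracle: callable returning a minimizer of <c, x> over X for a given c
    x0:            a feasible starting point
    lipschitz_rho: Lipschitz constant of the gradient (used by the adaptive stepsize)"""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)
        p = linear_oracle(g)                  # call to the linear oracle
        d = p - x                             # CG direction
        denom = lipschitz_rho * np.dot(d, d)
        if denom <= 1e-16:                    # d = 0: the current point is optimal
            break
        gamma = min(max(-np.dot(g, d) / denom, 0.0), 1.0)   # adaptive stepsize in [0, 1]
        x = x + gamma * d
    return x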

The algorithm is guaranteed to have an \(O(\tfrac{1}{k})\) rate of convergence when the stepsize is determined according to exact line search [8], an adaptive rule [18], which has a closed form when the smoothness constant of the objective function is known, or a predetermined schedule [6]. This upper bound on the rate of convergence is tight [5], and therefore variants, such as the ASCG, were developed.

3 Away steps conditional gradient

The ASCG algorithm was proposed by Wolfe in [24]. A linear convergence rate was proven in [12] for problems of minimizing a strongly convex objective function over a polyhedral feasible set, under some restrictions on the location of the optimal solution. Jaggi and Lacoste-Julien [16] suggested a variation on the original ASCG algorithm, which assumes that a finite representation of the current point is maintained throughout the algorithm, and this representation is used to calculate the away step. The authors then showed that under this modification, it is possible to prove a linear convergence result without a restriction on the optimal solution’s location; this result is also applicable to the specific case of problem (P) where \(\mathbf{b}=\mathbf{0}\) [or more generally \(\mathbf{b}\in {{\mathrm{Im}}}(\mathbf{E})\)], provided that an appropriate linear-oracle is available for the set \(\mathbf{E}X\). From this point on, ASCG refers to this latter variant; a thorough explanation of the difference between the two algorithms can be found in [9]. In this section, we extend this result to the general case of problem (P), i.e., to any \(\mathbf{E}\) and \(\mathbf{b}\). Furthermore, we explore the potential issues with obtaining a linear-oracle for the set \(\mathbf{E}X\), and suggest an alternative analysis, which only assumes the existence of an appropriate linear-oracle on the original set X. Moreover, our analysis differs from the one presented in [16] in that it is based on duality rather than geometric arguments. This approach enables us to derive a computable constant for the rate of convergence, which is explicitly expressed as a function of the problem’s parameters and the geometry of the feasible set.

We separate the discussion on the ASCG method into four sections. In Sect. 3.1 we define the concept of vertex linear oracles, and the issues of obtaining such an oracle for linear transformations of simple sets. Section 3.2 contains a full description of the ASCG method itself, including the concept of vertex representation, and representation reduction. In Sect. 3.3 we present the rate of convergence analysis of the ASCG method for problem (P), as well as introduce the new computable convergence constant \(\Omega _X\). Finally, in Sect. 3.4 we demonstrate how to compute \(\Omega _X\) for three types of simple sets.

3.1 Vertex linear oracles

In order to prove convergence, the ASCG algorithm requires a linear oracle whose output lies within a finite set of points in X; more specifically, in the case of a polytope, it is natural to assume a vertex linear oracle, a concept that we now define explicitly.

Definition 3.1

(Vertex Linear Oracle) Given a polyhedral set X with vertex set V, a linear oracle \(\tilde{{\mathcal {O}}}_X:\mathbb {R}^n\rightarrow V\) is called a vertex linear oracle for X, if for each \(\mathbf{c}\in \mathbb {R}^n\) it returns a vertex \(\mathbf{p}\in V\) such that \(\left\langle {\mathbf{c}},{\mathbf{p}}\right\rangle \le \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) for any \(\mathbf{x}\in X\).

Notice that, according to the fundamental theorem of linear programming [4, Theorem 2.7], the problem of optimizing any linear objective function over the compact set X always has an optimal solution which is a vertex. Therefore, the vertex linear oracle \(\tilde{\mathcal {O}}_X\) is well defined. We also note that in this paper the term “vertex” is synonymous with the term “extreme point”.
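When the vertex set V is small and explicitly available, a vertex linear oracle can be implemented by direct enumeration, as in the following sketch (ours, assuming NumPy).

import numpy as np

def vertex_linear_oracle(c, vertices):
    """Vertex linear oracle by enumeration: returns a vertex minimizing <c, x>
    over the (explicitly given) vertex set of X."""
    V = np.asarray(vertices, dtype=float)
    return V[int(np.argmin(V @ np.asarray(c, dtype=float)))]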

In [16], Jaggi and Lacoste-Julien proved that both the CG and the ASCG algorithms are affine invariant. This means that given the problem

$$\begin{aligned} \min _{\mathbf{x}\in X}g(\mathbf{E}\mathbf{x}), \end{aligned}$$
(3.1)

where g is a strongly convex function and \(\mathbf{E}\) is some matrix, applying the ASCG algorithm on the equivalent problem

$$\begin{aligned} \min _{\mathbf{y}\in Y} g(\mathbf{y}), \end{aligned}$$
(3.2)

where \(Y=\mathbf{E}X\), yields a linear rate of convergence, which depends only on the strong convexity parameter of g and the geometry of the set Y (regardless of what \(\mathbf{E}\) generated it). However, assuming that \(\mathbf{E}\) does not have full column rank, i.e., f is not strongly convex, retrieving an optimal solution \(\mathbf{x}^*\in X\) from the optimal solution \(\mathbf{y}^*\in Y\) requires solving a linear feasibility problem, where we aim to find \(\mathbf{x}^*\in X\) such that \(\mathbf{E}\mathbf{x}^*=\mathbf{y}^*\). This feasibility problem is equivalent to solving the following constrained least squares problem:

$$\begin{aligned} \min _{\mathbf{x}\in X}\left\| {\mathbf{E}\mathbf{x}-\mathbf{y}^*}\right\| ^2 \end{aligned}$$

in the sense that the optimal set of the above problem comprises all vectors satisfying \(\mathbf{x}^*\in X, \mathbf{E}\mathbf{x}^*=\mathbf{y}^*\). For a general \(\mathbf{E}\), solving the least squares problem might be more computationally expensive than simply applying the linear oracle on set X. Moreover, the entire algorithm is then applied on the transformed set \(Y=\mathbf{E}X\), which might be geometrically more complicated than the set X, or may lie in a higher dimension.
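To make the retrieval step concrete, the following sketch (ours, not part of the ASCG method itself; it assumes the CVXPY package is available) solves the constrained least squares problem above for a polyhedron \(X=\{\mathbf{x}:\mathbf{A}\mathbf{x}\le \mathbf{a}\}\).

import cvxpy as cp

def retrieve_x_from_y(E, A, a, y_star):
    """Given y* in EX, recover some x* in X with E x* = y* by solving the
    constrained least-squares problem min_{A x <= a} ||E x - y*||^2.
    Any QP solver over the polyhedron X would do; CVXPY is used for brevity."""
    n = E.shape[1]
    x = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(E @ x - y_star)), [A @ x <= a])
    problem.solve()
    return x.value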

It is interesting to note that the vertex oracle property is not preserved when changing the representation. As an example, let us take a naive approach to construct a general linear oracle \({\mathcal {O}}_{\mathbf{E}X}\), given \(\tilde{\mathcal {O}}_X\), by the formula

$$\begin{aligned} {\mathcal {O}}_{\mathbf{E}X}(\mathbf{c})=\mathbf{E}\tilde{\mathcal {O}}_{X}(\mathbf{E}^T\mathbf{c}). \end{aligned}$$
(3.3)

However, the output \(\tilde{\mathbf{p}}={\mathcal {O}}_{\mathbf{E}X}(\mathbf{c})\) of this linear oracle is not guaranteed to be a vertex of \(\mathbf{E}X\), and the construction incurs an additional computational cost associated with the matrix-vector multiplications. To illustrate, take X to be the unit box in three dimensions, \(X=[-1,1]^3\subseteq \mathbb {R}^3\), and let \(\mathbf{E}\) be given by

$$\begin{aligned} \mathbf{E}=\begin{bmatrix}1&\quad 1&\quad 1\\ 1&\quad 1&\quad -1\\ 0&\quad 0&\quad 2\end{bmatrix}. \end{aligned}$$

We denote the vertex set V of the set X by the letters A–H as follows:

$$\begin{aligned} A&=(1,1,1)^T,\;&B&=(1,1,-1)^T,\;&C&=(1,-1,-1)^T,\;&D&=(1,-1,1)^T,\\ E&=(-1,1,1)^T,\;&F&=(-1,-1,1)^T,\;&G&=(-1,1,-1)^T,\;&H&=(-1,-1,-1)^T, \end{aligned}$$

and the linear mappings of these vertices by the matrix \(\mathbf{E}\) by A’–H’:

$$\begin{aligned} A'&=(3,1,2)^T,\;&B'&=(1,3,-2)^T,\;&C'&=G'=(-1,1,-2)^T,\\ F'&=(-1,-3,2)^T,\;&H'&=(-3,-1,-2)^T,\;&D'&=E'=(1,-1,2)^T. \end{aligned}$$

The vertex set of \(\mathbf{E}X\) is \({{\mathrm{ext}}}(\mathbf{E}X)=\{A',B',F',H'\}\).

The sets X and \(\mathbf{E}X\) are presented in Fig. 1. Notice that finding a vertex linear oracle for X is trivial, while finding one for \(\mathbf{E}X\) is not. In particular, a vertex linear oracle for X may be given by any operator \(\tilde{\mathcal {O}}_X(\cdot )\) satisfying

$$\begin{aligned} \tilde{\mathcal {O}}_X(\mathbf{c})\in {\mathop {{{\mathrm{\arg \!\min }}}}\limits _{{\mathbf{x}\in V}}}\left\{ {\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle }\right\} =\left\{ {\mathbf{x}\in \{-1,1\}^3:x_ic_i=-|c_i|,\,\forall i=1,2,3}\right\} ,\quad \forall \; \mathbf{c}\in \mathbb {R}^3. \end{aligned}$$
(3.4)

Given the vector \(\mathbf{c}=(-1,\;1,\;3)^T\), we want to find

$$\begin{aligned} \mathbf{p}\in {\mathop {{{\mathrm{\arg \!\min }}}}\limits _{\mathbf{y}\in {{\mathrm{ext}}}(\mathbf{E}X)}} \left\langle {\mathbf{c}},{\mathbf{y}}\right\rangle . \end{aligned}$$
Fig. 1 The sets X and \(\mathbf{E}X\) (figure not reproduced)

Using the naive approach described in (3.3), we obtain a vertex of X by applying the vertex linear oracle \(\tilde{\mathcal {O}}_X\) described in (3.4) with parameter \(\mathbf{E}^T\mathbf{c}=(0,\; 0 ,\; 4)^T\), which may return any one of the vertices B, C, G or H. If vertex C is returned, then its image C’ is not a vertex of \(\mathbf{E}X\).
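The computations in this example are easily reproduced; the following sketch (ours, assuming NumPy) verifies that \(\mathbf{E}^T\mathbf{c}=(0,0,4)^T\) and that C’ is the midpoint of the segment between B’ and H’, hence not an extreme point of \(\mathbf{E}X\).

import numpy as np

E = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, -1.0],
              [0.0, 0.0, 2.0]])
c = np.array([-1.0, 1.0, 3.0])

print(E.T @ c)                                   # [0. 0. 4.]: the vector passed to the oracle on X

# The box oracle may return C = (1, -1, -1); its image under E is C'.
C = np.array([1.0, -1.0, -1.0])
C_image = E @ C                                  # C' = (-1, 1, -2)

# C' is the midpoint of B' = E(1, 1, -1) and H' = E(-1, -1, -1),
# so it is not an extreme point of EX.
B_image = E @ np.array([1.0, 1.0, -1.0])
H_image = E @ np.array([-1.0, -1.0, -1.0])
print(C_image, 0.5 * (B_image + H_image))        # both equal [-1.  1. -2.]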

We aim to show that, given a vertex linear oracle for X, the ASCG algorithm converges at a linear rate for the more general problem (P). Since in our analysis we do not assume the existence of a linear oracle for \(\mathbf{E}X\), but rather a vertex linear oracle for X, the computational cost per iteration related to the linear oracle is independent of the matrix \(\mathbf{E}\), and depends only on the geometry of X.

3.2 The ASCG method

We will now present the ASCG algorithm. In the following we denote the vertex set of X as \(V={{\mathrm{ext}}}(X)\). Moreover, as part of the ASCG algorithm, at each iteration k the iterate \(\mathbf{x}^k\) is represented as a convex combination of points in V. Specifically, \(\mathbf{x}^k\) is assumed to have the representation

$$\begin{aligned} \mathbf{x}^k=\sum _{\mathbf{v}\in V} \mu ^k_\mathbf{v}\mathbf{v}, \end{aligned}$$

where \({\varvec{\mu }}^k\in \Delta _{|V|}\). Let \(U^k=\left\{ {\mathbf{v}\in V:\mu ^k_\mathbf{v}>0}\right\} \); then \(U^k\) and \(\left\{ {\mu ^k_{\mathbf{v}}}\right\} _{\mathbf{v}\in U^k}\) provide a compact representation of \(\mathbf{x}^k\), and \(\mathbf{x}^k\) lies in the relative interior of the set \({{\mathrm{conv}}}(U^k)\). Throughout the algorithm we update \(U^k\) and \({\varvec{\mu }}^k\) via the vertex representation updating (VRU) scheme. The ASCG method has two types of updates: a forward step, used in the classical CG algorithm, where a vertex is added to the representation, and an away step, unique to this algorithm, in which the coefficient of one of the vertices used in the representation is reduced or even nullified. Specifically, the away step uses the direction \((\mathbf{x}^k-\mathbf{u}^k)\) where \(\mathbf{u}^k\in U^k\) and stepsize \(\gamma ^k>0\) so that

$$\begin{aligned} \begin{aligned} \mathbf{x}^{k+1}&=\mathbf{x}^k+\gamma ^k(\mathbf{x}^k-\mathbf{u}^k)\\&=\left( \mathbf{x}^k-\mu ^k_{\mathbf{u}^k}\mathbf{u}^k\right) (1+\gamma ^k)+\left( \mu ^k_{\mathbf{u}^k}-\gamma ^k\left( 1-\mu ^k_{\mathbf{u}^k}\right) \right) \mathbf{u}^k\\&=\sum _{\mathbf{v}\in U^k{\setminus }\left\{ {\mathbf{u}^k}\right\} }(1+\gamma ^k)\mu ^k_{\mathbf{v}}\mathbf{v}+\left( \mu ^k_{\mathbf{u}^k}(1+\gamma ^k)-\gamma ^k\right) \mathbf{u}^k,\\ \end{aligned} \end{aligned}$$

and so \(\mu ^{k+1}_{\mathbf{u}^k}=\mu ^k_{\mathbf{u}^k}-\gamma ^k\left( 1-\mu ^k_{\mathbf{u}^k}\right) <\mu ^k_{\mathbf{u}^k}\). Moreover, if \(\gamma ^k=\frac{\mu ^k_{\mathbf{u}^k}}{1-\mu ^k_{\mathbf{u}^k}}\), then \(\mu ^{k+1}_{\mathbf{u}^k}\) is nullified, and consequently, the vertex \(\mathbf{u}^k\) is removed from the representation. This vertex removal is referred to as a drop step.

The full description of the ASCG algorithm and the VRU scheme is given as follows.

[Algorithm: the away steps conditional gradient (ASCG) method]
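To complement the algorithm box, the following Python sketch (ours, not the original pseudocode; it assumes NumPy) illustrates a single ASCG-style iteration as described above, using the adaptive stepsize given in (3.5) below and a dictionary-based vertex representation; the representation reduction procedure \({\mathcal {R}}\) and other bookkeeping details are omitted.

import numpy as np

def ascg_step(grad_f, vertex_oracle, x, mu, lipschitz_rho):
    """One simplified ASCG iteration with the adaptive stepsize of (3.5).
    x  -- current iterate (NumPy array)
    mu -- dict mapping vertices (tuples of floats) to their positive convex weights
    Returns the new iterate and the updated representation (no representation reduction)."""
    g = grad_f(x)
    p = np.asarray(vertex_oracle(g), dtype=float)              # forward vertex p^k
    u_key = max(mu, key=lambda v: float(np.dot(g, np.array(v))))
    u = np.array(u_key)                                        # away vertex u^k
    d_fw, d_aw = p - x, x - u
    # pick the direction with the smaller inner product with the gradient
    if np.dot(g, d_fw) <= np.dot(g, d_aw):
        d, gamma_max, forward = d_fw, 1.0, True
    else:
        mu_u = mu[u_key]
        d, gamma_max, forward = d_aw, mu_u / max(1.0 - mu_u, 1e-16), False
    if np.dot(d, d) <= 1e-16:
        return x, mu                                           # x is (numerically) optimal
    gamma = min(max(-np.dot(g, d) / (lipschitz_rho * np.dot(d, d)), 0.0), gamma_max)
    x_new = x + gamma * d
    # vertex representation update (VRU), without the reduction procedure R
    if forward:
        mu = {v: (1.0 - gamma) * w for v, w in mu.items()}
        p_key = tuple(float(t) for t in p)
        mu[p_key] = mu.get(p_key, 0.0) + gamma
    else:
        mu = {v: (1.0 + gamma) * w for v, w in mu.items()}
        mu[u_key] -= gamma
    # vertices whose weight is nullified are removed (a drop step when gamma = gamma_max in an away step)
    mu = {v: w for v, w in mu.items() if w > 1e-12}
    return x_new, mu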

The stepsize in the ASCG algorithm can be chosen according to one of the following stepsize selection rules, where \(\mathbf{d}^k\) and \(\overline{\gamma }^k\) are as defined in the algorithm.

$$\begin{aligned} \gamma ^k{\left\{ \begin{array}{ll} \in {\mathop {{{\mathrm{\arg \!\min }}}}\limits _{{0\le \gamma \le \overline{\gamma }^k}}} f(\mathbf{x}^k+\gamma \mathbf{d}^k) &{} \text {Exact line search}\\ =\min \left\{ {-\frac{\left\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\right\rangle }{\rho \left\| {\mathbf{d}^k}\right\| ^2},\overline{\gamma }^k}\right\} \in {\mathop {{{\mathrm{\arg \!\min }}}}\limits _{{0\le \gamma \le \overline{\gamma }^k}}} \left\{ \gamma \left\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\right\rangle +{\gamma }^2\frac{\rho }{2}\left\| {\mathbf{d}^k}\right\| ^2\right\} &{} \text {Adaptive [18].} \end{array}\right. } \end{aligned}$$
(3.5)

Remark 3.1

It is simple to show that under the above two choices of stepsize strategies, the sequence of function values \(\{f(\mathbf{x}^k)\}_{k \ge 1}\) is nonincreasing.

Since the convergence rate analyses for both of these stepsize options are similar, we chose to conduct a unified analysis covering both cases. The following is the exact definition of the VRU scheme.

[Scheme: the vertex representation updating (VRU) scheme]

The VRU scheme uses a representation reduction procedure \({\mathcal {R}}\) with constant N, which is a procedure that takes a representation \((U,{\varvec{\mu }})\) of a point \(\mathbf{x}\) and replaces it by a representation \((\tilde{U},\tilde{{\varvec{\mu }}})\) of \(\mathbf{x}\) such that \(\tilde{U}\subseteq U\) and \(|\tilde{U}|\le N\). We consider two possible options for the representation reduction procedure:

  1. 1.

    \({\mathcal {R}}\) is the trivial procedure, meaning it does not change the representation, in which case its constant is \(N=|V|\).

  2. 2.

    The procedure \({\mathcal {R}}\) is some implementation of the Carathéodory theorem [22, Sect. 17], in which case its constant is \(N=n+1\). Although this reduction does not influence the convergence results, dealing with more compact representations will reduce the storage space needed for the algorithm, as well as the cost of stage 2 of the ASCG algorithm. This is especially significant when the number of vertices is not polynomial in the problem’s dimension. A full description of the incremental representation reduction (IRR) scheme, which applies the Carathéodory theorem efficiently in this context, is presented in the appendix; a basic (non-incremental) reduction step is sketched right after this list.
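The following Python sketch (ours, assuming NumPy; it is a basic non-incremental variant, not the IRR scheme of the appendix) repeatedly removes a vertex from the representation by exploiting an affine dependence among the current vertices, until at most \(n+1\) of them remain.

import numpy as np

def caratheodory_reduce(vertices, weights, tol=1e-12):
    """Reduce a convex representation x = sum_i weights[i] * vertices[i] to one
    using at most n+1 vertices (Caratheodory), without changing x.
    vertices: (N, n) array; weights: length-N array of positive weights summing to 1."""
    V = np.asarray(vertices, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = V.shape[1]
    while V.shape[0] > n + 1:
        # find an affine dependence: alpha != 0 with V^T alpha = 0 and sum(alpha) = 0
        M = np.vstack([V.T, np.ones(V.shape[0])])
        alpha = np.linalg.svd(M)[2][-1]      # a null-space vector of M (exists since N > n+1)
        if not np.any(alpha > tol):
            alpha = -alpha                   # make sure some entries are positive
        idx = alpha > tol
        t = np.min(w[idx] / alpha[idx])      # largest step keeping all weights nonnegative
        w = w - t * alpha                    # nullifies at least one weight, preserves x
        keep = w > tol
        V, w = V[keep], w[keep]
    return V, w / w.sum()                    # renormalize to guard against round-off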

3.3 Rate of convergence analysis

We will now prove the linear rate of convergence for the ASCG algorithm for problem (P). In the following we use \(I(\mathbf{x})\) to denote the index set of the active constraints at \(\mathbf{x}\),

$$\begin{aligned} I(\mathbf{x})=\left\{ {i\in \left\{ {1,\ldots ,n}\right\} : \mathbf{A}_i\mathbf{x}=a_i}\right\} . \end{aligned}$$

Similarly, for a given set U, the set of active constraints for all the points in U is defined as

$$\begin{aligned} I(U)=\left\{ {i\in \left\{ {1,\ldots ,n}\right\} : \mathbf{A}_i\mathbf{v}=a_i,\;\forall \mathbf{v}\in U}\right\} =\bigcap _{\mathbf{v}\in U} I(\mathbf{v}). \end{aligned}$$

We present the following technical lemma, which is similar to a result presented by Jaggi and Lacoste-Julien [16]. In [16] the proof is based on geometric considerations, and utilizes the so-called “pyramidal width” constant, which is the optimal value of a complicated optimization problem. In contrast, the proof below relies on simple linear programming duality arguments, and in addition, the derived constant \(\Omega _X\), which replaces the pyramidal width constant, is computable for many choices of the set X.

Lemma 3.1

Let \(U\subseteq V\) and \(\mathbf{c}\in \mathbb {R}^n\). If there exists \(\mathbf{z}\in \mathbb {R}^n\) such that \(\mathbf{A}_{I(U)}\mathbf{z}\le 0\) and \(\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle >0\), then

$$\begin{aligned} \max _{\mathbf{p}\in V, \mathbf{u}\in U}\left\langle {\mathbf{c}},{\mathbf{p}-\mathbf{u}}\right\rangle \ge \frac{\Omega _X}{(n+1)}\frac{\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle }{\left\| {\mathbf{z}}\right\| }, \end{aligned}$$

where

$$\begin{aligned} \Omega _X=\min \limits _{\mathbf{v}\in V,i\in \left\{ {1,\ldots ,m}\right\} :a_i>\mathbf{A}_i\mathbf{v}}\frac{a_i-\mathbf{A}_i\mathbf{v}}{ \left\| {\mathbf{A}_i}\right\| }. \end{aligned}$$
(3.6)

Proof

By the fundamental theorem of linear programming [11], we can maximize the function \(\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) on X instead of on V and get the same optimal value. Similarly, we can minimize the function \(\left\langle {\mathbf{c}},{\mathbf{y}}\right\rangle \) on \({{\mathrm{conv}}}(U)\) instead of on U, and obtain the same optimal value. Therefore,

$$\begin{aligned} \max _{\mathbf{p}\in V, \mathbf{u}\in U}\left\langle {\mathbf{c}},{\mathbf{p}-\mathbf{u}}\right\rangle= & {} \max _{\mathbf{p}\in V}\left\langle {\mathbf{c}},{\mathbf{p}}\right\rangle -\min _{\mathbf{u}\in U}\left\langle {\mathbf{c}},{\mathbf{u}}\right\rangle \nonumber \\= & {} \max _{\mathbf{x}\in X} \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle -\min _{\mathbf{y}\in {{\mathrm{conv}}}(U)}\left\langle {\mathbf{c}},{\mathbf{y}}\right\rangle \nonumber \\= & {} \max _{\mathbf{x}:\mathbf{A}\mathbf{x}\le \mathbf{a}} \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle +\max _{\mathbf{y}\in {{\mathrm{conv}}}(U)}\left\{ {-\left\langle {\mathbf{c}},{\mathbf{y}}\right\rangle }\right\} . \end{aligned}$$
(3.7)

Since X is nonempty and bounded, the problem in \(\mathbf{x}\) is feasible and bounded above. Therefore, by strong duality for linear programming,

$$\begin{aligned} \max _{\mathbf{x}:\mathbf{A}\mathbf{x}\le \mathbf{a}} \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle =\min _{{\varvec{\eta }}\in \mathbb {R}^m_+: \mathbf{A}^T{\varvec{\eta }}=\mathbf{c}} \left\langle {\mathbf{a}},{{\varvec{\eta }}}\right\rangle . \end{aligned}$$
(3.8)

Plugging (3.8) back into (3.7) we obtain:

$$\begin{aligned} \max _{\mathbf{p}\in V, \mathbf{u}\in U}\left\langle {\mathbf{c}},{\mathbf{p}-\mathbf{u}}\right\rangle= & {} \min _{{\varvec{\eta }}\in \mathbb {R}^m_+: \mathbf{A}^T{\varvec{\eta }}=\mathbf{c}} \left\langle {\mathbf{a}},{{\varvec{\eta }}}\right\rangle +\max _{\mathbf{y}\in {{\mathrm{conv}}}(U)}\left\{ {-\left\langle {\mathbf{c}},{\mathbf{y}}\right\rangle }\right\} \nonumber \\= & {} \min _{{\varvec{\eta }}\in \mathbb {R}^m_+: \mathbf{A}^T{\varvec{\eta }}=\mathbf{c}}\max _{\mathbf{y}\in {{\mathrm{conv}}}(U)} \left\langle {\mathbf{a}-\mathbf{A}\mathbf{y}},{{\varvec{\eta }}}\right\rangle . \end{aligned}$$
(3.9)

Let \(\tilde{\mathbf{y}}=\frac{1}{|U|}\sum _{\mathbf{v}\in U}\mathbf{v}\). Then by the Carathéodory theorem there exists a set \(\tilde{U}\subseteq U\) with cardinality \(|\tilde{U}|\le n+1\) such that \(\tilde{\mathbf{y}}\in {{\mathrm{conv}}}(\tilde{U})\). We will prove that \(I(\tilde{U}) = I(U)\). Since \(\tilde{U}\subseteq U\), we have \(I(U)\subseteq I(\tilde{U})\). To prove the reverse inclusion, let \(i\notin I(U)\); then it follows from the definition of \(\tilde{\mathbf{y}}\) that

$$\begin{aligned} a_i-\mathbf{A}_i\tilde{\mathbf{y}}=\frac{1}{|U|}\sum _{\mathbf{u}\in U}(a_i-\mathbf{A}_i\mathbf{u})\ge \frac{1}{|U|} \max _{\mathbf{u}\in U} (a_i-\mathbf{A}_i\mathbf{u})> 0, \end{aligned}$$

and so, since \(\tilde{\mathbf{y}}\in {{\mathrm{conv}}}(\tilde{U})\), there exists \({\varvec{\lambda }}\ge \mathbf{0}\) where \(\mathbf{1}^T {\varvec{\lambda }}=1\) such that \(\tilde{\mathbf{y}} = \sum _{\mathbf{u}\in \tilde{U}} \lambda _{\mathbf{u}}\mathbf{u}\); consequently,

$$\begin{aligned} \max _{\mathbf{u}\in \tilde{U}} (a_i-\mathbf{A}_i\mathbf{u}) \ge \sum _{\mathbf{u}\in \tilde{U}}\lambda _\mathbf{u}(a_i-\mathbf{A}_i\mathbf{u}) = a_i-\mathbf{A}_i\tilde{\mathbf{y}}>0 \end{aligned}$$

which implies that \(i \notin I(\tilde{U})\). Thus, \(I(\tilde{U})\subseteq I(U)\), establishing the fact that \(I(\tilde{U}) = I(U)\). We define \(\bar{\mathbf{y}}=\frac{1}{|\tilde{U}|}\sum _{\mathbf{v}\in \tilde{U}}\mathbf{v}\). Since \(\overline{\mathbf{y}}\in {{\mathrm{conv}}}(\tilde{U}) \subseteq {{\mathrm{conv}}}(U)\), we have that

$$\begin{aligned} \max _{\mathbf{y}\in {{\mathrm{conv}}}(U)} \left\langle {\mathbf{a}-\mathbf{A}\mathbf{y}},{{\varvec{\eta }}}\right\rangle \ge \left\langle {\mathbf{a}-\mathbf{A}\overline{\mathbf{y}}},{{\varvec{\eta }}}\right\rangle \end{aligned}$$

for any value of \({\varvec{\eta }}\), and therefore,

$$\begin{aligned} \min _{{\varvec{\eta }}\in \mathbb {R}^m_+: \mathbf{A}^T{\varvec{\eta }}=\mathbf{c}}\max _{\mathbf{y}\in {{\mathrm{conv}}}(U)} \left\langle {\mathbf{a}-\mathbf{A}\mathbf{y}},{{\varvec{\eta }}}\right\rangle \ge \min _{{\varvec{\eta }}\in \mathbb {R}^m_+: \mathbf{A}^T{\varvec{\eta }}=\mathbf{c}}\left\langle {\mathbf{a}-\mathbf{A}\overline{\mathbf{y}}},{{\varvec{\eta }}}\right\rangle . \end{aligned}$$
(3.10)

Using strong duality on the RHS of (3.10), we obtain that

$$\begin{aligned} \begin{aligned} \min _{{\varvec{\eta }}\in \mathbb {R}^m_+: \mathbf{A}^T{\varvec{\eta }}=\mathbf{c}} \left\langle {\mathbf{a}-\mathbf{A}\overline{\mathbf{y}}},{{\varvec{\eta }}}\right\rangle&=\max _{\mathbf{x}}\left\{ {\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle : \mathbf{A}\mathbf{x}\le (\mathbf{a}-\mathbf{A}\overline{\mathbf{y}})}\right\} . \end{aligned} \end{aligned}$$
(3.11)

Denote \(J=I(\tilde{U})\) and \(\overline{J}=\left\{ {1,\ldots ,m}\right\} {\setminus } J\). From the definition of \(I(\tilde{U})\), it follows that

$$\begin{aligned} \mathbf{a}_J-\mathbf{A}_J\mathbf{v}=\mathbf{0}\end{aligned}$$
(3.12)

for all \(\mathbf{v}\in \tilde{U}\), and that for any \(i\in \overline{J}\) there exists at least one vertex \(\mathbf{v}\in \tilde{U}\) such that \(a_i-\mathbf{A}_i\mathbf{v}>0\), and hence,

$$\begin{aligned} a_i-\mathbf{A}_i\mathbf{v}\ge \min _{\mathbf{u}\in V: \mathbf{a}_i>\mathbf{A}_i\mathbf{u}}(a_i-\mathbf{A}_i\mathbf{u}), \end{aligned}$$

which in particular implies that

$$\begin{aligned} \frac{1}{\left\| {\mathbf{A}_i}\right\| }\sum _{\mathbf{v}\in \tilde{U}} (a_i -\mathbf{A}_i \mathbf{v}) \ge \frac{1}{\left\| {\mathbf{A}_i}\right\| }\min _{\mathbf{u}\in V: \mathbf{a}_i>\mathbf{A}_i\mathbf{u}}(a_i-\mathbf{A}_i\mathbf{u}) \ge \Omega _X. \end{aligned}$$
(3.13)

We denote \(\overline{\mathbf{A}_i}=\mathbf{A}_i/\left\| {\mathbf{A}_i}\right\| \) and \(\overline{a}_i=a_i/\left\| {\mathbf{A}_i}\right\| \), and similarly, \(\bar{\mathbf{a}}_J\) denotes the vector comprising the components \(\overline{a}_i,i \in J\) and \(\overline{\mathbf{A}}_J\) stands for the matrix whose rows are \(\overline{\mathbf{A}}_i, i \in J\). From the choice of \(\overline{\mathbf{y}}\), we can conclude from (3.12) and (3.13) that

$$\begin{aligned} \begin{aligned} \overline{\mathbf{a}}_J-\overline{\mathbf{A}}_J\overline{\mathbf{y}}&=\mathbf{0},\\ \overline{\mathbf{a}}_{\overline{J}}-\overline{\mathbf{A}}_{\overline{J}}\overline{\mathbf{y}}&=\frac{1}{|\tilde{U}|}\sum _{\mathbf{v}\in \tilde{U}}(\overline{\mathbf{a}}_{\overline{J}}-\overline{\mathbf{A}}_{\overline{J}}\mathbf{v})\ge \mathbf{1}\frac{\Omega _X }{|\tilde{U}|}. \end{aligned} \end{aligned}$$
(3.14)

Therefore, replacing the RHS of the set of inequalities \(\mathbf{A}\mathbf{x}\le (\mathbf{a}-\mathbf{A}\overline{\mathbf{y}})\) in (3.11) by the bounds given in (3.14), we obtain that

$$\begin{aligned} \begin{aligned} \max _{\mathbf{x}}\left\{ {\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle : \mathbf{A}\mathbf{x}\le (\mathbf{a}-\mathbf{A}\overline{\mathbf{y}})}\right\}&\ge \max _{\mathbf{x}}\left\{ {\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle : \overline{\mathbf{A}}_J\mathbf{x}\le \mathbf{0},\;\overline{\mathbf{A}}_{\overline{J}}\mathbf{x}\le \mathbf{1}\frac{\Omega _X}{|\tilde{U}|}}\right\} . \end{aligned} \end{aligned}$$
(3.15)

Combining (3.9), (3.10), (3.11) and (3.15), it follows that

$$\begin{aligned} \max _{\mathbf{p}\in V, \mathbf{u}\in U}\left\langle {\mathbf{c}},{\mathbf{p}-\mathbf{u}}\right\rangle \ge Z^*, \end{aligned}$$
(3.16)

where

$$\begin{aligned} Z^*=\max _{\mathbf{x}}\left\{ {\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle :\overline{\mathbf{A}}_J\mathbf{x}\le \mathbf{0},\;\overline{\mathbf{A}}_{\overline{J}}\mathbf{x}\le \mathbf{1}\frac{\Omega _X }{|\tilde{U}|}}\right\} . \end{aligned}$$
(3.17)

We will now show that it is not possible for \(\mathbf{z}\) to satisfy \(\mathbf{A}_{\overline{J}}\mathbf{z}\le 0\) (recall that \(J=I(U)=I(\tilde{U})\)). Suppose by contradiction that \(\mathbf{z}\) does satisfy \(\mathbf{A}_{\overline{J}}\mathbf{z}\le 0\). Then \({\mathbf{x}}_{\alpha }=\alpha \mathbf{z}\) is a feasible solution of problem (3.17) for any \(\alpha >0\), and since \(\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle >0\) we obtain that \(\left\langle {\mathbf{c}},{{\mathbf{x}}_{\alpha }}\right\rangle \rightarrow \infty \) as \(\alpha \rightarrow \infty \), and thus \(Z^*=\infty \). However, since V contains a finite number of points, the LHS of (3.16) is bounded from above, and so \(Z^*<\infty \), which is a contradiction. Therefore, there exists \(i\in \overline{J}\) such that \(\mathbf{A}_i\mathbf{z}>0\). Since \(\mathbf{z}\ne \mathbf{0}\), the vector \(\overline{\mathbf{x}}=\frac{\mathbf{z}}{\left\| {\mathbf{z}}\right\| }\frac{\Omega _X}{|\tilde{U}|}\) is well defined. Moreover, \(\overline{\mathbf{x}}\) satisfies

$$\begin{aligned} \overline{\mathbf{A}}_J\overline{\mathbf{x}}=\frac{\Omega _X}{\left\| {\mathbf{z}}\right\| |\tilde{U}|}{\overline{\mathbf{A}}_J\mathbf{z}}\le 0, \end{aligned}$$
(3.18)

and (recalling that \(\Vert \overline{\mathbf{A}}_i\Vert =1\))

$$\begin{aligned} \overline{\mathbf{A}}_i\overline{\mathbf{x}}=\overline{\mathbf{A}}_i\mathbf{z}\frac{\Omega _X}{|\tilde{U}|\left\| {\mathbf{z}}\right\| }\le \left\| {\overline{\mathbf{A}}_i}\right\| \left\| {\mathbf{z}}\right\| \frac{\Omega _X}{|\tilde{U}|\left\| {\mathbf{z}}\right\| }= \frac{\Omega _X}{|\tilde{U}|},\quad \forall i\in \overline{J}, \end{aligned}$$
(3.19)

where the inequality follows from the Cauchy-Schwarz inequality. Consequently, (3.18) and (3.19) imply that \(\overline{\mathbf{x}}\) is a feasible solution for problem (3.17). Therefore, \(Z^*\ge \left\langle {\mathbf{c}},{\overline{\mathbf{x}}}\right\rangle \), which by (3.16) yields

$$\begin{aligned} \max _{\mathbf{p}\in V, \mathbf{u}\in U}\left\langle {\mathbf{c}},{\mathbf{p}-\mathbf{u}}\right\rangle \ge \left\langle {\mathbf{c}},{\overline{\mathbf{x}}}\right\rangle =\frac{\Omega _X}{|\tilde{U}|}\frac{\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle }{\left\| {\mathbf{z}}\right\| }\ge \frac{\Omega _X}{n+1}\frac{\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle }{\left\| {\mathbf{z}}\right\| }, \end{aligned}$$

where the last inequality utilizes the fact that \(|\tilde{U}|\le n+1\). \(\square \)

The constant \(\Omega _X\) represents a normalized minimal distance between the hyperplanes that contain facets of X and the vertices of X which do not lie on those hyperplanes. We will refer to \(\Omega _X\) as the vertex-facet distance of X. Examples for the derivation of \(\Omega _X\) for some simple polyhedral sets can be found in Sect. 3.4.
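For small polytopes, \(\Omega _X\) can also be evaluated directly from (3.6). The sketch below is ours (assuming NumPy and that the vertex set is available explicitly; the function name is hypothetical): it enumerates all vertex-constraint pairs with \(a_i>\mathbf{A}_i\mathbf{v}\).

import itertools
import numpy as np

def vertex_facet_distance(A, a, vertices, tol=1e-12):
    """Evaluate Omega_X of (3.6): the minimum of (a_i - A_i v) / ||A_i|| over all
    vertices v of X and all constraints i that are inactive at v."""
    A = np.asarray(A, dtype=float)
    a = np.asarray(a, dtype=float)
    omega = np.inf
    for v in np.asarray(vertices, dtype=float):
        slack = a - A @ v                              # nonnegative for feasible v
        for i in np.nonzero(slack > tol)[0]:           # constraints not active at v
            omega = min(omega, slack[i] / np.linalg.norm(A[i]))
    return omega

# Example: the unit box [-1, 1]^3, described by A x <= a with A = [I; -I], a = 1.
A = np.vstack([np.eye(3), -np.eye(3)])
a = np.ones(6)
vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
print(vertex_facet_distance(A, a, vertices))           # prints 2.0 for this box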

The following lemma is a technical result stating that the active constraints at a given point are the same as the active constraints of the set of vertices in its compact representation.

Lemma 3.2

Let \(\mathbf{x}\in X\) and the set \(U\subseteq V\) satisfy \(\mathbf{x}=\sum _{\mathbf{v}\in U} \mu _\mathbf{v}\mathbf{v}\), where \({\varvec{\mu }}\in \Delta ^+_{|U|}\). Then \(I(\mathbf{x})=I(U)\).

Proof

It is trivially true that \(I(U)\subseteq I(\mathbf{x})\) since \(\mathbf{x}\) is a convex combination of points in the affine space defined by \(\left\{ {\mathbf{y}:\mathbf{A}_{I(U)}\mathbf{y}=\mathbf{a}_{I(U)}}\right\} \). We will prove that \(I(\mathbf{x})\subseteq I(U)\). Any \(\mathbf{v}\in U\subseteq X\) satisfies \(\mathbf{A}_{I(\mathbf{x})}\mathbf{v}\le \mathbf{a}_{I(\mathbf{x})}\). Assume to the contrary, that there exists \(i\in I(\mathbf{x})\) such that some \(\mathbf{u}\in U\) satisfies \(\mathbf{A}_i\mathbf{u}<a_i\). Since \(\mu _\mathbf{u}>0\) and \(\sum _{\mathbf{v}\in U} \mu _\mathbf{v}=1\), it follows that

$$\begin{aligned} \mathbf{A}_{i}\mathbf{x}=\sum _{\mathbf{v}\in U} \mu _\mathbf{v}\mathbf{A}_i\mathbf{v}< \sum _{\mathbf{v}\in U} \mu _\mathbf{v}a_i =a_i, \end{aligned}$$

in contradiction to the assumption that \(i\in I(\mathbf{x})\). \(\square \)

Corollary 3.1

For any \(\mathbf{x}\in X{\setminus } X^*\) which can be represented as \(\mathbf{x}=\sum _{\mathbf{v}\in U} \mu _\mathbf{v}\mathbf{v}\) for some \({\varvec{\mu }}\in \Delta ^+_{|U|}\) and \(U\subseteq V\), it holds that

$$\begin{aligned} \max _{\mathbf{u}\in U,\mathbf{p}\in V} \left\langle {\nabla f(\mathbf{x})},{\mathbf{u}-\mathbf{p}}\right\rangle \ge \frac{\Omega _X}{(n+1)} \max _{\mathbf{x}^*\in X^*}\frac{\left\langle {\nabla f(\mathbf{x})},{\mathbf{x}-\mathbf{x}^*}\right\rangle }{\left\| {\mathbf{x}-\mathbf{x}^*}\right\| }. \end{aligned}$$

Proof

For any \(\mathbf{x}\in X{\setminus } X^*\), define \(\mathbf{c}=-\nabla f(\mathbf{x})\). It follows from Lemma 3.2 that \(I(U)=I(\mathbf{x})\). For any \(\mathbf{x}^*\in X^*\), the vector \(\mathbf{z}=\mathbf{x}^*-\mathbf{x}\) satisfies

$$\begin{aligned} \mathbf{A}_{I(U)}\mathbf{z}=\mathbf{A}_{I(\mathbf{x})}\mathbf{z}=\mathbf{A}_{I(\mathbf{x})}\mathbf{x}^*-\mathbf{A}_{I(\mathbf{x})}\mathbf{x}\le \mathbf{a}_{I(\mathbf{x})}-\mathbf{a}_{I(\mathbf{x})}=\mathbf{0}, \end{aligned}$$

and, from the convexity of f, as well as the optimality of \(\mathbf{x}^*\), \(\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle =-\left\langle {\nabla f(\mathbf{x})},{\mathbf{x}^*-\mathbf{x}}\right\rangle \ge f(\mathbf{x})-f(\mathbf{x}^*)>0\). Therefore, invoking Lemma 3.1 yields the desired result. \(\square \)

We now present the main theorem of this section, which establishes the linear rate of convergence of the ASCG method for solving problem (P). This theorem is an extension of [16, Theorem 7], and the proof follows the same general arguments, while incorporating the use of the error bound from Lemma 2.5 and the new constant \(\Omega _X\).

Theorem 3.1

Let \(\{\mathbf{x}^k\}_{k \ge 1}\) be the sequence generated by the ASCG algorithm for solving problem (P), and let \(f^*\) be the optimal value of problem (P). Then for any \(k\ge 1\)

$$\begin{aligned} f(\mathbf{x}^k)-f^*\le C(1-\alpha ^{\dag })^{(k-1)/2}, \end{aligned}$$
(3.20)

where

$$\begin{aligned} \alpha ^\dag =\min \left\{ \frac{(\Omega _X)^2}{8\rho \kappa D^2(n+1)^2}, \frac{1}{2} \right\} , \end{aligned}$$
(3.21)

\(\kappa =\theta ^2 \left( \left\| {\mathbf{b}}\right\| D+3GD_\mathbf{E}+\frac{2(G^2+1)}{\sigma _g}\right) \) with \(\theta \) being the Hoffman constant associated with the matrix \(\left[ \mathbf{A}^T,\mathbf{E}^T,\mathbf{b}\right] ^T\), \(C=GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D\), and \(\Omega _X\) is the vertex-facet distance of X given in (3.6).

Proof

For each k we will denote the stepsize generated by exact line search as \(\gamma _e^k\) and the adaptive stepsize as \(\gamma _a^k\). Then

$$\begin{aligned} \begin{aligned} f\left( \mathbf{x}^k+\gamma _e^k\mathbf{d}^k\right)&\le f(\mathbf{x}^{k+1})\le f\left( \mathbf{x}^k+\gamma _a^k\mathbf{d}^k\right) . \end{aligned} \end{aligned}$$
(3.22)

From Lemma 2.1 (the descent lemma), we have that

$$\begin{aligned} \begin{aligned} f\left( \mathbf{x}^k+\gamma _a^k\mathbf{d}^k\right) \le f(\mathbf{x}^k)+\gamma _a^k\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle +\frac{\left( \gamma _a^k\right) ^2\rho }{2}\Vert \mathbf{d}^k\Vert ^2. \end{aligned} \end{aligned}$$
(3.23)

Assume that \(\mathbf{x}^k\notin X^*\). Then for any \(\mathbf{x}^*\in X^*\) we have that

$$\begin{aligned} \langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle= & {} \min \left\{ { \langle {\nabla f(\mathbf{x}^k)},{\mathbf{p}^k-\mathbf{x}^k}\rangle , \langle {\nabla f(\mathbf{x}^k)},{\mathbf{x}^k-\mathbf{u}^k}\rangle }\right\} \nonumber \\\le & {} \langle {\nabla f(\mathbf{x}^k)},{\mathbf{p}^k-\mathbf{x}^k}\rangle \nonumber \\\le & {} \langle {\nabla f(\mathbf{x}^k)},{\mathbf{x}^*-\mathbf{x}^k}\rangle \nonumber \\\le & {} f^*-f(\mathbf{x}^k), \end{aligned}$$
(3.24)

where the first equality is derived from the algorithm’s specific choice of \(\mathbf{d}^k\), the third line follows from the fact that \(\mathbf{p}^k=\tilde{\mathcal {O}}_X(\nabla f(\mathbf{x}^k))\), and the fourth line follows from the convexity of f. In particular, \(\mathbf{d}^k\ne \mathbf{0}\), and by (3.5), \(\gamma _a^k\) is equal to

$$\begin{aligned} \gamma _a^k=\min \left\{ {-\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle }{\rho \Vert {\mathbf{d}^k}\Vert ^2},\overline{\gamma }^k}\right\} . \end{aligned}$$
(3.25)

We now separate the analysis into three cases: (a) \(\mathbf{d}^k=\mathbf{p}^k-\mathbf{x}^k\) and \(\gamma _a^k=\overline{\gamma }^k\), (b) \(\mathbf{d}^k=\mathbf{x}^k-\mathbf{u}^k\) and \(\gamma _a^k=\overline{\gamma }^k\), and (c) \(\gamma _a^k<\overline{\gamma }^k\).

In cases (a) and (b), it follows from (3.25) that

$$\begin{aligned} \overline{\gamma }^k\rho \Vert {\mathbf{d}^k}\Vert ^2\le -\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle . \end{aligned}$$
(3.26)

Using inequalities (3.22), (3.23) and (3.26), we obtain

$$\begin{aligned} \begin{aligned} f(\mathbf{x}^{k+1})&\le f(\mathbf{x}^k)+\gamma _a^k\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle +\frac{(\gamma _a^k)^2\rho }{2}\Vert {\mathbf{d}^k}\Vert ^2\\&\le f(\mathbf{x}^k)+\frac{\overline{\gamma }^k}{2}\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle . \end{aligned} \end{aligned}$$

Subtracting \(f^*\) from both sides of the inequality and using (3.24), we have that

$$\begin{aligned} f(\mathbf{x}^{k+1})-f^*\le & {} f(\mathbf{x}^k)-f^*+\frac{\overline{\gamma }^k}{2}\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle \nonumber \\\le & {} (f(\mathbf{x}^k)-f^*)\left( 1-\frac{\overline{\gamma }^k}{2}\right) . \end{aligned}$$
(3.27)

In case (a), \(\overline{\gamma }^k=1\), and hence

$$\begin{aligned} f(\mathbf{x}^{k+1})-f^*\le \frac{f(\mathbf{x}^k)-f^*}{2}. \end{aligned}$$
(3.28)

In case (b), we have no positive lower bound on \(\overline{\gamma }^k\), and therefore we can only conclude, by the nonnegativity of \(\overline{\gamma }^k\), that

$$\begin{aligned} f(\mathbf{x}^{k+1})-f^* \le {f(\mathbf{x}^k)-f^*}. \end{aligned}$$

However, case (b) is a drop step, meaning in particular that \(|U^{k+1}|\le |U^{k}|-1\), since before applying the representation reduction procedure \({\mathcal {R}}\), we eliminate one of the vertices in the representation of \(\mathbf{x}^{k}\). Denoting the number of drop steps until iteration k as \(s^k\), and the number of forward steps until iteration k as \(l^k\), it follows from the algorithm’s definition that \(l^k+s^k\le k-1\) (at each iteration we add a vertex, remove a vertex, or neither) and \(s^k\le l^k\) (the number of removed vertices cannot exceed the number of added vertices), and therefore \(s^k\le (k-1)/2\).

We now turn to case (c). In this case, (3.25) implies

$$\begin{aligned} \gamma ^k_a=-\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle }{\rho \Vert {\mathbf{d}^k}\Vert ^2}, \end{aligned}$$

which combined with (3.22) and (3.23) results in

$$\begin{aligned} f(\mathbf{x}^{k+1}) \le f(\mathbf{x}^k)+\gamma _a^k\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle +\frac{\left( \gamma _a^k\right) ^2\rho }{2}\Vert {\mathbf{d}^k}\Vert ^2 =f(\mathbf{x}^k)-\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle ^2}{2\rho \Vert {\mathbf{d}^k}\Vert ^2}. \end{aligned}$$
(3.29)

From the algorithm’s specific choice of \(\mathbf{d}^k\), we obtain that

$$\begin{aligned} \begin{aligned} 0\ge \langle {\nabla f(\mathbf{x}^k)},{\mathbf{p}^k-\mathbf{u}^k}\rangle&=\langle {\nabla f(\mathbf{x}^k)},{\mathbf{p}^k-\mathbf{x}^k}\rangle +\langle {\nabla f(\mathbf{x}^k)},{\mathbf{x}^k-\mathbf{u}^k}\rangle \\&\ge 2\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle . \end{aligned} \end{aligned}$$
(3.30)

Applying the bound (3.30) and the inequality \(\left\| {\mathbf{d}^k}\right\| \le D\) to (3.29), we obtain

$$\begin{aligned} f(\mathbf{x}^{k+1}) \le f(\mathbf{x}^k)-\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{d}^k}\rangle ^2}{2\rho \Vert {\mathbf{d}^k}\Vert ^2}\le f(\mathbf{x}^k)-\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{p}^k-\mathbf{u}^k}\rangle ^2}{8\rho D^2}. \end{aligned}$$
(3.31)

By the definitions of \(\mathbf{u}^k\) and \(\mathbf{p}^k\), Corollary 3.1 implies that for any \(\mathbf{x}^*\in X^*\),

$$\begin{aligned} \langle {\nabla f(\mathbf{x}^k)},{\mathbf{u}^k-\mathbf{p}^k}\rangle =\max _{\mathbf{p}\in V, \mathbf{u}\in U^k}\langle {\nabla f(\mathbf{x}^k)},{\mathbf{u}-\mathbf{p}}\rangle \ge \frac{\Omega _X}{n+1}\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{x}^k-\mathbf{x}^*}\rangle }{\Vert {\mathbf{x}^k-\mathbf{x}^*}\Vert }. \end{aligned}$$
(3.32)

Lemma 2.5 implies that there exists \(\mathbf{x}^*\in X^*\) such that \(\Vert {\mathbf{x}^k-\mathbf{x}^*}\Vert ^2\le \kappa (f(\mathbf{x}^k)-f^*)\), which, combined with the convexity of f, yields the following lower bound on the left-hand side of (3.32):

$$\begin{aligned} \begin{aligned} \langle {\nabla f(\mathbf{x}^k)},{\mathbf{u}^k-\mathbf{p}^k}\rangle ^2&\ge \left( \frac{\Omega _X}{n+1}\right) ^2\frac{\left\langle {\nabla f(\mathbf{x}^k)},{\mathbf{x}^k-\mathbf{x}^*}\right\rangle ^2}{\left\| {\mathbf{x}^k-\mathbf{x}^*}\right\| ^2}\\&\ge \left( \frac{\Omega _X}{n+1}\right) ^2\frac{(f(\mathbf{x}^k)-f(\mathbf{x}^*))^2}{\left\| {\mathbf{x}^k-\mathbf{x}^*}\right\| ^2}\\&\ge \left( \frac{\Omega _X}{n+1}\right) ^2\frac{(f(\mathbf{x}^k)-f^*)^2}{\kappa (f(\mathbf{x}^k)-f^*)}\\&=\frac{(\Omega _X)^2}{(n+1)^2\kappa }(f(\mathbf{x}^k)-f^*), \end{aligned} \end{aligned}$$

which along with (3.31) yields

$$\begin{aligned} f(\mathbf{x}^{k+1})-f^*\le & {} f(\mathbf{x}^k)-f^*-\frac{\langle {\nabla f(\mathbf{x}^k)},{\mathbf{u}^k-\mathbf{p}^k}\rangle ^2}{8\rho D^2}\nonumber \\\le & {} (f(\mathbf{x}^k)-f^*)\left( 1-\frac{(\Omega _X)^2}{8\rho \kappa D^2(n+1)^2}\right) \end{aligned}$$
(3.33)

Therefore, if either case (a) or case (c) occurs, then by (3.28) and (3.33), it follows that

$$\begin{aligned} f(\mathbf{x}^{k+1})-f^* \le (1-\alpha ^{\dag }) (f(\mathbf{x}^k)-f^*), \end{aligned}$$
(3.34)

where \(\alpha ^{\dag }\) is defined in (3.21). We can therefore conclude from cases (a)–(c) that until iteration k we have at least \(\frac{k-1}{2}\) iterations for which (3.34) holds, and therefore

$$\begin{aligned} \begin{aligned} f(\mathbf{x}^{k})-f^*&\le (f(\mathbf{x}^1)-f^*)(1-\alpha ^\dag )^{(k-1)/2}. \end{aligned} \end{aligned}$$
(3.35)

Applying Lemma 2.4 with \(\mathbf{x}=\mathbf{x}^1\), we obtain \(f(\mathbf{x}^1)-f^*\le C\), and the desired result (3.20) follows. \(\square \)

Remark 3.2

Notice that while the Hoffman constant enables us to conduct the analysis despite the presence of the linear term in the objective function, the resulting convergence result is no longer affine invariant. This is a direct consequence of Lemma 2.5, which is needed in order to handle the linear part of the objective.

3.4 Examples of computing the vertex-facet distance \(\Omega _X\)

In this section, we demonstrate how to compute the vertex-facet distance constant \(\Omega _X\) for a few simple polyhedral sets. We consider three sets: the unit simplex, the \(\ell _1\) ball and the \(\ell _\infty \) ball. We first describe each set as a system of linear inequalities of the form \(X=\left\{ {\mathbf{x}:\mathbf{A}\mathbf{x}\le \mathbf{a}}\right\} \). Then, given the parameters \(\mathbf{A}\) and \(\mathbf{a}\), as well as the vertex set V, \(\Omega _X\) can be computed directly from its definition (3.6). For the unit simplex and the \(\ell _\infty \) ball, we also compare the constant arising from our analysis with the pyramidal width constant presented in [16], with values presented more recently in [17], and discuss both the tightness and the ease of computation of each.
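As a complement to the closed-form computations below, \(\Omega _X\) can also be evaluated numerically for small instances. The sketch below assumes that (3.6) takes the form used throughout this subsection, namely the minimum, over vertices \(\mathbf{v}\in V\) and rows i with \(\left\langle {\mathbf{A}_i},{\mathbf{v}}\right\rangle <a_i\), of the normalized slack \((a_i-\left\langle {\mathbf{A}_i},{\mathbf{v}}\right\rangle )/\left\| {\mathbf{A}_i}\right\| \); it is an illustrative sketch rather than code from the paper.

```python
import numpy as np

def vertex_facet_distance(A, a, V, tol=1e-12):
    """Numerically evaluate Omega_X for X = {x : A x <= a} with vertex set V.

    A : (m, n) constraint matrix;  a : (m,) right-hand side;
    V : (num_vertices, n) array whose rows are the extreme points of X.
    Assumes (3.6) is the minimum of (a_i - <A_i, v>) / ||A_i|| over all pairs
    (v, i) for which the i-th constraint is inactive at v.
    """
    A, a, V = np.asarray(A, float), np.asarray(a, float), np.asarray(V, float)
    slacks = a[None, :] - V @ A.T                        # a_i - <A_i, v> for every pair (v, i)
    normalized = slacks / np.linalg.norm(A, axis=1)[None, :]
    return normalized[slacks > tol].min()                # keep only inactive constraints
```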

The \({\ell _\infty }\) ball. The \(\ell _\infty \) ball is represented by

$$\begin{aligned} \mathbf{A}=\begin{bmatrix}\mathbf{I}\\ -\mathbf{I}\end{bmatrix}\in \mathbb {R}^{2n\times n},\; \mathbf{a}=\begin{bmatrix}\mathbf{1}\\\mathbf{1}\end{bmatrix}\in \mathbb {R}^{2n}. \end{aligned}$$
(3.36)

The set of extreme points is given by \(V=\left\{ {-1, 1 }\right\} ^n\), which in particular implies that \(|V|=2^n\). Therefore, for large-scale problems, using the representation reduction procedure, which is based on the Carathéodory theorem, is crucial in order to obtain a practical implementation.

From the definition of \(\mathbf{A}\) and V, it follows that

$$\begin{aligned} \left\| {\mathbf{A}_i}\right\| =\left\| {-\mathbf{A}_{n+i}}\right\| =\left\| {\mathbf{e}_i}\right\| =1,\quad i=1,\ldots ,n \end{aligned}$$

and so

$$\begin{aligned} \Omega _X=\min _{i\in \left\{ {1,\ldots ,n}\right\} ,\;\mathbf{v}\in \{-1, 1\}^n: \left\langle {\mathbf{e}_i},{\mathbf{v}}\right\rangle <1}{(1-\left\langle {\mathbf{e}_i},{\mathbf{v}}\right\rangle )}=2. \end{aligned}$$

Comparing this result with the pyramidal width of this cube of side length two, which by [17] equals \({2}/{\sqrt{n}}\), we must recall that the pyramidal width should be compared with \({\Omega _X}/{n}=2/n\), since the latter is a lower bound on the former. Thus, we lose a factor of \(\sqrt{n}\) by using the looser bound \(\Omega _X\). However, the calculation of the pyramidal width in [17] relies strongly on geometric considerations (e.g., the symmetry of the cube) and on properties of the algorithm (the forward step being the maximizer over the chosen direction), while the computation of \(\Omega _X\) is straightforward.
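For a small dimension, the value \(\Omega _X=2\) can be checked numerically with the sketch from the beginning of this subsection (the choice n = 3 below is for illustration only).

```python
import itertools
import numpy as np

n = 3                                                            # small illustrative dimension
A = np.vstack([np.eye(n), -np.eye(n)])                           # rows of (3.36)
a = np.ones(2 * n)
V = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))     # the 2^n sign vectors
print(vertex_facet_distance(A, a, V))                            # expected: 2.0
```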

The unit simplex. The unit simplex \(\Delta _n\) can be represented by

$$\begin{aligned} \mathbf{A}=\begin{bmatrix}-\mathbf{I}\\ \mathbf{1}^T\\ -\mathbf{1}^T\end{bmatrix}\in \mathbb {R}^{(n+2)\times n},\; \mathbf{a}=\begin{bmatrix}\mathbf{0}\\ 1\\ -1\end{bmatrix}\in \mathbb {R}^{n+2}. \end{aligned}$$
(3.37)

The set of extreme points is given by \(V=\left\{ {\mathbf{e}_i}\right\} _{i=1}^n\). Notice that since there are only n extreme points, which are all affinely independent, the representation reduction procedure implementing the Carathéodory theorem coincides with the trivial procedure that does not change the representation. In order to calculate \(\Omega _X\), we first note that \(I(V)=\{n+1,n+2\}\), and therefore

$$\begin{aligned} \left\| {\mathbf{A}_i}\right\| = \left\| {\mathbf{e}_i}\right\| =1,\quad i=1,\ldots n \end{aligned}$$

and

$$\begin{aligned} \Omega _X=\min \limits _{\mathbf{v}\in \left\{ {\mathbf{e}_j}\right\} _{j=1}^n ,i\in \left\{ {1,\ldots ,n}\right\} :-\left\langle {\mathbf{e}_i},{\mathbf{v}}\right\rangle <0}\left\langle {\mathbf{e}_i},{\mathbf{v}}\right\rangle =\min \limits _{i\in \left\{ {1,\ldots ,n}\right\} }\left\| {\mathbf{e}_i}\right\| ^2=1. \end{aligned}$$

The pyramidal width of the unit simplex is, by [17], equal to \({2}/{\sqrt{n}}\), whereas our analysis yields \({\Omega _X}/{n}=1/n\); thus, we again lose a factor of \(\sqrt{n}\) by using the looser bound \(\Omega _X\). The proof in [17] relies on a geometric result concerning the width of the simplex, which is not easily derived, while the computation of \(\Omega _X\) follows directly from V being the standard basis of \(\mathbb {R}^n\).
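The same numerical check, using the sketch from the beginning of this subsection together with the representation (3.37) and a hypothetical small n, recovers \(\Omega _X=1\).

```python
import numpy as np

n = 4                                                            # small illustrative dimension
A = np.vstack([-np.eye(n), np.ones((1, n)), -np.ones((1, n))])   # rows 1..n, n+1, n+2 of (3.37)
a = np.concatenate([np.zeros(n), [1.0, -1.0]])
V = np.eye(n)                                                    # vertices e_1, ..., e_n
print(vertex_facet_distance(A, a, V))                            # expected: 1.0
```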

The \({\ell _1}\) ball. The \(\ell _1\) ball is given by the set

$$\begin{aligned} {X}=\left\{ {\mathbf{x}\in \mathbb {R}^n:\sum \limits _{i=1}^n |x_i|\le 1}\right\} =\left\{ {\mathbf{x}\in \mathbb {R}^n:\left\langle {\mathbf{w}},{\mathbf{x}}\right\rangle \le 1,\forall \mathbf{w}\in \left\{ {-1,1}\right\} ^n}\right\} . \end{aligned}$$

Therefore \(\mathbf{a}=\mathbf{1}\in \mathbb {R}^{2^n}\) and each row of the matrix \(\mathbf{A}\in \mathbb {R}^{2^n\times n}\) is a vector in \(\left\{ {-1,1}\right\} ^n\). The set of extreme points is given by \(V=\left\{ {\mathbf{e}_i}\right\} _{i=1}^n\bigcup \left\{ {-\mathbf{e}_i}\right\} _{i=1}^n\), and therefore has cardinality of \(|V|=2n\).

Finally, we have that

$$\begin{aligned} \left\| {\mathbf{A}_i}\right\| =\sqrt{n},\quad i\in \left\{ {1,\ldots ,2^n}\right\} , \end{aligned}$$

and so

$$\begin{aligned} \begin{aligned} \Omega _X&=\frac{1}{\sqrt{n}}\min _{\mathbf{v}\in V,\mathbf{w}\in \{-1,1\}^n: \left\langle {\mathbf{v}},{\mathbf{w}}\right\rangle<1}(1-\left\langle {\mathbf{v}},{\mathbf{w}}\right\rangle )\\&=\frac{1}{\sqrt{n}}\min _{i\in \left\{ {1,\ldots ,n}\right\} ,\;\mathbf{w}\in \{-1,1\}^n: \left\langle {\mathbf{e}_i},{\mathbf{w}}\right\rangle <1}(1-\left\langle {\mathbf{e}_i},{\mathbf{w}}\right\rangle )\\&=\frac{1}{\sqrt{n}}\min _{\mathbf{w}\in \{-1,1\}^n}(1+|w_i|)=\frac{2}{\sqrt{n}}. \end{aligned} \end{aligned}$$
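Again, for a small illustrative dimension, the value \(2/\sqrt{n}\) can be verified numerically with the sketch from the beginning of this subsection.

```python
import itertools
import numpy as np

n = 3                                                            # small illustrative dimension
A = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))     # the 2^n sign vectors as rows
a = np.ones(2 ** n)
V = np.vstack([np.eye(n), -np.eye(n)])                           # vertices +/- e_i
print(vertex_facet_distance(A, a, V))                            # expected: 2/sqrt(3) = 1.1547...
```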

The pyramidal width of the \(\ell _1\) ball is obviously smaller than that of the simplex, and it is difficult to calculate via the approach suggested in [17].

A recent work of Peña and Rodríguez [20], which appeared after the submission of this paper, showed that the pyramidal width is equal to a quantity referred to as the facial distance of the polytope. This quantity is in fact a refinement of the vertex-facet distance, since it measures the distance between any face of the polytope and the convex hull of the vertices not contained in that face. The vertex-facet distance, in contrast, is the distance from a vertex to the hyperplane containing a single facet, disregarding the additional information about the polytope. Therefore, the bound generated by the vertex-facet distance can be arbitrarily smaller than the facial distance. However, the combinatorial cost of computing the vertex-facet distance is generally linear in the number of vertices and facets, while the facial distance generally requires a computation which is exponentially dependent on the number of vertices (equivalent to the number of polytope faces).