Abstract
We consider the problem of minimizing the sum of a linear function and a composition of a strongly convex function with a linear transformation over a compact polyhedral set. Jaggi and Lacoste-Julien (An affine invariant linear convergence analysis for Frank-Wolfe algorithms. NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends, 2014) show that the conditional gradient method with away steps — employed on the aforementioned problem without the additional linear term — has a linear rate of convergence, depending on the so-called pyramidal width of the feasible set. We revisit this result and provide a variant of the algorithm and an analysis based on simple linear programming duality arguments, as well as corresponding error bounds. This new analysis (a) enables the incorporation of the additional linear term, and (b) depends on a new constant, that is explicitly expressed in terms of the problem’s parameters and the geometry of the feasible set. This constant replaces the pyramidal width, which is difficult to evaluate.
1 Introduction
Consider the minimization problem
$$\begin{aligned} \min _{\mathbf{x}\in X}\; f(\mathbf{x})\equiv \left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle +g(\mathbf{E}\mathbf{x}), \qquad \mathrm {(P)} \end{aligned}$$
where \(X \subseteq \mathbb {R}^n\) is a compact polyhedral set, \(\mathbf{E}\in \mathbb {R}^{m \times n}, \mathbf{b}\in \mathbb {R}^n\) and \(g: \mathbb {R}^m \rightarrow \mathbb {R}\) is strongly convex and continuously differentiable over \(\mathbb {R}^m\). Note that for a general matrix \(\mathbf{E}\), the function f is not necessarily strongly convex.
When the problem at hand is large-scale, first-order methods, which have a relatively low computational cost per iteration, are usually utilized. These methods include, for example, the class of projected (proximal) gradient methods. A drawback of these methods is that under general convexity assumptions, they possess only a sublinear rate of convergence [2, 19], while a linear rate of convergence can be established only under additional conditions, such as strong convexity of the objective function [19]. Luo and Tseng [21] show that the strong convexity assumption can be relaxed and replaced by an assumption on the existence of a local error bound; under this assumption, certain classes of algorithms, which they refer to as “feasible descent methods”, converge at an asymptotically linear rate. Model (P), under the assumptions of strong convexity of g and compactness and polyhedrality of X, was shown in [21] to satisfy this error bound. Wang and Lin [23] extend [21] and show that a global error bound exists for problem (P) under the additional assumption of compactness of X, and derive the exact linear rate for this case. Note that the family of “feasible descent methods” includes the block alternating minimization algorithm (under the assumption of block strong convexity), as well as gradient projection methods, and each of its members is therefore usually at least as costly per iteration as evaluating the orthogonal projection operator onto the feasible set X.
An alternative to algorithms based on projection (or proximal) operators are linear-oracle-based algorithms, such as the conditional gradient (CG) method. The CG algorithm was presented by Frank and Wolfe [8] in 1956 for minimizing a convex function over a compact polyhedral set. At each iteration, the algorithm requires a solution to the problem of minimizing a linear objective function over the feasible set. It is assumed that this solution is obtained by a call to a linear oracle, i.e., a black box which, given a linear function, returns an optimal solution of this linear function over the feasible set (see the exact definition in Sect. 2.3). In some instances, and specifically for certain types of polyhedral sets, such a linear oracle can be implemented more efficiently than computing the orthogonal projection onto the feasible set (see examples in [10]), and therefore the CG algorithm has an advantage over projection-based algorithms. The original paper of Frank and Wolfe also contains a proof of an O(1/k) rate of convergence of the function values to the optimal value. Levitin and Polyak [18] show that this O(1/k) rate extends to the case where the feasible set is a general compact convex set. Canon and Cullum [5] prove that this rate is in fact tight. However, if, in addition to strong convexity of the objective function, the optimal solution lies in the interior of the feasible set, then a linear rate of convergence of the CG method can be established [12]. Epelman and Freund [7], as well as Beck and Teboulle [1], show a linear rate of convergence of the conditional gradient method with a special stepsize choice in the context of finding a point in the intersection of an affine space and a closed convex set under a Slater-type assumption.
Another setting in which linear rate of convergence can be derived is when the feasible set is uniformly (strongly) convex and the norm of the gradient of the objective function is bounded away from zero [18].
Another approach for deriving a linear rate of convergence is to modify the algorithm. For example, Hazan and Garber [10] use local linear oracles in order to show a linear rate of convergence of a “localized” version of the conditional gradient method. A different modification is to use a variation of the conditional gradient method that incorporates away steps. This version of the conditional gradient method, which we refer to as the away steps conditional gradient (ASCG), was initially suggested by Wolfe [24] and then studied by Guelat and Marcotte [12], where a linear rate of convergence was established under the assumption that the objective function is strongly convex and the set is a compact polytope, as well as an assumption on the location of the optimal solution. A modified version of the away step method, which calculates the away step on a simplex-type representation of the problem, was studied by Jaggi and Lacoste-Julien [16], who were able to show linear convergence for the more general model (P) for the case where \(\mathbf{b}=\mathbf{0}\), without restrictions on the location of the solution. Jaggi [15] proves that the rate of convergence of any method which only adds or removes one atom at a time can be no better than sublinear during the first n iterations, with n being the dimension of the space. Although the original ASCG suggested by Wolfe and the modified ASCG discussed in [16] are closely related, they have some significant differences. Specifically, as stated in recent work of Freund, Grigas and Mazumder [9], in order to analyze the convergence of this new modification of the ASCG algorithm, it is required that the linear oracle produce outputs which reside in a certain finite set of points (e.g., the set of extreme points in the case of a polytope), while the analysis of the original ASCG in [12] does not require such an assumption.
In this paper we will refer to the variation presented in [16] as “the ASCG method”, and call its required oracle a vertex linear-oracle (see the discussion in Sect. 3.1).
Contribution. In this work, our starting point and main motivation are the results of Jaggi and Lacoste-Julien [16]. Our contribution is twofold:
(a) We extend the results given in [16] and show that the ASCG algorithm converges linearly for the general case of problem (P), that is, for any value of \(\mathbf{E}\) and \(\mathbf{b}\).
The additional linear term \(\langle \mathbf{b},\mathbf{x}\rangle \) enables us to consider much more general models. For example, consider the \(l_1\)-regularized least squares problem \(\min _{\mathbf{x}\in S} \{\left\| {\mathbf{B}\mathbf{x}-\mathbf{c}}\right\| _2^2+\lambda \Vert \mathbf{x}\Vert _1\},\) where \(S \subseteq \mathbb {R}^n\) is a compact polyhedral set, \(\mathbf{B}\in \mathbb {R}^{k \times n}, \mathbf{c}\in \mathbb {R}^k\) and \(\lambda >0\). Since S is compact, there exists a constant \(M>0\) for which \(\Vert \mathbf{x}\Vert _1 \le M\) for any optimal solution \(\mathbf{x}\). We can now rewrite the model as
$$\begin{aligned} \min _{\mathbf{x}\in S, \Vert \mathbf{x}\Vert _1 \le y, y \in [0,M]} \left\| {\mathbf{B}\mathbf{x}-\mathbf{c}}\right\| ^2+\lambda y, \end{aligned}$$which obviously fits the general model (P).
(b) We present an analysis based on simple linear programming duality arguments, which is completely different from the geometric arguments in [16]. Consequently, we obtain a computable constant for the rate of convergence, which is explicitly expressed as a function of the problem’s parameters and the geometry of the feasible set. This constant, which we call “the vertex-facet distance constant”, replaces the so-called pyramidal width constant from [16], which reflects the geometry of the feasible set and is obtained as the optimal value of a very complex mixed integer saddle point optimization problem whose exact value is unknown even for simple polyhedral sets.
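The epigraph reformulation used in contribution (a) above, replacing \(\Vert \mathbf{x}\Vert _1\) by a bounded auxiliary variable y, can be checked numerically: for any fixed \(\mathbf{x}\), minimizing \(\lambda y\) over \(y\in [\Vert \mathbf{x}\Vert _1,M]\) recovers \(\lambda \Vert \mathbf{x}\Vert _1\). In the sketch below, the data \(\mathbf{B}\), \(\mathbf{c}\), \(\lambda \) and the sample point are arbitrary illustrative choices, not taken from the paper.

```python
# Sketch: epigraph reformulation of the l1-regularized least-squares problem.
# B, c, lam and the sample point x are arbitrary illustrative choices.

def original_obj(B, c, lam, x):
    # ||Bx - c||^2 + lam * ||x||_1
    r = [sum(B[i][j] * x[j] for j in range(len(x))) - c[i] for i in range(len(B))]
    return sum(ri * ri for ri in r) + lam * sum(abs(xj) for xj in x)

def lifted_obj(B, c, lam, x, y):
    # ||Bx - c||^2 + lam * y, valid whenever y >= ||x||_1
    r = [sum(B[i][j] * x[j] for j in range(len(x))) - c[i] for i in range(len(B))]
    return sum(ri * ri for ri in r) + lam * y

B = [[1.0, 2.0], [0.5, -1.0]]
c = [1.0, 0.0]
lam = 0.3
x = [0.4, -0.2]

# Minimizing the lifted objective over y in [||x||_1, M] is attained at
# y = ||x||_1, recovering the original objective.
y_best = sum(abs(xj) for xj in x)
assert abs(lifted_obj(B, c, lam, x, y_best) - original_obj(B, c, lam, x)) < 1e-12
```

Since the inner minimization over y is always attained at \(y=\Vert \mathbf{x}\Vert _1\), the lifted problem has the same optimal set (in \(\mathbf{x}\)) as the original one.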
Paper outline. The paper is organized as follows. Section 2 presents some preliminary results and definitions needed for the analysis. In particular, it provides a brief introduction to the classical CG algorithm and linear oracles. Section 3 presents the ASCG algorithm and the convergence analysis, and is divided into four subsections. In Sect. 3.1 the concept of vertex linear-oracle, needed for the implementation of ASCG, is presented, and the difficulties of obtaining a vertex linear-oracle on a linear transformation of the feasible set are discussed. In Sect. 3.2 we present the ASCG method with different possible stepsize choices. In Sect. 3.3, we provide the rate of convergence analysis of the ASCG for problem (P), and present the new vertex-facet distance constant used in the analysis. In Sect. 3.4, we demonstrate how to compute this new constant for a few examples of simple polyhedral sets.
Notations. We denote the cardinality of set I by |I|. The difference, union and intersection of two given sets I and J are denoted by \(I{\setminus }J=\left\{ {a\in I: a\notin J}\right\} \), \(I\cup J\) and \(I\cap J\) respectively. Subscript indices represent elements of a vector, while superscript indices represent iterates of the vector, i.e., \(x_i\) is the \(i\hbox {th}\) element of vector \(\mathbf{x}\), \(\mathbf{x}^k\) is a vector at iteration k, and \(x^k_i\) is the \(i\hbox {th}\) element of \(\mathbf{x}^k\). The vector \(\mathbf{e}_i\in \mathbb {R}^n\) is the \(i\hbox {th}\) vector of the standard basis of \(\mathbb {R}^n\), \(\mathbf{0}\in \mathbb {R}^n\) is the all-zeros vector, and \(\mathbf{1}\in \mathbb {R}^n\) is the vector of all ones. Given two vectors \(\mathbf{x},\,\mathbf{y}\in \mathbb {R}^n\), their dot product is denoted by \(\left\langle {\mathbf{x}},{\mathbf{y}}\right\rangle \). Given a matrix \(\mathbf{A}\in \mathbb {R}^{m\times n}\) and vector \(\mathbf{x}\in \mathbb {R}^n\), \(\left\| {\mathbf{A}}\right\| \) denotes the spectral norm of \(\mathbf{A}\), and \(\left\| {\mathbf{x}}\right\| \) denotes the \(\ell _2\)-norm of \(\mathbf{x}\), unless stated otherwise. \(\mathbf{A}^T\), \({{\mathrm{rank}}}(\mathbf{A})\) and \({{\mathrm{Im}}}(\mathbf{A})\) represent the transpose, rank and image of \(\mathbf{A}\) respectively. We denote the \(i\hbox {th}\) row of a given matrix \(\mathbf{A}\) by \(\mathbf{A}_i\), and given a set \(I\subseteq \left\{ {1,\ldots ,m}\right\} \), \(\mathbf{A}_I\in \mathbb {R}^{|I|\times n}\) is the submatrix of \(\mathbf{A}\) such that \((\mathbf{A}_I)_{j}=\mathbf{A}_{I_j}\) for any \(j=1,\ldots , |I|\). If \(\mathbf{A}\) is a symmetric matrix, then \(\lambda _{\min }\left( {\mathbf{A}}\right) \) is its minimal eigenvalue. If a matrix \(\mathbf{A}\) is also invertible, we denote its inverse by \(\mathbf{A}^{-1}\). 
Given matrices \(\mathbf{A}\in \mathbb {R}^{n\times m}\) and \(\mathbf{B}\in \mathbb {R}^{n\times k}\), the matrix \([\mathbf{A},\mathbf{B}]\in \mathbb {R}^{n\times {(m+k)}}\) is their horizontal concatenation. Given a point \(\mathbf{x}\) and a closed convex set X, the distance between \(\mathbf{x}\) and X is denoted by \(d(\mathbf{x},X)=\min _{\mathbf{y}\in X}\left\| {\mathbf{x}-\mathbf{y}}\right\| \). The standard unit simplex in \(\mathbb {R}^n\) is denoted by \(\Delta _n=\left\{ {\mathbf{x}\in \mathbb {R}_+^n: \left\langle {\mathbf{1}},{\mathbf{x}}\right\rangle =1}\right\} \) and its relative interior by \(\Delta ^+_n=\left\{ {\mathbf{x}\in \mathbb {R}_{++}^n: \left\langle {\mathbf{1}},{\mathbf{x}}\right\rangle =1}\right\} \). Given a set \(X\subseteq \mathbb {R}^n\), its convex hull is denoted by \({{\mathrm{conv}}}(X)\). Given a convex set C, the set of all its extreme points is denoted by \({{\mathrm{ext}}}(C)\).
2 Preliminaries
2.1 Mathematical preliminaries
We start by presenting two technical lemmas. The first lemma is the well known descent lemma which is fundamental in convergence rate analysis of first-order methods. The second lemma is Hoffman’s lemma which is used in various error bound analyses over polyhedral sets.
Lemma 2.1
(The Descent Lemma [3, Proposition A.24]) Let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) be a continuously differentiable function whose gradient is Lipschitz continuous with constant \(\rho \). Then for any \(\mathbf{x},\mathbf{y}\in \mathbb {R}^n\) we have
$$\begin{aligned} f(\mathbf{y})\le f(\mathbf{x})+\left\langle {\nabla f(\mathbf{x})},{\mathbf{y}-\mathbf{x}}\right\rangle +\frac{\rho }{2}\left\| {\mathbf{y}-\mathbf{x}}\right\| ^2. \end{aligned}$$
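For a quadratic \(f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x}\), the gradient is Lipschitz continuous with constant \(\rho =\lambda _{\max }(\mathbf{Q})\), and the descent lemma bound \(f(\mathbf{y})\le f(\mathbf{x})+\left\langle {\nabla f(\mathbf{x})},{\mathbf{y}-\mathbf{x}}\right\rangle +\frac{\rho }{2}\left\| {\mathbf{y}-\mathbf{x}}\right\| ^2\) can be verified numerically; \(\mathbf{Q}\) and the sample points below are arbitrary illustrative choices.

```python
# Numerical check of the descent lemma on a quadratic f(x) = 0.5 x^T Q x,
# whose gradient is Lipschitz with constant rho = lambda_max(Q).
# Q and the sample points are arbitrary illustrative choices.
import random

Q = [[2.0, 0.0], [0.0, 1.0]]   # diagonal, so lambda_max = 2
rho = 2.0

def f(x):
    return 0.5 * sum(Q[i][j] * x[i] * x[j] for i in range(2) for j in range(2))

def grad(x):
    return [sum(Q[i][j] * x[j] for j in range(2)) for i in range(2)]

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(2)]
    y = [random.uniform(-5, 5) for _ in range(2)]
    g = grad(x)
    lin = f(x) + sum(g[i] * (y[i] - x[i]) for i in range(2))
    bound = lin + 0.5 * rho * sum((y[i] - x[i]) ** 2 for i in range(2))
    assert f(y) <= bound + 1e-9
```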
Lemma 2.2
(Hoffman’s Lemma [14]) Let X be a polyhedron defined by \(X=\left\{ {\mathbf{x}\in \mathbb {R}^n:\mathbf{A}\mathbf{x}\le \mathbf{a}}\right\} \), for some \(\mathbf{A}\in \mathbb {R}^{m\times n}\) and \(\mathbf{a}\in \mathbb {R}^{m}\), and let \(S=\left\{ {\mathbf{x}\in \mathbb {R}^n:\tilde{\mathbf{E}}\mathbf{x}=\tilde{\mathbf{e}}}\right\} \) where \(\tilde{\mathbf{E}}\in \mathbb {R}^{r\times n}\) and \(\tilde{\mathbf{e}}\in \mathbb {R}^r\). Assume that \(X\cap S\ne \emptyset \). Then, there exists a constant \(\theta \), depending only on \(\mathbf{A}\) and \(\tilde{\mathbf{E}}\), such that any \(\mathbf{x}\in X\) satisfies
$$\begin{aligned} d(\mathbf{x},X\cap S)\le \theta \left\| {\tilde{\mathbf{E}}\mathbf{x}-\tilde{\mathbf{e}}}\right\| . \end{aligned}$$
A complete and simple proof of this lemma is given in [13, pp. 299–301]. Defining \({\mathcal {B}}\) as the set of all matrices constructed by taking linearly independent rows from the matrix \(\left[ \tilde{\mathbf{E}}^T,\mathbf{A}^T\right] ^T\), we can write \(\theta \) as
We will refer to \(\theta \) as the Hoffman constant associated with matrix \(\left[ \tilde{\mathbf{E}}^T,\mathbf{A}^T\right] ^T\).
2.2 Problem properties
Throughout this paper we make the following assumption regarding problem (P).
Assumption 1
(a) f is continuously differentiable and has a Lipschitz continuous gradient with constant \(\rho \).
(b) g is strongly convex with parameter \(\sigma _g\).
(c) X is a nonempty compact polyhedral set given by \(X=\{\mathbf{x}\in \mathbb {R}^n: \mathbf{A}\mathbf{x}\le \mathbf{a}\}\) for some \(\mathbf{A}\in \mathbb {R}^{m\times n}\), \(\mathbf{a}\in \mathbb {R}^m\).
We assume without loss of generality that \(\mathbf{A}\) does not contain a row of zeros. We denote the optimal solution set of problem (P) by \(X^*\). The diameter of the compact set X is denoted by D, and the diameter of the set \(\mathbf{E}X\) (the diameter of the image of X under the linear mapping associated with matrix \(\mathbf{E}\)) by \(D_\mathbf{E}\). The two diameters satisfy the following relation:
$$\begin{aligned} D_\mathbf{E}\le \left\| {\mathbf{E}}\right\| D. \end{aligned}$$
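The relation \(D_\mathbf{E}\le \left\| {\mathbf{E}}\right\| D\) follows directly from \(\left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{y}}\right\| \le \left\| {\mathbf{E}}\right\| \left\| {\mathbf{x}-\mathbf{y}}\right\| \). A small numerical check, for the unit box (an arbitrary illustrative choice) where both diameters are attained at vertices:

```python
# Check of D_E <= ||E|| * D for X = [-1,1]^2 and a 1x2 matrix E.
# Both diameters are attained at vertices, so enumerating vertices suffices.
import itertools, math

E = [1.0, 1.0]                      # E in R^{1x2}; spectral norm = sqrt(2)
norm_E = math.sqrt(sum(e * e for e in E))

verts = list(itertools.product([-1.0, 1.0], repeat=2))
D = max(math.dist(u, v) for u in verts for v in verts)          # = 2*sqrt(2)
images = [sum(E[i] * v[i] for i in range(2)) for v in verts]    # EX = [-2, 2]
D_E = max(abs(s - t) for s in images for t in images)           # = 4

assert D_E <= norm_E * D + 1e-12    # here with equality: 4 <= sqrt(2)*2*sqrt(2)
```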
We define \(G\equiv \max _{\mathbf{x}\in X}\left\| {\nabla g(\mathbf{E}\mathbf{x})}\right\| \) to be the maximal norm of the gradient of g over \(\mathbf{E}X\).
Problem (P) possesses some properties, which we present in the following lemmas.
Lemma 2.3
(Lemma 14, [23]) Let \(X^*\) be the optimal set of problem (P). Then, there exists a constant vector \(\mathbf{t}^*\) and a scalar \(s^*\) such that any optimal solution \(\mathbf{x}^*\in X^*\) satisfies \(\mathbf{E}\mathbf{x}^*=\mathbf{t}^*\) and \(\left\langle {\mathbf{b}},{\mathbf{x}^*}\right\rangle =s^*\).
Although the proof of the lemma in the given reference is for polyhedral sets, the extension to any convex set is trivial.
Lemma 2.4
Let \(f^*\) be the optimal value of problem (P). Then, for any \(\mathbf{x}\in X\)
$$\begin{aligned} f(\mathbf{x})-f^*\le C, \end{aligned}$$
where \(C=GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D\).
Proof
Let \(\mathbf{x}^*\) be some optimal solution of problem (P), so that \(f(\mathbf{x}^*)=f^*\). Then for any \(\mathbf{x}\in X\), it follows from the convexity of f that
$$\begin{aligned} f(\mathbf{x})-f^*&\le \left\langle {\nabla f(\mathbf{x})},{\mathbf{x}-\mathbf{x}^*}\right\rangle =\left\langle {\nabla g(\mathbf{E}\mathbf{x})},{\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\rangle +\left\langle {\mathbf{b}},{\mathbf{x}-\mathbf{x}^*}\right\rangle \\&\le \left\| {\nabla g(\mathbf{E}\mathbf{x})}\right\| \left\| {\mathbf{E}\mathbf{x}-\mathbf{E}\mathbf{x}^*}\right\| +\left\| {\mathbf{b}}\right\| \left\| {\mathbf{x}-\mathbf{x}^*}\right\| \le GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D, \end{aligned}$$
where the last two inequalities are due to the Cauchy-Schwarz inequality and the definitions of G, D and \(D_\mathbf{E}\). \(\square \)
The following lemma provides an error bound, i.e., a bound on the distance of any feasible solution to the optimal set. This error bound will later be used as an alternative to a strong convexity assumption on f, which is usually needed in order to prove a linear rate of convergence. This bound is different from the one given in [23], since it relies heavily on the compactness of the set X, thus enabling us to circumvent the use of the so-called gradient mapping.
Lemma 2.5
For any \(\mathbf{x}\in X\),
$$\begin{aligned} d(\mathbf{x},X^*)^2\le \kappa \left( f(\mathbf{x})-f^*\right) , \end{aligned}$$
where \(\kappa =\theta ^2 \left( \left\| {\mathbf{b}}\right\| D+3GD_\mathbf{E}+\frac{2(G^2+1)}{\sigma _g}\right) \), and \(\theta \) is the Hoffman constant associated with the matrix \(\left[ \mathbf{A}^T,\mathbf{E}^T,\mathbf{b}\right] ^T\).
Proof
Lemma 2.3 implies that the optimal solution set \(X^*\) can be defined as \(X^*=X\cap S\) where \(S=\left\{ {\mathbf{x}\in \mathbb {R}^n: \mathbf{E}\mathbf{x}=\mathbf{t}^*,\;\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle =s^*}\right\} \) for some \(\mathbf{t}^* \in \mathbb {R}^m\) and \(s^* \in \mathbb {R}\). For any \(\mathbf{x}\in X\), applying Lemma 2.2 with \(\tilde{\mathbf{E}}=\left[ {\mathbf{E}}^T,\mathbf{b}\right] ^T\), we have that
where \(\theta \) is the Hoffman constant associated with matrix \(\left[ \mathbf{A}^T,\mathbf{E}^T,\mathbf{b}\right] ^T\). Now, let \(\mathbf{x}\in X\) and \(\mathbf{x}^* \in X^*\). Utilizing the \(\sigma _g\)-strong convexity of g, it follows that
By the first-order optimality conditions for problem (P), we have (recalling that \(\mathbf{x}\in X\) and \(\mathbf{x}^*\in X^*\))
Therefore,
Now, using (2.2) we can continue (2.4) to obtain
We are left with the task of upper bounding \((\left\langle {\mathbf{b}},{\mathbf{x}}\right\rangle -s^*)^2\). By the definitions of \(s^*\) and f we have that
Therefore, using (2.3), (2.6) as well as the Cauchy-Schwarz inequality, we can conclude the following:
On the other hand, exploiting (2.6), the convexity of f and the Cauchy-Schwarz inequality, we also have that
Combining (2.7), (2.8), and the fact that \(f(\mathbf{x})-f^*\ge 0\), we obtain that
Moreover, the definitions of G and \(D_\mathbf{E}\) imply \(\left\| {\nabla g(\mathbf{t}^*)}\right\| \le G\), \(\left\| {\mathbf{E}\mathbf{x}-\mathbf{t}^*}\right\| \le D_\mathbf{E}\), and since \(\mathbf{x}\in X\), it follows from Lemma 2.4 that \(f(\mathbf{x})-f^*\le C=GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D\). Utilizing these bounds, as well as (2.5) to bound (2.9) results in
Plugging (2.5) and (2.10) back into (2.1), we obtain the desired result:
\(\square \)
2.3 Conditional gradient and linear oracles
In order to present the CG algorithm, we first define the concept of linear oracles.
Definition 2.1
(Linear Oracle) Given a set X, an operator \({\mathcal {O}}_X:\mathbb {R}^n\rightarrow X\) is called a linear oracle for X, if for each \(\mathbf{c}\in \mathbb {R}^n\) it returns a vector \(\mathbf{p}\in X\) such that \(\left\langle {\mathbf{c}},{\mathbf{p}}\right\rangle \le \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) for any \(\mathbf{x}\in X\), i.e., \(\mathbf{p}\) is a minimizer of the linear function \(\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) over X.
Linear oracles are black-box type functions, where the actual algorithm used in order to obtain the minimizer is unknown. For many feasible sets, such as \(\ell _p\) balls and specific polyhedral sets, the oracle can be represented by a closed-form solution or can be computed by an efficient method.
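For instance, over the unit simplex a linear oracle may return a standard basis vector of minimal cost, and over the \(\ell _1\) ball a signed basis vector of maximal-magnitude cost. The following sketch illustrates both closed forms (tie-breaking by lowest index is our arbitrary choice); note that both oracles in fact return vertices:

```python
# Closed-form linear oracles for two simple sets (a sketch; tie-breaking
# by lowest index is an arbitrary choice).

def oracle_simplex(c):
    # argmin <c, x> over the unit simplex is the vertex e_i with i = argmin c_i
    i = min(range(len(c)), key=lambda j: c[j])
    return [1.0 if j == i else 0.0 for j in range(len(c))]

def oracle_l1_ball(c):
    # argmin <c, x> over the l1 ball is +/- e_i with i = argmax |c_i|
    i = max(range(len(c)), key=lambda j: abs(c[j]))
    return [(-1.0 if c[i] > 0 else 1.0) if j == i else 0.0 for j in range(len(c))]

assert oracle_simplex([3.0, -1.0, 2.0]) == [0.0, 1.0, 0.0]
assert oracle_l1_ball([3.0, -1.0, 2.0]) == [-1.0, 0.0, 0.0]
```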
The CG algorithm and its variants are linear-oracle based algorithms. The original CG algorithm, presented in [8]—also known as the Frank-Wolfe algorithm — is as follows.
The algorithm is guaranteed to have an \(O(\tfrac{1}{k})\) rate of convergence for stepsizes determined according to exact line search [8], an adaptive stepsize [18], which has a closed form when the smoothness constant of the objective function is known, or a predetermined stepsize [6]. This upper bound on the rate of convergence is tight [5], and therefore variants, such as the ASCG, were developed.
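To make the oracle-based structure concrete, the following is a minimal sketch (not the paper's pseudocode) of a CG loop with exact line search for the quadratic \(f(\mathbf{x})=\left\| {\mathbf{x}-\mathbf{t}}\right\| ^2\) over the unit simplex; the data \(\mathbf{t}\) is an arbitrary illustrative choice.

```python
# Minimal conditional gradient (Frank-Wolfe) loop on the unit simplex for
# f(x) = ||x - t||^2, with exact line search (closed form for quadratics).
# The target t is an arbitrary illustrative choice.

def cg(t, iters=500):
    n = len(t)
    x = [1.0 / n] * n                         # start at the barycenter
    for _ in range(iters):
        g = [2.0 * (x[i] - t[i]) for i in range(n)]
        i_min = min(range(n), key=lambda j: g[j])
        p = [1.0 if j == i_min else 0.0 for j in range(n)]   # linear-oracle output
        d = [p[j] - x[j] for j in range(n)]
        dd = sum(dj * dj for dj in d)
        if dd < 1e-16:                        # vertex already reached
            break
        # exact line search for a quadratic: minimize f(x + gamma*d), gamma in [0,1]
        gamma = max(0.0, min(1.0, -sum(g[j] * d[j] for j in range(n)) / (2.0 * dd)))
        x = [x[j] + gamma * d[j] for j in range(n)]
    return x

x = cg([2.0, 0.0, 0.0])
# the minimizer over the simplex is the vertex e_1
assert abs(x[0] - 1.0) < 1e-9 and abs(x[1]) < 1e-9 and abs(x[2]) < 1e-9
```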
3 Away steps conditional gradient
The ASCG algorithm was proposed by Wolfe in [24]. A linear convergence rate was proven in [12] for problems consisting of minimizing a strongly convex objective function over a polyhedral feasible set, under some restrictions on the location of the optimal solution. Jaggi and Lacoste-Julien [16] suggested a variation on the original ASCG algorithm, which assumes that a finite representation of the current point is maintained throughout the algorithm, and this representation is used to calculate the away step. The authors then showed that under this modification, it is possible to prove a linear convergence result without any restriction on the optimal solution’s location; this result is also applicable to the specific case of problem (P) where \(\mathbf{b}=\mathbf{0}\) [or more generally \(\mathbf{b}\in {{\mathrm{Im}}}(\mathbf{E})\)], provided that an appropriate linear oracle is available for the set \(\mathbf{E}X\). From this point on, “the ASCG method” refers to this latter variant; a thorough explanation of the difference between the two algorithms can be found in [9]. In this section, we extend this result to the general case of problem (P), i.e., to any \(\mathbf{E}\) and \(\mathbf{b}\). Furthermore, we explore the potential issues with obtaining a linear oracle for the set \(\mathbf{E}X\), and suggest an alternative analysis, which only assumes the existence of an appropriate linear oracle on the original set X. Moreover, our analysis differs from the one presented in [16] in that it is based on duality rather than geometric arguments. This approach enables us to derive a computable constant for the rate of convergence, which is explicitly expressed as a function of the problem’s parameters and the geometry of the feasible set.
We separate the discussion on the ASCG method into four sections. In Sect. 3.1 we define the concept of vertex linear oracles, and the issues of obtaining such an oracle for linear transformations of simple sets. Section 3.2 contains a full description of the ASCG method itself, including the concept of vertex representation, and representation reduction. In Sect. 3.3 we present the rate of convergence analysis of the ASCG method for problem (P), as well as introduce the new computable convergence constant \(\Omega _X\). Finally, in Sect. 3.4 we demonstrate how to compute \(\Omega _X\) for three types of simple sets.
3.1 Vertex linear oracles
In order to prove convergence, the ASCG algorithm requires a linear oracle whose output lies within a finite set of points in X; more specifically, in the case of a polytope, it is natural to assume a vertex linear oracle, a concept that we now define explicitly.
Definition 3.1
(Vertex Linear Oracle) Given a polyhedral set X with vertex set V, a linear oracle \(\tilde{{\mathcal {O}}}_X:\mathbb {R}^n\rightarrow V\) is called a vertex linear oracle for X, if for each \(\mathbf{c}\in \mathbb {R}^n\) it returns a vertex \(\mathbf{p}\in V\) such that \(\left\langle {\mathbf{c}},{\mathbf{p}}\right\rangle \le \left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) for any \(\mathbf{x}\in X\).
Notice that, according to the fundamental theorem of linear programming [4, Theorem 2.7], the problem of optimizing any linear objective function over the compact set X always has an optimal solution which is a vertex. Therefore, the vertex linear oracle \(\tilde{\mathcal {O}}_X\) is well defined. We also note that in this paper the term “vertex” is synonymous with the term “extreme point”.
In [16], Jaggi and Lacoste-Julien proved that both the CG and the ASCG algorithms are affine invariant. This means that given the problem
where g is a strongly convex function and \(\mathbf{E}\) is some matrix, applying the ASCG algorithm on the equivalent problem
where \(Y=\mathbf{E}X\), yields a linear rate of convergence, which depends only on the strong convexity parameter of g and the geometry of the set Y (regardless of what \(\mathbf{E}\) generated it). However, assuming that \(\mathbf{E}\) is not of a full column rank, i.e., f is not strongly convex, retrieving an optimal solution \(\mathbf{x}^*\in X\) from the optimal solution \(\mathbf{y}^*\in Y\) requires solving a linear feasibility problem, where we aim to find \(\mathbf{x}^*\in X\) such that \(\mathbf{E}\mathbf{x}^*=\mathbf{y}^*\). This feasibility problem is equivalent to solving the following constrained least squares problem:
$$\begin{aligned} \min _{\mathbf{x}\in X}\left\| {\mathbf{E}\mathbf{x}-\mathbf{y}^*}\right\| ^2, \end{aligned}$$
in the sense that the optimal set of the above problem comprises all vectors satisfying \(\mathbf{x}^*\in X, \mathbf{E}\mathbf{x}^*=\mathbf{y}^*\). For a general \(\mathbf{E}\), solving the least squares problem might be more computationally expensive than simply applying the linear oracle on set X. Moreover, the entire algorithm is then applied on the transformed set \(Y=\mathbf{E}X\), which might be geometrically more complicated than the set X, or may lie in a higher dimension.
It is interesting to note that the vertex oracle property is not preserved when changing the representation. As an example, let us take a naive approach to construct a general linear oracle \({\mathcal {O}}_{\mathbf{E}X}\), given \(\tilde{\mathcal {O}}_X\), by the formula
$$\begin{aligned} {\mathcal {O}}_{\mathbf{E}X}(\mathbf{c})=\mathbf{E}\tilde{\mathcal {O}}_X(\mathbf{E}^T\mathbf{c}). \qquad \mathrm {(3.3)} \end{aligned}$$
However, the output \(\tilde{\mathbf{p}}={\mathcal {O}}_{\mathbf{E}X}(\mathbf{c})\) of this linear oracle is not guaranteed to be a vertex of \(\mathbf{E}X\) and incurs an additional computational cost associated with the matrix vector multiplications. As an example, take X to be the unit box in three dimensions, \(X=[-1,1]^3\subseteq \mathbb {R}^3\), and let \(\mathbf{E}\) be given by
We denote the vertex set V of the set X by the letters A–H as follows:
and the linear mappings of these vertices by the matrix \(\mathbf{E}\) by A’–H’:
The vertex set of \(\mathbf{E}X\) is \({{\mathrm{ext}}}(\mathbf{E}X)=\{A',B',F',H'\}\).
The sets X and \(\mathbf{E}X\) are presented in Fig. 1. Notice that finding a vertex linear oracle for X is trivial, while finding one for \(\mathbf{E}X\) is not. In particular, a vertex linear oracle for X may be given by any operator \(\tilde{\mathcal {O}}_X(\cdot )\) satisfying
Given the vector \(\mathbf{c}=(-1,\;1,\;3)^T\), we want to find
Using the naive approach, described in (3.3), we obtain a vertex of X by applying the vertex linear oracle \(\tilde{\mathcal {O}}_X\) described in (3.4) with parameter \(\mathbf{E}^T\mathbf{c}=(0,\; 0 ,\; 1)\), which may return either one of the vertices B, C, G or H. If vertex C is returned, then its mapping C’ does not yield a vertex in \(\mathbf{E}X\).
We aim to show that given a vertex linear oracle for X, the ASCG algorithm converges in a linear rate for the more general problem (P). Since in our analysis we do not assume the existence of a linear oracle for \(\mathbf{E}X\), but rather a vertex linear oracle for X, the computational cost per iteration related to the linear oracle is independent of the matrix \(\mathbf{E}\), and depends only on the geometry of X.
3.2 The ASCG method
We will now present the ASCG algorithm. In the following we denote the vertex set of X as \(V={{\mathrm{ext}}}(X)\). Moreover, as part of the ASCG algorithm, at each iteration k the iterate \(\mathbf{x}^k\) is represented as a convex combination of points in V. Specifically, \(\mathbf{x}^k\) is assumed to have the representation
$$\begin{aligned} \mathbf{x}^k=\sum _{\mathbf{v}\in V}\mu ^k_\mathbf{v}\mathbf{v}, \end{aligned}$$
where \({\varvec{\mu }}^k\in \Delta _{|V|}\). Let \(U^k=\left\{ {\mathbf{v}\in V:\mu ^k_\mathbf{v}>0}\right\} \); then \(U^k\) and \(\left\{ {\mu ^k_{\mathbf{v}}}\right\} _{\mathbf{v}\in U^k}\) provide a compact representation of \(\mathbf{x}^k\), and \(\mathbf{x}^k\) lies in the relative interior of the set \({{\mathrm{conv}}}(U^k)\). Throughout the algorithm we update \(U^k\) and \({\varvec{\mu }}^k\) via the vertex representation updating (VRU) scheme. The ASCG method has two types of updates: a forward step, used in the classical CG algorithm, where a vertex is added to the representation, and an away step, unique to this algorithm, in which the coefficient of one of the vertices used in the representation is reduced or even nullified. Specifically, the away step uses the direction \((\mathbf{x}^k-\mathbf{u}^k)\) where \(\mathbf{u}^k\in U^k\) and stepsize \(\gamma ^k>0\) so that
$$\begin{aligned} \mathbf{x}^{k+1}=\mathbf{x}^k+\gamma ^k(\mathbf{x}^k-\mathbf{u}^k), \end{aligned}$$
and so \(\mu ^{k+1}_{\mathbf{u}^k}=\mu ^k_{\mathbf{u}^k}-\gamma ^k\left( 1-\mu ^k_{\mathbf{u}^k}\right) <\mu ^k_{\mathbf{u}^k}\). Moreover, if \(\gamma ^k=\frac{\mu ^k_{\mathbf{u}^k}}{1-\mu ^k_{\mathbf{u}^k}}\), then \(\mu ^{k+1}_{\mathbf{u}^k}\) is nullified, and consequently, the vertex \(\mathbf{u}^k\) is removed from the representation. This vertex removal is referred to as a drop step.
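The away-step coefficient update described above can be sketched as follows; the dictionary-based representation and the numerical values are our own illustrative choices. Every coefficient is scaled by \((1+\gamma )\), the away vertex's coefficient becomes \(\mu _{\mathbf{u}}-\gamma (1-\mu _{\mathbf{u}})\), and the maximal stepsize \(\gamma =\mu _{\mathbf{u}}/(1-\mu _{\mathbf{u}})\) nullifies it (a drop step):

```python
# Away-step coefficient update: with direction (x - u) and stepsize gamma,
# the away vertex's coefficient becomes mu_u - gamma*(1 - mu_u), while every
# other coefficient is scaled by (1 + gamma).  Illustrative numbers only.

def away_update(mu, u, gamma):
    new = {v: (1.0 + gamma) * w for v, w in mu.items()}
    new[u] = mu[u] - gamma * (1.0 - mu[u])
    if new[u] <= 1e-12:          # drop step: remove u from the representation
        del new[u]
    return new

mu = {'A': 0.5, 'B': 0.3, 'C': 0.2}
# the maximal away stepsize gamma = mu_C / (1 - mu_C) nullifies mu_C
gamma_max = mu['C'] / (1.0 - mu['C'])
new = away_update(mu, 'C', gamma_max)
assert 'C' not in new                          # drop step occurred
assert abs(sum(new.values()) - 1.0) < 1e-12    # still a convex combination
```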
The full description of the ASCG algorithm and the VRU scheme is given as follows.
The stepsize in the ASCG algorithm can be chosen according to one of the following stepsize selection rules, where \(\mathbf{d}^k\) and \(\overline{\gamma }^k\) are as defined in the algorithm.
Remark 3.1
It is simple to show that under the above two choices of stepsize strategies, the sequence of function values \(\{f(\mathbf{x}^k)\}_{k \ge 1}\) is nonincreasing.
Since the convergence rate analyses for both of these stepsize options are similar, we chose to conduct a unified analysis for both cases. The exact definition of the VRU procedure follows.
The VRU scheme uses a representation reduction procedure \({\mathcal {R}}\) with constant N, which is a procedure that takes a representation \((U,{\varvec{\mu }})\) of a point \(\mathbf{x}\) and replaces it by a representation \((\tilde{U},\tilde{{\varvec{\mu }}})\) of \(\mathbf{x}\) such that \(\tilde{U}\subseteq U\) and \(|\tilde{U}|\le N\). We consider two possible options for the representation reduction procedure:
1. \({\mathcal {R}}\) is the trivial procedure, meaning it does not change the representation, in which case its constant is \(N=|V|\).
2. The procedure \({\mathcal {R}}\) is some implementation of the Carathéodory theorem [22, Sect. 17], in which case its constant is \(N=n+1\). Although this choice does not influence the convergence results, dealing with more compact representations reduces the storage space needed for the algorithm, as well as the cost of stage 2 of the ASCG algorithm. This is especially significant when the number of vertices is not polynomial in the problem’s dimension. A full description of the incremental representation reduction (IRR) scheme, which applies the Carathéodory theorem efficiently in this context, is presented in the appendix.
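A possible Carathéodory-based reduction can be sketched as follows (this is our own illustrative implementation, not the IRR scheme from the appendix, and the plain Gaussian elimination used here is not numerically robust). It repeatedly finds an affine dependence among the current vertices and shifts weight along it until some weight vanishes, leaving the represented point unchanged:

```python
# Sketch of Caratheodory reduction: given a convex-combination representation
# of a point in R^n with more than n+1 vertices, repeatedly find an affine
# dependence among the vertices and shift weight along it until one weight
# hits zero.  Illustrative only; not numerically robust.

def null_vector(rows, k):
    # returns a nonzero solution lam (length k) of rows * lam = 0;
    # assumes k exceeds the number of independent rows.
    m = [r[:] for r in rows]
    piv = {}                                   # pivot row -> pivot column
    r = 0
    for col in range(k):
        pr = next((i for i in range(r, len(m)) if abs(m[i][col]) > 1e-10), None)
        if pr is None:
            continue
        m[r], m[pr] = m[pr], m[r]
        p = m[r][col]
        m[r] = [v / p for v in m[r]]
        for i in range(len(m)):
            if i != r and abs(m[i][col]) > 1e-14:
                m[i] = [a - m[i][col] * b for a, b in zip(m[i], m[r])]
        piv[r] = col
        r += 1
    free = next(c for c in range(k) if c not in piv.values())
    lam = [0.0] * k
    lam[free] = 1.0
    for row, col in piv.items():
        lam[col] = -m[row][free]
    return lam

def caratheodory_reduce(verts, mu, n):
    verts, mu = list(verts), list(mu)
    while len(verts) > n + 1:
        # affine dependence: sum lam_j * v_j = 0 and sum lam_j = 0
        rows = [[verts[j][i] for j in range(len(verts))] for i in range(n)]
        rows.append([1.0] * len(verts))
        lam = null_vector(rows, len(verts))
        if max(lam) <= 0:
            lam = [-l for l in lam]
        # largest shift keeping all weights nonnegative; one weight hits zero
        t = min(mu[j] / lam[j] for j in range(len(verts)) if lam[j] > 1e-12)
        mu = [mu[j] - t * lam[j] for j in range(len(verts))]
        keep = [j for j in range(len(verts)) if mu[j] > 1e-12]
        verts = [verts[j] for j in keep]
        mu = [mu[j] for j in keep]
    return verts, mu

# the origin as a uniform combination of the 4 vertices of [-1,1]^2
square = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
verts, mu = caratheodory_reduce(square, [0.25] * 4, n=2)
assert len(verts) <= 3
point = [sum(mu[j] * verts[j][i] for j in range(len(verts))) for i in range(2)]
assert max(abs(p) for p in point) < 1e-9      # still represents the origin
```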
3.3 Rate of convergence analysis
We will now prove the linear rate of convergence for the ASCG algorithm for problem (P). In the following we use \(I(\mathbf{x})\) to denote the index set of the active constraints at \(\mathbf{x}\),
$$\begin{aligned} I(\mathbf{x})=\left\{ {i\in \left\{ {1,\ldots ,m}\right\} :\mathbf{A}_i\mathbf{x}=a_i}\right\} . \end{aligned}$$
Similarly, for a given set U, the set of constraints active at all the points in U is defined as
$$\begin{aligned} I(U)=\bigcap _{\mathbf{x}\in U}I(\mathbf{x}). \end{aligned}$$
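Both index sets are straightforward to compute; in the sketch below, reading “the set of active constraints for all the points in U” as the intersection \(\bigcap _{\mathbf{x}\in U}I(\mathbf{x})\) is our interpretation, and the unit box data are arbitrary illustrative choices:

```python
# Active-set computation for X = {x : Ax <= a}; here X is the unit box
# [-1,1]^2 written with four inequality constraints (illustrative data).

A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
a = [1.0, 1.0, 1.0, 1.0]

def active(x, tol=1e-9):
    # I(x) = { i : A_i x = a_i }
    return {i for i in range(len(A))
            if abs(sum(A[i][j] * x[j] for j in range(len(x))) - a[i]) <= tol}

def active_common(U):
    # I(U): constraints active at every point of U (intersection over U)
    s = None
    for x in U:
        s = active(x) if s is None else (s & active(x))
    return s

assert active([1.0, 1.0]) == {0, 2}                      # both upper bounds tight
assert active_common([[1.0, 1.0], [1.0, -1.0]]) == {0}   # only x1 <= 1 is shared
```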
We present the following technical lemma, which is similar to a result presented by Jaggi and Lacoste-Julien [16]. In [16] the proof is based on geometrical considerations, and utilizes the so-called “pyramidal width constant”, which is the optimal value of a complicated optimization problem. In contrast, the proof below relies on simple linear programming duality arguments, and in addition, the derived constant \(\Omega _X\), which replaces the pyramidal width constant, is computable for many choices of the set X.
Lemma 3.1
Let \(U\subseteq V\) and \(\mathbf{c}\in \mathbb {R}^n\). If there exists \(\mathbf{z}\in \mathbb {R}^n\) such that \(\mathbf{A}_{I(U)}\mathbf{z}\le 0\) and \(\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle >0\), then
where
Proof
By the fundamental theorem of linear programming [11], we can maximize the function \(\left\langle {\mathbf{c}},{\mathbf{x}}\right\rangle \) on X instead of on V and get the same optimal value. Similarly, we can minimize the function \(\left\langle {\mathbf{c}},{\mathbf{y}}\right\rangle \) on \({{\mathrm{conv}}}(U)\) instead of on U, and obtain the same optimal value. Therefore,
Since X is nonempty and bounded, the problem in \(\mathbf{x}\) is feasible and bounded above. Therefore, by strong duality for linear programming,
Plugging (3.8) back into (3.7) we obtain:
Let \(\tilde{\mathbf{y}}=\frac{1}{|U|}\sum _{\mathbf{v}\in U}\mathbf{v}\). Then by the Carathéodory theorem there exists a set \(\tilde{U}\subseteq U\) with cardinality \(|\tilde{U}|\le n+1\) such that \(\tilde{\mathbf{y}}\in {{\mathrm{conv}}}(\tilde{U})\). We will prove that \(I(\tilde{U}) = I(U)\). Since \(\tilde{U}\subseteq U\) we have \(I(U)\subseteq I(\tilde{U})\). To prove the reverse inclusion, let \(i\notin I(U)\); then it follows from the definition of \(\tilde{\mathbf{y}}\) that
and so, since \(\tilde{\mathbf{y}}\in {{\mathrm{conv}}}(\tilde{U})\), there exists \({\varvec{\lambda }}\ge \mathbf{0}\) where \(\mathbf{1}^T {\varvec{\lambda }}=1\) such that \(\tilde{\mathbf{y}} = \sum _{\mathbf{u}\in \tilde{U}} \lambda _{\mathbf{u}}\mathbf{u}\); consequently,
which implies that \(i \notin I(\tilde{U})\). Thus, \(I(\tilde{U})\subseteq I(U)\), establishing the fact that \(I(\tilde{U}) = I(U)\). We define \(\bar{\mathbf{y}}=\frac{1}{|\tilde{U}|}\sum _{\mathbf{v}\in \tilde{U}}\mathbf{v}\). Since \(\overline{\mathbf{y}}\in {{\mathrm{conv}}}(\tilde{U}) \subseteq {{\mathrm{conv}}}(U)\), we have that
for any value of \({\varvec{\eta }}\), and therefore,
Using strong duality on the RHS of (3.10), we obtain that
Denote \(J=I(\tilde{U})\) and \(\overline{J}=\left\{ {1,\ldots ,m}\right\} {\setminus } J\). From the definition of \(I(\tilde{U})\), it follows that
for all \(\mathbf{v}\in \tilde{U}\), and that for any \(i\in \overline{J}\) there exists at least one vertex \(\mathbf{v}\in \tilde{U}\) such that \(a_i-\mathbf{A}_i\mathbf{v}>0\), and hence,
which in particular implies that
We denote \(\overline{\mathbf{A}_i}=\mathbf{A}_i/\left\| {\mathbf{A}_i}\right\| \) and \(\overline{a}_i=a_i/\left\| {\mathbf{A}_i}\right\| \), and similarly, \(\bar{\mathbf{a}}_J\) denotes the vector comprising the components \(\overline{a}_i,i \in J\) and \(\overline{\mathbf{A}}_J\) stands for the matrix whose rows are \(\overline{\mathbf{A}}_i, i \in J\). From the choice of \(\overline{\mathbf{y}}\), we can conclude from (3.12) and (3.13) that
Therefore, replacing the RHS of the set of inequalities \(\mathbf{A}\mathbf{x}\le (\mathbf{a}-\mathbf{A}\overline{\mathbf{y}})\) in (3.11) by the bounds given in (3.14), we obtain that
Combining (3.9),(3.10), (3.11) and (3.15) it follows that
where
We will now show that \(\mathbf{z}\) cannot satisfy \(\mathbf{A}_{\overline{J}}\mathbf{z}\le 0\) (recall that \(J=I(U)=I(\tilde{U})\)). Suppose by contradiction that \(\mathbf{z}\) does satisfy \(\mathbf{A}_{\overline{J}}\mathbf{z}\le 0\). Then \({\mathbf{x}}_{\alpha }=\alpha \mathbf{z}\) is a feasible solution of problem (3.17) for any \(\alpha >0\), and since \(\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle >0\) we obtain that \(\left\langle {\mathbf{c}},{{\mathbf{x}}_{\alpha }}\right\rangle \rightarrow \infty \) as \(\alpha \rightarrow \infty \), and thus \(Z^*=\infty \). However, since V contains a finite number of points, the LHS of (3.16) is bounded from above, and so \(Z^*<\infty \), which is a contradiction. Therefore, there exists \(i\in \overline{J}\) such that \(\mathbf{A}_i\mathbf{z}>0\). Since \(\mathbf{z}\ne \mathbf{0}\), the vector \(\overline{\mathbf{x}}=\frac{\mathbf{z}}{\left\| {\mathbf{z}}\right\| }\frac{\Omega _X}{|\tilde{U}|}\) is well defined. Moreover, \(\overline{\mathbf{x}}\) satisfies
and (recalling that \(\Vert \overline{\mathbf{A}}_i\Vert =1\))
where the inequality follows from the Cauchy-Schwarz inequality. Consequently, (3.18) and (3.19) imply that \(\overline{\mathbf{x}}\) is a feasible solution for problem (3.17). Therefore, \(Z^*\ge \left\langle {\mathbf{c}},{\overline{\mathbf{x}}}\right\rangle \), which by (3.16) yields
where the last inequality utilizes the fact that \(|\tilde{U}|\le n+1\). \(\square \)
The constant \(\Omega _X\) represents a normalized minimal distance between the hyperplanes that contain facets of X and the vertices of X which do not lie on those hyperplanes. We will refer to \(\Omega _X\) as the vertex-facet distance of X. Examples for the derivation of \(\Omega _X\) for some simple polyhedral sets can be found in Sect. 3.4.
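Since \(\Omega _X\) is a finite minimum over vertex-facet pairs, it can be evaluated directly once \(\mathbf{A}\), \(\mathbf{a}\) and V are at hand. The following is a small numerical sketch of ours (the function name and interface are illustrative; we assume the normalized-slack form of definition (3.6) described verbally above):

```python
import numpy as np

def vertex_facet_distance(A, a, V, tol=1e-12):
    # Normalized slack of every vertex V[j] with respect to every facet
    # inequality A[i] @ x <= a[i]; zero slack means the vertex lies on
    # the hyperplane containing that facet.
    norms = np.linalg.norm(A, axis=1)
    slacks = (a[:, None] - A @ V.T) / norms[:, None]
    # Omega_X: smallest slack among vertices NOT lying on the hyperplane.
    return slacks[slacks > tol].min()

# Example: the cube [-1, 1]^3, for which Omega_X = 2 (cf. Sect. 3.4).
n = 3
A = np.vstack([np.eye(n), -np.eye(n)])
a = np.ones(2 * n)
V = np.array(np.meshgrid(*([[-1.0, 1.0]] * n))).reshape(n, -1).T
print(vertex_facet_distance(A, a, V))  # 2.0
```

Note that this brute-force evaluation enumerates all vertex-facet pairs, which is adequate for the simple sets of Sect. 3.4 but not for polytopes with exponentially many vertices.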
The following lemma is a technical result stating that the active constraints at a given point are the same as the active constraints of the set of vertices in its compact representation.
Lemma 3.2
Let \(\mathbf{x}\in X\) and the set \(U\subseteq V\) satisfy \(\mathbf{x}=\sum _{\mathbf{v}\in U} \mu _\mathbf{v}\mathbf{v}\), where \({\varvec{\mu }}\in \Delta ^+_{|U|}\). Then \(I(\mathbf{x})=I(U)\).
Proof
It is trivially true that \(I(U)\subseteq I(\mathbf{x})\) since \(\mathbf{x}\) is a convex combination of points in the affine space defined by \(\left\{ {\mathbf{y}:\mathbf{A}_{I(U)}\mathbf{y}=\mathbf{a}_{I(U)}}\right\} \). We will prove that \(I(\mathbf{x})\subseteq I(U)\). Any \(\mathbf{v}\in U\subseteq X\) satisfies \(\mathbf{A}_{I(\mathbf{x})}\mathbf{v}\le \mathbf{a}_{I(\mathbf{x})}\). Assume to the contrary that there exists \(i\in I(\mathbf{x})\) such that some \(\mathbf{u}\in U\) satisfies \(\mathbf{A}_i\mathbf{u}<a_i\). Since \(\mu _\mathbf{u}>0\) and \(\sum _{\mathbf{v}\in U} \mu _\mathbf{v}=1\), it follows that
in contradiction to the assumption that \(i\in I(\mathbf{x})\). \(\square \)
Corollary 3.1
For any \(\mathbf{x}\in X{\setminus } X^*\) which can be represented as \(\mathbf{x}=\sum _{\mathbf{v}\in U} \mu _\mathbf{v}\mathbf{v}\) for some \({\varvec{\mu }}\in \Delta ^+_{|U|}\) and \(U\subseteq V\), it holds that
Proof
For any \(\mathbf{x}\in X{\setminus } X^*\) define \(\mathbf{c}=-\nabla f(\mathbf{x})\). It follows from Lemma 3.2 that \(I(U)=I(\mathbf{x})\). For any \(\mathbf{x}^*\in X^*\), the vector \(\mathbf{z}=\mathbf{x}^*-\mathbf{x}\) satisfies
and, from the convexity of f, as well as the optimality of \(\mathbf{x}^*\), \(\left\langle {\mathbf{c}},{\mathbf{z}}\right\rangle =-\left\langle {\nabla f(\mathbf{x})},{\mathbf{x}^*-\mathbf{x}}\right\rangle \ge f(\mathbf{x})-f(\mathbf{x}^*)>0\). Therefore, invoking Lemma 3.1 yields the desired result. \(\square \)
We now present the main theorem of this section, which establishes the linear rate of convergence of the ASCG method for solving problem (P). This theorem is an extension of [16, Theorem 7], and the proof follows the same general arguments, while incorporating the use of the error bound from Lemma 2.5 and the new constant \(\Omega _X\).
Theorem 3.1
Let \(\{\mathbf{x}^k\}_{k \ge 1}\) be the sequence generated by the ASCG algorithm for solving problem (P), and let \(f^*\) be the optimal value of problem (P). Then for any \(k\ge 1\)
where
\(\kappa =\theta ^2 \left( \left\| {\mathbf{b}}\right\| D+3GD_\mathbf{E}+\frac{2(G^2+1)}{\sigma _g}\right) \) with \(\theta \) being the Hoffman constant associated with the matrix \(\left[ \mathbf{A}^T,\mathbf{E}^T,\mathbf{b}\right] ^T\), \(C=GD_\mathbf{E}+\left\| {\mathbf{b}}\right\| D\), and \(\Omega _X\) is the vertex-facet distance of X given in (3.6).
Proof
For each k we will denote the stepsize generated by exact line search as \(\gamma _e^k\) and the adaptive stepsize as \(\gamma _a^k\). Then
From Lemma 2.1 (the descent lemma), we have that
Assume that \(\mathbf{x}^k\notin X^*\). Then for any \(\mathbf{x}^*\in X^*\) we have that
where the first equality is derived from the algorithm’s specific choice of \(\mathbf{d}^k\), the third line follows from the fact that \(\mathbf{p}^k=\tilde{\mathcal {O}}_X(\nabla f(\mathbf{x}^k))\), and the fourth line follows from the convexity of f. In particular, \(\mathbf{d}^k\ne \mathbf{0}\), and by (3.5), \(\gamma _a^k\) is equal to
We now separate the analysis into three cases: (a) \(\mathbf{d}^k=\mathbf{p}^k-\mathbf{x}^k\) and \(\gamma _a^k=\overline{\gamma }^k\), (b) \(\mathbf{d}^k=\mathbf{x}^k-\mathbf{u}^k\) and \(\gamma _a^k=\overline{\gamma }^k\), and (c) \(\gamma _a^k<\overline{\gamma }^k\).
In cases (a) and (b), it follows from (3.25) that
Using inequalities (3.22), (3.23) and (3.26), we obtain
Subtracting \(f^*\) from both sides of the inequality and using (3.24), we have that
In case (a), \(\overline{\gamma }^k=1\), and hence
In case (b), we have no positive lower bound on \(\overline{\gamma }^k\), and therefore we can only conclude, by the nonnegativity of \(\overline{\gamma }^k\), that
However, case (b) is a drop step, meaning in particular that \(|U^{k+1}|\le |U^{k}|-1\), since before applying the representation reduction procedure \({\mathcal {R}}\), we eliminate one of the vertices in the representation of \(\mathbf{x}^{k}\). Denoting the number of drop steps until iteration k as \(s^k\), and the number of forward steps until iteration k as \(l^k\), it follows from the algorithm’s definition that \(l^k+s^k\le k-1\) (at each iteration we add a vertex, remove a vertex, or neither) and \(s^k\le l^k\) (the number of removed vertices cannot exceed the number of added vertices), and therefore \(s^k\le (k-1)/2\).
We now turn to case (c). In this case, (3.25) implies
which combined with (3.22) and (3.23) results in
From the algorithm’s specific choice of \(\mathbf{d}^k\), we obtain that
Applying the bound in (3.30) and the inequality \(\left\| {\mathbf{d}^k}\right\| \le D\) to (3.29), it follows that
By the definitions of \(\mathbf{u}^k\) and \(\mathbf{p}^k\), Corollary 3.1 implies that for any \(\mathbf{x}^*\in X^*\),
Lemma 2.5 implies that there exists \(\mathbf{x}^*\in X^*\) such that \(\Vert {\mathbf{x}^k-\mathbf{x}^*}\Vert ^2\le \kappa (f(\mathbf{x}^k)-f^*)\), which combined with convexity of f, bounds (3.32) from below as follows:
which along with (3.31) yields
Therefore, if either of the cases (a) or (c) occurs, then by (3.28) and (3.33), it follows that
where \(\alpha ^{\dag }\) is defined in (3.21). We can therefore conclude from cases (a)–(c) that until iteration k we have at least \(\frac{k-1}{2}\) iterations for which (3.34) holds, and therefore
Applying Lemma 2.4 for \(\mathbf{x}=\mathbf{x}^1\) we obtain \(f(\mathbf{x}^1)-f^*\le C\), and the desired result (3.20) follows. \(\square \)
Remark 3.2
Notice that while the Hoffman constant enables us to conduct the analysis despite the existence of the linear term in the objective function, the resulting convergence result is no longer affine invariant. This is a direct consequence of Lemma 2.5, which is needed in order to deal with the linear part of the objective.
3.4 Examples of computing the vertex-facet distance \(\Omega _X\)
In this section, we demonstrate how to compute the vertex-facet distance constant \(\Omega _X\) for a few simple polyhedral sets. We consider three sets: the unit simplex, the \(\ell _1\) ball and the \(\ell _\infty \) ball. We first describe each of the sets as a system of linear inequalities of the form \(X=\left\{ {\mathbf{x}:\mathbf{A}\mathbf{x}\le \mathbf{a}}\right\} \). Then, given the parameters \(\mathbf{A}\) and \(\mathbf{a}\), as well as the vertex set V, \(\Omega _X\) can be computed by its definition, given by (3.6). For the unit simplex and the \(\ell _\infty \) ball we also compare the constant that arises from our analysis and the pyramidal width constant presented in [16], with values presented more recently in [17], and discuss both tightness and ease of computation of each.
The \({\ell _\infty }\) ball. The \(\ell _\infty \) ball is represented by
The set of extreme points is given by \(V=\left\{ {-1, 1 }\right\} ^n\), which in particular implies that \(|V|=2^n\). Therefore, for large-scale problems, using the representation reduction procedure, which is based on the Carathéodory theorem, is crucial in order to obtain a practical implementation.
From the definition of \(\mathbf{A}\) and V, it follows that
and so
When comparing this result with the pyramidal width of the cube, which by [17] equals \({2}/{\sqrt{n}}\) for this cube of side length two, recall that the quantity to compare against the pyramidal width is \({\Omega _X}/{n}=2/n\), since the latter is a lower bound on the former. Thus, we lose a factor of \(\sqrt{n}\) by using the looser bound \(\Omega _X\). However, the calculation of the pyramidal width in [17] relies strongly on geometrical considerations (e.g., the symmetry of the cube) and on properties of the algorithm (the forward step being the maximizer over the chosen direction), while the computation of \(\Omega _X\) is straightforward.
The unit simplex. The unit simplex \(\Delta _n\) can be represented by
The set of extreme points is given by \(V=\left\{ {\mathbf{e}_i}\right\} _{i=1}^n\). Notice that since there are only n extreme points which are all affinely independent, using a rank reduction procedure which implements the Carathéodory theorem is the same as applying the trivial procedure that does not change the representation. In order to calculate \(\Omega _X\), we first note that \(I(V)=\{n+1,n+2\}\), and therefore
and
Comparing this result with the pyramidal width of the unit simplex, which by [17] is given by \({2}/{\sqrt{n}}\), we obtain \({\Omega _X}/{n}=1/n\), so again we lose a factor of \(\sqrt{n}\) by using the looser bound \(\Omega _X\). The proof in [17] relies on a geometrical result concerning the width of the simplex, which is not easily derived, while the computation of \(\Omega _X\) is a direct consequence of V being the standard basis of \(\mathbb {R}^n\).
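As a numerical sanity check of ours (assuming the normalized-slack form of definition (3.6)), the value \(\Omega _X=1\) for the unit simplex can be recovered directly from the inequality representation above: the equality constraints \(\pm \mathbf{1}^T\mathbf{x}\le \pm 1\) are active at every vertex, so only the nonnegativity constraints contribute positive slacks, each equal to 1.

```python
import numpy as np

# Numerical check of Omega_X = 1 for the unit simplex (our sketch, using
# the normalized-slack form of definition (3.6)).
n = 4
A = np.vstack([-np.eye(n), np.ones((1, n)), -np.ones((1, n))])
a = np.concatenate([np.zeros(n), [1.0], [-1.0]])
V = np.eye(n)                                    # vertices e_1, ..., e_n
norms = np.linalg.norm(A, axis=1)
slacks = (a[:, None] - A @ V.T) / norms[:, None]
print(slacks[slacks > 1e-12].min())  # 1.0
```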
The \({\ell _1}\) ball. The \(\ell _1\) ball is given by the set
Therefore \(\mathbf{a}=\mathbf{1}\in \mathbb {R}^{2^n}\) and each row of the matrix \(\mathbf{A}\in \mathbb {R}^{2^n\times n}\) is a vector in \(\left\{ {-1,1}\right\} ^n\). The set of extreme points is given by \(V=\left\{ {\mathbf{e}_i}\right\} _{i=1}^n\bigcup \left\{ {-\mathbf{e}_i}\right\} _{i=1}^n\), and therefore has cardinality of \(|V|=2n\).
Finally, we have that
and so
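The vertex-facet distance of the \(\ell _1\) ball can likewise be checked numerically (our sketch, again assuming the normalized-slack form of (3.6)): each facet normal is a sign vector of norm \(\sqrt{n}\), so every positive normalized slack equals \(2/\sqrt{n}\), which is consistent with the derivation above.

```python
import numpy as np
from itertools import product

# Numerical check of Omega_X for the l1 ball (our sketch, following the
# normalized-slack form of definition (3.6)).
n = 3
A = np.array(list(product([-1.0, 1.0], repeat=n)))  # 2^n facet normals
a = np.ones(len(A))
V = np.vstack([np.eye(n), -np.eye(n)])              # the 2n extreme points
slacks = (a[:, None] - A @ V.T) / np.sqrt(n)        # every ||A_i|| = sqrt(n)
omega = slacks[slacks > 1e-12].min()
print(np.isclose(omega, 2 / np.sqrt(n)))  # True
```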
The pyramidal width of the \(\ell _1\) ball is obviously smaller than that of the simplex, and it is difficult to calculate via the approach suggested in [17].
A recent work of Peña and Rodríguez [20], which appeared after the submission of this paper, showed that the pyramidal width is equal to a quantity referred to as the facial distance of the polytope. This quantity is actually a refinement of the vertex-facet distance, since it calculates the distance between any face of the polytope and the convex hull of the vertices not contained in that face. The vertex-facet distance, in contrast, is the distance from a vertex to the hyperplane containing a single polytope facet, disregarding the additional information about the polytope. Therefore, the bound generated by the vertex-facet distance can be arbitrarily smaller than the facial distance. However, the combinatorial cost of computing the vertex-facet distance is generally linear in the number of vertices and facets, while computing the facial distance generally requires a computation that is exponential in the number of vertices (equivalently, the number of polytope faces).
References
Beck, A., Teboulle, M.: A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods of Oper. Res. 59(2), 235–247 (2004)
Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In: Palomar, D., Eldar, Y. (eds.) Convex Optimization in Signal Processing and Communications, pp. 139–162. Cambridge University Press, Cambridge (2009)
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont, MA, USA (1999)
Bertsimas, D., Tsitsiklis, J.N.: Introduction to Linear Optimization, vol. 6. Athena Scientific, Belmont (1997)
Canon, M.D., Cullum, C.D.: A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM J. Control 6(4), 509–516 (1968)
Dunn, J., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)
Epelman, M., Freund, R.M.: Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Math. Program. 88(3), 451–485 (2000)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Quart. 3(1–2), 95–110 (1956)
Freund, R.M., Grigas, P., Mazumder, R.: An extended Frank-Wolfe method with “in-face” directions, and its application to low-rank matrix completion. arXiv preprint; arXiv:1511.02204 (2015)
Garber, D., Hazan, E.: A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint; arXiv:1301.4666 (2013)
Goldfarb, D., Todd, M.J.: Chapter II: Linear programming. In: Nemhauser, G., Kan, A.R., Todd, M. (eds.) Optimization, volume 1 of Handbooks in Operations Research and Management Science, pp. 73–170. Elsevier, Amsterdam (1989)
Guelat, J., Marcotte, P.: Some comments on Wolfe’s away step. Math. Program. 35(1), 110–119 (1986)
Güler, O.: Foundations of Optimization. Graduate Texts in Mathematics, vol. 258. Springer, New York (2010)
Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49(4), 263–265 (1952)
Jaggi, M.: Sparse Convex Optimization Methods for Machine Learning. Ph.D. thesis, ETH Zurich (2011)
Lacoste-Julien, S., Jaggi, M.: An affine invariant linear convergence analysis for Frank-Wolfe algorithms. In: NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends (2014)
Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank-Wolfe optimization variants. In: Advances in Neural Information Processing Systems, pp. 496–504 (2015)
Levitin, E., Polyak, B.T.: Constrained minimization methods. USSR Comput. Math. Math. Phys. 6(5), 787–823 (1966)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)
Pena, J., Rodriguez, D.: Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. arXiv preprint; arXiv:1512.06142 (2015)
Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46–47(1), 157–178 (1993)
Rockafellar, R.T.: Convex Analysis, 2nd edn. Princeton University Press, Princeton (1970)
Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)
Wolfe, P.: Chapter 1: Convergence theory in nonlinear programming. In: Abadie, J. (ed.) Integer and Nonlinear Programming, pp. 1–36. North-Holland Publishing Company, Amsterdam (1970)
Acknowledgments
The research of Amir Beck was partially supported by the Israel Science Foundation grant 1821/16.
Appendix: Incremental representation reduction using the Carathéodory theorem
In this section we show a way to efficiently and incrementally implement the constructive proof of the Carathéodory theorem, as part of the VRU scheme, at each iteration of the ASCG algorithm. We note that this reduction procedure does not have to be employed, and instead the trivial procedure, which does not change the representation, can be used. In that case, the upper bound on the number of extreme points in the representation is just the number of extreme points of the feasible set X.
The implementation described in this section will allow maintaining a vertex representation set \(U^k\), with cardinality of at most \(n+1\), at a computational cost of \(O(n^2)\) operations per iteration. For this purpose, we assume that at the beginning of iteration k, \(\mathbf{x}^{k}\) has a representation with vertex set \(U^{k}=\left\{ {\mathbf{v}^1,\ldots ,\mathbf{v}^{L}}\right\} \subseteq V\), such that the vectors in the set are affinely independent. Moreover, we assume that at the beginning of iteration k, we have at our disposal two matrices \(\mathbf{T}^k\in \mathbb {R}^{n\times n}\) and \({\mathbf{W}}^k\in \mathbb {R}^{n\times (L-1)}\). We define \(\mathbf{V}^k\in \mathbb {R}^{n\times (L-1)}\) to be the matrix whose ith column is the vector \(\mathbf{w}^i=\mathbf{v}^{i+1}-\mathbf{v}^1\) for \(i=1, \ldots , L-1\), where \(\mathbf{v}^1\) is called the reference vertex. The matrix \(\mathbf{T}^k\) is a product of elementary matrices, which ensures that the matrix \({\mathbf{W}}^k=\mathbf{T}^k\mathbf{V}^k\) is in row echelon form. The implementation does not require to save the matrix \(\mathbf{V}^k\), and so at each iteration, only the matrices \(\mathbf{T}^k\) and \(\mathbf{W}^k\) are updated.
Let \(U^{k+1}\) be the vertex set and let \({\varvec{\mu }}^{k+1}\) be the coefficients vector at the end of iteration k, before applying the rank reduction procedure. Updating the matrices \(\mathbf{W}^{k+1}\) and \(\mathbf{T}^{k+1}\), as well as \(U^{k+1}\) and \({\varvec{\mu }}^{k+1}\), is done according to the following Incremental Representation Reduction scheme, which is partially based on the proof of Carathéodory theorem presented in [22, Sect. 17].
Notice that in order to compute the row rank of the matrix \(\mathbf{W}^{k+1}\) in step 6(d), we may simply convert the matrix to row echelon form, and then count the number of nonzero rows. This is done similarly to step 7, and requires ranking of at most one column. We will need to rerank the matrix in step 7 only if \(L>M\), and subsequently at least one column is removed in step 6(e)vi.
The IRR scheme may reduce the size of the input \(U^{k+1}\) only in the case of a forward step, since otherwise the vertices in \(U^{k+1}\) are all affinely independent. Nonetheless, the IRR scheme must be applied at each iteration in order to maintain the matrices \({\mathbf{W}}^k\) and \(\mathbf{T}^k\).
The efficiency of the scheme relies on the fact that only a small number of vertices are either added to or removed from the representation. The potentially computationally expensive steps are: step 5(b)—replacing the reference vertex, step 6(d)—finding the row rank of \(\mathbf{W}^{k+1}\), step 6(e)i—solving the system of linear equalities, step 6(e)vi—removing the columns corresponding to the vertices eliminated from the representation, and step 7—the ranking of the resulting matrix \({\mathbf{W}}^{k+1}\). Step 5(b) can be implemented without explicitly using matrix multiplication and therefore has a computational cost of \(O(n^2)\). Since \({\mathbf{W}}^k\) was in row echelon form, step 6(d) requires a row elimination procedure, similar to step 7, to be conducted only on the last column of \({\mathbf{W}}^{k+1}\), which involves at most O(n) operations and an additional \(O(n^2)\) operations for updating \(\mathbf{T}^{k+1}\). Moreover, since \({\mathbf{W}}^k\) had full column rank, the IRR scheme guarantees that the system in step 6(e)i has a unique solution \({\varvec{\lambda }}\), and since \({\mathbf{W}}^{k+1}\) is in row echelon form, it can be found in \(O(n^2)\) operations. Moreover, in step 6(e)i, the specific choice of \(\alpha \) ensures that the reference vertex \(\mathbf{v}^1\) is not eliminated from the representation, and so there is no need to change the reference vertex at this stage. Furthermore, it is reasonable to assume that the set I satisfies \(|I|=O(1)\), since otherwise the vector \(\mathbf{x}^{k+1}\), produced by a forward step, can be represented by significantly fewer vertices than \(\mathbf{x}^k\), which, although possible, is numerically unlikely. Therefore, assuming that indeed \(|I|=O(1)\), the matrix \(\tilde{\mathbf{T}}\), calculated in step 7, applies a row elimination procedure to at most O(1) rows (one for each column removed from \(\mathbf{W}^{k+1}\)) or one column (if a column was added to \(\mathbf{W}^{k+1}\)).
Conducting such an elimination on either row or column takes at most \(O(n^2)\) operations, which may include row switching and at most n row addition and multiplication. Therefore, the total computational cost of the IRR scheme amounts to \(O(n^2)\).
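For intuition, the effect of the reduction, ignoring the incremental bookkeeping with \(\mathbf{T}^k\) and \(\mathbf{W}^k\), can be sketched in a plain, non-incremental form (our illustration, not the IRR scheme itself): repeatedly find an affine dependence among the current vertices and use it to zero out one weight, as in the constructive proof of the Carathéodory theorem.

```python
import numpy as np

def caratheodory_reduce(U, mu, tol=1e-10):
    """Non-incremental sketch (illustrative, not the paper's IRR scheme):
    shrink a representation x = sum_i mu[i] * U[i] to at most n + 1
    affinely independent vertices while keeping x unchanged."""
    U, mu = U.astype(float).copy(), mu.astype(float).copy()
    n = U.shape[1]
    while len(mu) > n + 1:
        # Affine dependence: lam != 0 with U.T @ lam = 0 and sum(lam) = 0.
        M = np.vstack([U.T, np.ones(len(mu))])
        lam = np.linalg.svd(M)[2][-1]       # (near-)null-space vector of M
        if lam.max() <= tol:
            lam = -lam                      # ensure some positive entry
        # Largest step keeping mu - alpha * lam >= 0 zeroes out one weight.
        pos = lam > tol
        alpha = (mu[pos] / lam[pos]).min()
        mu -= alpha * lam
        keep = mu > tol
        U, mu = U[keep], mu[keep]
        mu /= mu.sum()                      # guard against round-off drift
    return U, mu

# Example: four square vertices representing the center (0.5, 0.5).
U = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mu = np.full(4, 0.25)
U2, mu2 = caratheodory_reduce(U, mu)
print(len(mu2), U2.T @ mu2)  # at most 3 vertices, same represented point
```

Unlike the IRR scheme, each pass here recomputes an SVD, so its per-iteration cost exceeds \(O(n^2)\); it only illustrates the invariants the scheme maintains (nonnegative weights summing to one, and the represented point unchanged).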
Cite this article
Beck, A., Shtern, S. Linearly convergent away-step conditional gradient for non-strongly convex functions. Math. Program. 164, 1–27 (2017). https://doi.org/10.1007/s10107-016-1069-4