1 Introduction

In this article we present a robust and efficient algorithm called the All Vertex Triangle Algorithm (AVTA). Given an input set \(S= \{v_1, \dots , v_n\} \subset \mathbb {R} ^m\), on the one hand, without any assumptions, AVTA computes an approximation to the subset \(\overline{S}=\{\overline{v}_1, \dots , \overline{v}_K\}\) of all vertices of conv(S) so that the convex hull of the approximate subset of vertices is as close to conv(S) as desired. Specifically, given any \(t \in (0,1)\), AVTA computes a subset \(\overline{S}^t\) of \(\overline{S}\) so that the distance from any point \(p \in conv(S)\) to \(conv(\overline{S}^t)\) is at most tR, where R is the diameter of S. On the other hand, under the assumption that we are given a constant \(\gamma \le \varGamma _*/R\), where \(\varGamma _*\) is the minimum of the distances from each vertex to the convex hull of the remaining vertices, AVTA can compute all of \(\overline{S}\). Furthermore, assuming that instead of S the input is an \(\varepsilon \)-perturbation \(S_\varepsilon \) of S, where \(\Vert v_i - v^{\varepsilon }_i \Vert \le \varepsilon R\), AVTA can compute an approximation to the convex hull of the perturbed vertex set \(\overline{S}_\varepsilon \) to any prescribed accuracy.

AVTA, a fully polynomial-time approximation scheme, builds upon the Triangle Algorithm (Kalantari 2015), designed to solve the convex hull membership problem. Specifically, given S, the Triangle Algorithm tests if a distinguished point p lies in conv(S), either by computing a point \(p_\varepsilon \in conv(S)\) within a prescribed distance of p, or by computing a hyperplane that separates p from conv(S). Before describing AVTA and its applications we wish to give an overview of the related problems and research, as well as their history, significance and connections to our work.

The convex hull membership problem is a basic problem in computational geometry and a very special case of the convex hull problem, see Toth et al. (2004). Besides being a fundamental problem in computational geometry, it is a basic problem in linear programming (LP). In fact LP with integer coefficients can be reduced to a convex hull membership problem. Furthermore, the two most famous polynomial-time LP algorithms, the ellipsoid algorithm of Khachiyan (1980) and the projective algorithm of Karmarkar (1984), are in fact explicitly or implicitly designed to solve the convex hull membership problem when \(p=0\), see Jin and Kalantari (2006). Moreover, an approach suggested by Jin and Kalantari (2006) shows a direct connection between general LP feasibility and this homogeneous case of the convex hull membership problem.

An important problem in computational geometry and machine learning is the irredundancy problem, the problem of computing all the vertices of conv(S), see Toth et al. (2004). Clearly, any algorithm for LP feasibility can be used to solve the irredundancy problem by solving a sequence of O(n) convex hull membership problems. For results that reduce the number of linear programming problems, see e.g. Clarkson (1994) and Chan (1996a, b). Some applications require the description of conv(S) in terms of its vertices, facets and adjacencies, see Chazelle (1993). The complexity of many exact algorithms for the irredundancy of a finite point set is exponential in the dimension of the points, making them practical only in very low dimensions. On the other hand, the convex hull membership problem by itself has been studied in the context of large scale applications where the simplex method or polynomial-time algorithms are too expensive to run. Thus approximation schemes have been studied for the problem.

Not only is convex hull vertex detection a fundamental problem in computational geometry, but state-of-the-art algorithms for many machine learning problems also rely on being able to solve this problem efficiently. Consider for instance the problem of non-negative matrix factorization (NMF) (Lee and Seung 2001). Here, given access to a data matrix A, we want to compute non-negative, low rank matrices U and V such that \(A=UV\). Although in general this problem is intractable, recent results show that under a natural separability assumption, see Donoho and Stodden (2003), such a factorization can be computed efficiently, see Arora et al. (2012a). The key insight in these works is that under the separability assumption, the rows of the matrix V will appear among the rows of A. Furthermore, the rows of V will be the vertices of the convex hull of the rows of A. Hence, a fast algorithm for detecting the vertices will lead to a fast factorization algorithm as well.
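To make this connection concrete, the following is a minimal sketch in Python (NumPy/SciPy; all function names are ours) of separable NMF via vertex detection. The greedy successive-projection step below is a simple stand-in for a full vertex-detection algorithm such as AVTA, not the specific method of Arora et al. (2012a):

```python
import numpy as np
from scipy.optimize import nnls

def separable_nmf(A, k):
    """Under separability the k rows of V appear among the rows of A as the
    vertices of the convex hull of A's rows; find them, then recover U >= 0."""
    # Greedy vertex detection (successive projection): the row of largest
    # norm is a hull vertex; deflate its direction and repeat.
    Ares = A.astype(float).copy()
    anchors = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(Ares, axis=1)))
        anchors.append(j)
        q = Ares[j] / np.linalg.norm(Ares[j])
        Ares -= np.outer(Ares @ q, q)        # project onto q's complement
    V = A[anchors]
    # Each row of A is a nonnegative combination of the anchor rows.
    U = np.vstack([nnls(V.T, A[i])[0] for i in range(A.shape[0])])
    return U, V, anchors

# Tiny usage check on a synthetic separable instance.
rng = np.random.default_rng(0)
V_true = rng.random((3, 8))
U_true = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=5)])
A = U_true @ V_true                          # rows 0-2 are the vertices
U, V, anchors = separable_nmf(A, 3)
print(sorted(anchors), np.allclose(U @ V, A, atol=1e-8))  # expect True
```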

The organization of the article is as follows. In Sect. 2 we review related work on the irredundancy problem. In Sect. 3 we describe the high level idea of the algorithm and introduce the main results. In Sect. 4, we describe the Triangle Algorithm, the convex hull membership oracle used throughout the article. Its efficient implementation is described in the “Appendix”. In Sect. 5, we describe the All Vertex Triangle Algorithm (AVTA) for computing all vertices of the convex hull of a given finite set of points S. We discuss several applications, in particular to solving the convex hull membership problem itself. Other applications are described in subsequent sections. In Sect. 6, we consider the performance of AVTA under perturbation of data. In Sect. 7, we consider AVTA with Johnson–Lindenstrauss projections. Furthermore, we consider the performance of AVTA under perturbation of data with Johnson–Lindenstrauss projections.

2 Related work

Convex hull membership query oracles The convex hull membership problem can be formulated as the minimization of a convex quadratic function over the unit simplex. This particular convex program finds applications in statistics, approximation theory, and machine learning, see e.g. Clarkson (2010) and Zhang (2003), who consider the analysis of a greedy algorithm for minimizing smooth convex functions over the unit simplex. The algorithm of Frank and Wolfe (1956) is a classic greedy algorithm for convex programming. When the convex hull of a set of points does not contain the origin, the problem of computing the point in the convex hull with least norm, known as the polytope distance problem, is also of interest. In some applications the polytope distance refers to the distance between two convex hulls, a fundamental problem in machine learning, known as SVM, see e.g. Burges (1998). The algorithm of Gilbert (1966) for the polytope distance problem is one of the earliest known algorithms. Gärtner and Jaggi (2009) show that Gilbert’s algorithm coincides with the Frank–Wolfe algorithm when applied to the minimization of a convex quadratic function over the unit simplex. In this case the algorithm is known as sparse greedy approximation. For many results regarding applications of the minimization of a quadratic function over a simplex, see Zhang (2003), Clarkson (2010) and Gärtner and Jaggi (2009). Clarkson (2010) analyzes the Frank–Wolfe algorithm and its variations while studying the notion of coresets. While the Triangle Algorithm has features that are very similar to those of the Frank–Wolfe algorithm, other features and properties make it distinct from the Frank–Wolfe and Gilbert algorithms. To describe these differences, consider the distance between p and conv(S): \( \varDelta = \min \{ d(p',p) \equiv \Vert p'-p \Vert : \quad p' \in conv(S) \}=d(p_*,p). \)

Clearly, \(p \not \in conv(S)\) if and only if \(\varDelta >0\). The goal of the convex hull membership problem (equivalently an LP feasibility problem) is to test feasibility, i.e. whether p lies in conv(S). Solving this does not require the computation of \( \varDelta \) when it is positive. Thus the goal of solving the convex hull membership problem is different from that of computing the distance \(\varDelta \) when positive. When \(p \in conv(S)\), the analysis of the complexity of the Triangle Algorithm is essentially identical to Clarkson’s (2010) analysis of the basic Frank–Wolfe algorithm. Gärtner and Jaggi (2009), on the other hand, analyze the complexity of Gilbert’s algorithm for the polytope distance problem, i.e. the approximation of \(\varDelta \), however under the assumption that \(\varDelta >0\). Gärtner and Jaggi (2009) do not address the case when \(\varDelta =0\).

What distinguishes the Triangle Algorithm from the Frank–Wolfe and Gilbert algorithms is its distance dualities, which give more flexibility to the algorithm. The algorithm we analyze in this article, namely AVTA, is designed to generate the vertices of conv(S). It makes repeated use of the distance dualities of the Triangle Algorithm, resulting in an overall efficient algorithm for computing the vertices of conv(S), or very good approximations to these vertices, even under perturbation of the input set. Indeed AVTA is a testament to the uniqueness of the Triangle Algorithm, while itself being a nontrivial extension of it. AVTA finds many applications in computational geometry and machine learning. Some of these are demonstrated here theoretically and computationally. We next describe AVTA in more detail.

Irredundancy problem in machine learning A problem related to the irredundancy problem is known as topic modeling (Blei 2012). Here one is given access to a large corpus of documents, with each document represented as a long vector consisting of the frequency in the document of every word in the vocabulary. This is known as the bag-of-words representation. Each document is assumed to represent a mixture of up to K hidden topics. A popular generative model for such documents is the following: For every document d, a K dimensional vector \(\theta _d\) is drawn from a distribution over the simplex. Typically this distribution is the Dirichlet distribution. Then, for each word in the document, a topic is chosen according to \(\theta _d\). Finally, given a chosen topic i, a word is output according to the topic distribution vector \(\beta _i\). This is known as the Latent Dirichlet Allocation (LDA) model (Blei et al. 2003). The parameters of this model consist of the topic-word matrix \(\beta \) so that \(\beta _i\) defines the distribution over words for topic i. Additionally, there are hyperparameters associated with the Dirichlet distributions generating the topic distribution vector \(\theta _d\). The topic modeling problem concerns learning the topic-word matrix \(\beta \) and the parameters of the topic generating distribution. Similar to NMF, the problem is intractable in the worst case but can be efficiently solved under separability (Arora et al. 2012b). In this context, the separability assumption requires that for each topic i, there exists an anchor word that has a non-zero probability of occurring only under topic i. Separability is an assumption that is known to hold for real world documents (Arora et al. 2012b). The key component towards learning the model parameters is a fast algorithm for finding the anchor words. The algorithm of Arora et al. (2012b, 2013) uses the word-word covariance matrix and shows that under separability, the vertices of the convex hull of the rows of the matrix will correspond to the anchor words. Similarly, the work of Ding et al. (2013) shows that finding the vertices of the convex hull of the document-word matrix will also lead to detection of anchor words. Both approaches rely on a vertex detection subroutine. Furthermore, in the case of topic models, the documents are limited in size and this means that one is given a perturbation of the set S. The goal is to use this perturbed set to approximate the original vertices \(\overline{S}\). Hence in this application it is crucial that the approach to finding the vertices be robust to noise. Following the setting of Arora et al. (2013) we analyze AVTA under the perturbed irredundancy problem. We show AVTA is able to find the ‘robust’ vertices with guarantees related to the geometric properties of the given set.

Approximation version of irredundancy problem In addition to the perturbed irredundancy problem, Blum et al. (2016) consider the approximation version of the irredundancy problem, where a bi-criterion algorithm based on a Nearest Neighbor Oracle, called Greedy Clustering, is proposed for computing a subset of vertices T satisfying two properties: (i) the Hausdorff distance between conv(T) and conv(S) is bounded above by \((8 t + t^2) R\); (ii) \(|T| =O(K_{opt}/t^2)\), where \(K_{opt}\) is the cardinality of the smallest set of points which attains t approximation quality. Since \( T \subset S\), this implies that \(t = \varOmega ((K_{opt}/n)^{1/2})\). The running time of the algorithm is \(O\bigg (\frac{nK_{opt}}{t^2} \bigg (m + \frac{K_{opt}}{t^8} + \frac{K^2_{opt}}{t^4} \bigg ) \bigg ).\) While there is a theoretical bound on the size of T as a polynomial in 1/t, the algorithm is inefficient since it uses the Nearest Neighbor Oracle. Indeed, in AVTA, the Triangle Algorithm works as an approximate oracle, which achieves a great improvement in efficiency. Given that \(\gamma _* = \varGamma _*/R\) is \(\varOmega (t)\), we cannot use fewer than \(|\overline{S}|\) vertices to give a t approximation. This argument shows that in Blum et al. (2016) \(K_{opt}=|\overline{S}|\). In the general case where \(\gamma _*\) is arbitrarily close to 0, AVTA will find all vertices in \(O(nK_{t}(m+\frac{1}{t^2}))\) time, where \(K_t=|\overline{S}^t|\). While we so far have no nontrivial bound on \(K_{t}\), it is known that \(K_{t}\le n\). In this case the complexity of AVTA is \(O(n^2m+ n^2/t^{2})\), while Greedy Clustering requires at least \(O(nm/t^{2}+ n/t^{10})\) to achieve the same accuracy. It can be concluded that there exist regimes where AVTA outperforms Greedy Clustering. It is interesting to observe that AVTA could be used as a pre-processing algorithm for Greedy Clustering. By our analysis, AVTA only detects vertices and will not omit any of them. In case \(n \gg K_{t}\), we can use AVTA to delete points inside the convex hull and thus reduce the size of the problem for Greedy Clustering. In summary, the two algorithms complement each other.

3 Main results

In this section we describe the high level idea and introduce the main results about AVTA. To describe AVTA we need to define some parameters. We say conv(S) is \(\varGamma _*\)-robust, if \(\varGamma _*\) is the minimum of the distances from each \(\overline{v}_i \in \overline{S}\) to \(conv(\overline{S} {\setminus } \{\overline{v}_i\})\). Set \(R= \max \{d(v_i,v_j): v_i, v_j \in S\}\), the diameter of S. AVTA works as follows:

[AVTA pseudocode box]
  (1)

    Given any \(t \in (0,1)\), AVTA can compute a subset \(\overline{S}^t\) of \(\overline{S}\) so that the distance of each point in conv(S) to \(conv(\overline{S}^t)\) is at most tR. The corresponding number of operations is

    $$\begin{aligned} O(nK^{(t)}(m+ t^{-2})), \quad K^{(t)}= |\overline{S}^t|. \end{aligned}$$
    (1)
  (2)

    If a number \(0 < \gamma \le \varGamma _*/R\) is known, the number of operations of AVTA to compute \(\overline{S}\) is

    $$\begin{aligned} O(nK(m+ \gamma ^{-2})). \end{aligned}$$
    (2)
  (3)

    If only K is known, the number of operations of AVTA to compute \(\overline{S}\) is

    $$\begin{aligned} O(nK(m+ \gamma _*^{-2}) \log \gamma _*^{-1}). \end{aligned}$$
    (3)

In practice the input set may not be S but a perturbation of it, \(S_\varepsilon =\{v^{\varepsilon }_1, \dots , v^{\varepsilon }_n\}\), where \(\Vert v_i - v^{\varepsilon }_i \Vert \le \varepsilon R\). The set of perturbed vertices, \(\overline{S}_\varepsilon = \{ \overline{v}_1^{\varepsilon }, \dots , \overline{v}_K^{\varepsilon }\}\), may differ considerably from the set of actual vertices of \(conv(S_\varepsilon )\). Under a mild assumption on \(\varepsilon \), AVTA computes \(\overline{S}_\varepsilon = \{ \overline{v}_1^{\varepsilon }, \dots , \overline{v}_K^{\varepsilon }\}\). More generally, given any \(t \in (0,1)\), AVTA computes a subset \(\overline{S}_\varepsilon ^t\) of \(\overline{S}_\varepsilon \) so that the distance from any \(p \in conv(S)\) to \(conv(\overline{S}_\varepsilon ^t)\) is at most \((t+\varepsilon ) R\). The complexity of AVTA for this variation of the problem is analogous to the unperturbed case, however it makes use of a weaker parameter. We say conv(S) is \(\varSigma _*\)-weakly robust, if \(\varSigma _*\) is the minimum of the distances of each vertex in S to the convex hull of all the remaining points in S. In Fig. 1 we show a simple example where \(\varGamma _*\) and \(\varSigma _*\) are shown for a set of eight points.

Fig. 1

\(\varGamma _*=\varGamma _1\) and \(\varSigma _* =\varSigma _2\)

We first prove that when \(\sigma _*=\varSigma _*/R \ge 4 \varepsilon \), \(\overline{S}_\varepsilon \) is a subset of the vertices of \(conv(S_\varepsilon )\) and \(conv(S_\varepsilon )\) is at least \(\varSigma _*/2\)-weakly robust. Using this, we prove:

  (i)

    Given any \(t \in (0,1)\), AVTA can compute a subset \(\overline{S}_\varepsilon ^t\) of \(\overline{S}_\varepsilon \) so that the distance from each p in conv(S) to \(conv(\overline{S}_\varepsilon ^t)\) is at most \((t+ \varepsilon ) R\). The corresponding number of operations is

    $$\begin{aligned} O(nK_\varepsilon ^t(m+ t^{-2})), \quad K_\varepsilon ^t= | \overline{S}_\varepsilon ^t|. \end{aligned}$$
    (4)
  (ii)

    If \(\sigma \le \sigma _*=\varSigma _*/R\) is known to satisfy \(4\varepsilon \le \sigma \), the number of operations of AVTA to compute \(\overline{S}_\varepsilon \) is

    $$\begin{aligned} O(nK_\varepsilon (m+ \sigma ^{-2})), \end{aligned}$$
    (5)

where \(K_\varepsilon \) is at most the cardinality of the set of vertices of \(conv(S_\varepsilon )\).

Clearly \(\varGamma _ * \ge \varSigma _*\), however we prove

$$\begin{aligned} \varSigma _* \ge \frac{\varGamma _*}{R} \rho _*, \end{aligned}$$
(6)

where \(\rho _*\) is the minimum distance between distinct pairs of points in S. This allows deriving a lower bound on \(\varSigma _*\) from a known lower bound on \(\varGamma _*\). Thus we can alternatively write

  (iii)

    If \(\gamma \le \gamma _*=\varGamma _*/R\) is known satisfying \(4 \varepsilon \le \gamma \rho _*/R\), the number of operations of AVTA to compute \(\overline{S}_\varepsilon \) is

    $$\begin{aligned} O(nK_\varepsilon (m+ (\gamma \rho _*)^{-2})). \end{aligned}$$
    (7)
  (iv)

    If only K is known, where \(4 \varepsilon \le \sigma _*=\varSigma _*/R\), the number of operations of AVTA to compute \(\overline{S}_\varepsilon \) is

    $$\begin{aligned} O(nK_\varepsilon (m+ \sigma _*^{-2}) \log (\sigma _*^{-1})). \end{aligned}$$
    (8)
Table 1 \(\varGamma _*= \min \{d(\overline{v}_i, conv(\overline{S} {\setminus } \{\overline{v}_i\}))\}\), \(\varSigma _*= \min \{d(\overline{v}_i, conv(S {\setminus } \{\overline{v}_i\}))\}\), \(\rho _* = \min \{d(v_i, v_j), i \not = j\}\)

We also consider the application of AVTA in the recovery of vertices through the projection of S or \(S_\varepsilon \) under a Johnson–Lindenstrauss randomized linear projection \(L : \mathbb {R}^{m} \rightarrow \mathbb {R}^{m'}\). By relating the robustness parameters of conv(U) and \(conv(U_\varepsilon )\), where \(U=L(S)\) and \(U_\varepsilon =L(S_\varepsilon )\), to those of conv(S) and \(conv(S_\varepsilon )\), we derive analogous complexity bounds for probabilistic computation of the vertex set of conv(U) or those of \(conv(U_\varepsilon )\), or an approximation to these subsets for a given \(t \in (0,1)\). Table  1 summarizes the complexities of computing desired sets under various cases.
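For concreteness, a standard Gaussian construction of such a map L, together with an empirical check of the pairwise distortion, can be sketched as follows (Python/NumPy; the dimensions, the seed and the scaling are illustrative choices of ours, not the specific map analyzed later):

```python
import numpy as np
from itertools import combinations

def jl_project(S, m_prime, rng):
    """Randomized linear map L: R^m -> R^{m'} with i.i.d. Gaussian entries,
    scaled so that squared norms are preserved in expectation."""
    m = S.shape[1]
    L = rng.normal(size=(m, m_prime)) / np.sqrt(m_prime)
    return S @ L

rng = np.random.default_rng(1)
S = rng.random((50, 10000))      # n = 50 points in dimension m = 10000
U = jl_project(S, 400, rng)      # project down to m' = 400

# Worst relative distortion over all pairwise distances.
worst = max(abs(np.linalg.norm(U[i] - U[j]) / np.linalg.norm(S[i] - S[j]) - 1)
            for i, j in combinations(range(len(S)), 2))
print(f"max pairwise distortion: {worst:.3f}")  # small with high probability
```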

4 Triangle Algorithm: efficient membership oracle

The Triangle Algorithm described in Kalantari (2015) is a simple iterative algorithm for solving the convex hull membership problem, a fundamental problem in linear programming and computational geometry. Formally, the convex hull membership problem is as follows: Given a set of points \(S=\{v_1, \dots , v_n\} \subset \mathbb {R}^m\) and a distinguished point \(p \in \mathbb {R} ^m\), test if \(p \in conv(S)\). If \(p \not \in conv(S)\), find a hyperplane that separates p from conv(S). If \(p \in conv(S)\), the Triangle Algorithm solves the problem to within a prescribed precision by generating a sequence of points inside conv(S) that get sufficiently close to p.

Definition 1

Given p, S and \(\varepsilon \in (0,1)\), \(p' \in conv(S)\) is an \(\varepsilon \)-approximate solution if

$$\begin{aligned} d(p',p) \le \varepsilon R, \quad R= \max \{d(v_i, v_j): v_i, v_j \in S\}. \end{aligned}$$
(9)

Given S, p and \(\varepsilon \), Triangle Algorithm(\(S, p, \varepsilon \)) either computes an \(\varepsilon \)-approximate solution, or it computes a hyperplane separating p from conv(S). AVTA, to be described later, will make repeated use of this procedure. AVTA may use any other membership oracle that is capable of these tasks. Since the complexities for AVTA will be based on that of the Triangle Algorithm, we briefly describe it and its basic complexities. In the “Appendix” we describe a more efficient implementation. Given \(u,v \in \mathbb {R} ^m\), we interchangeably use \(d(u,v)= \Vert u- v \Vert \). Given a point in conv(S), the Triangle Algorithm searches for a pivot to get closer to p:

Definition 2

Given \(p' \in conv(S)\), called the iterate, we call \(v \in S\) a p-pivot (or simply pivot) if

$$\begin{aligned} d(p', v) \ge d(p, v) \iff v^Tp-v^Tp' \ge \frac{1}{2} (\Vert p \Vert ^2 - \Vert p' \Vert ^2). \end{aligned}$$
(10)

Definition 3

Given p, \(p' \in conv(S)\) is a p-witness (or simply witness) if the orthogonal bisecting hyperplane to the line segment \(pp'\) separates p from conv(S). Equivalently, \(p' \in conv(S)\) is a p-witness if and only if \(d(p',v_i) < d(p,v_i)\), for all \(i=1, \dots , n\).

Given a point \(p' \in conv(S)\) that is neither an \(\varepsilon \)-approximate solution nor a witness, the Triangle Algorithm finds a p-pivot \(v \in S\). Then on the line segment \(p'v\) it computes the closest point to p, denoted by \(Nearest(p; p'v)\). It then replaces \(p'\) with \(Nearest(p; p'v)\) and repeats the process. It is easy to show the following:

Proposition 1

If an iterate \(p' \in conv(S)\) satisfies \(d(p', p) \le \min \{d(p, v_i): i=1, \dots , n\}\), and \(v_j\) is a p-pivot, then the new iterate

$$\begin{aligned} p''= & {} Nearest(p;p'v_j) = (1-\alpha )p' + \alpha v_j, \end{aligned}$$
(11)
$$\begin{aligned} \alpha= & {} \frac{(p-p')^T(v_j-p')}{d^2(v_j,p')} = \frac{p^Tv_j -p'^Tv_j - p^T p' + \Vert p' \Vert ^2}{\Vert v_j \Vert ^2 - 2p'^T v_j + \Vert p' \Vert ^2}. \end{aligned}$$
(12)

If

$$\begin{aligned} p'=\sum _{i=1}^n \alpha _i v_i, \quad \sum _{i=1}^n \alpha _i=1, \quad \alpha _i \ge 0, \quad \forall i, \end{aligned}$$
(13)

then

$$\begin{aligned} p''=\sum _{i=1}^n \alpha '_i v_i, \quad \alpha '_j=(1-\alpha )\alpha _j+\alpha , \quad \alpha '_i= (1-\alpha )\alpha _i, \quad \forall i \not =j. \end{aligned}$$
(14)

The correctness of the iterative step of the Triangle Algorithm is due to the following:

Theorem 1

(Distance duality) Exclusively, either for each \(p' \in conv(S)\) there exists \(v_j \in S\) that is a p-pivot, or there exists \(p' \in conv(S)\) that is a p-witness, i.e. \(d(p',v_i) < d(p, v_i)\) for all \(i=1, \dots , n\). \(\square \)

Theorem 2

Given \(\varepsilon >0\), in \(O(1/\varepsilon ^2)\) iterations the Triangle Algorithm computes an \(\varepsilon \)-approximate solution, or it computes a witness. In particular, if \(p \not \in conv(S)\) and \(\varDelta = \min \{d(x, p) : x \in conv(S) \}\), the number of iterations to compute a witness is \(O ({R^2}/{\varDelta ^2})\). \(\square \)

The straightforward implementation of each iteration of the Triangle Algorithm is easily seen to take O(mn) arithmetic operations. The algorithm can be described as follows:

[Triangle Algorithm pseudocode box]

Remark 1

In each iteration of the Triangle Algorithm it suffices to have a representation of the iterate \(p'\) in terms of \(v_i\)’s, i.e. \(p'=\sum _{i=1}^n \alpha _i v_i\), where \(\sum _{i=1}^n \alpha _i=1\), \(\alpha _i \ge 0\) for all \(i=1, \dots , n\). It is not necessary to know the coordinates of \(p'\). Rather it is enough to have an array of size n to store the vector of \(\alpha _i\)’s. Then assuming that we have stored \(p^Tv_i\), \(i=1, \dots , n\), we can compute the step size \(\alpha \) [see (12)] and \(p''\) (the new iterate) in O(n) time.
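To fix ideas, the following is a minimal Python/NumPy sketch of one membership test. It follows Definitions 2 and 3 and the update (11)–(14), and maintains the convex-combination weights of the iterate as in Remark 1 (it also keeps the iterate's coordinates for clarity). All names and the iteration budget are ours; this is an illustration, not the optimized implementation of the “Appendix”:

```python
import numpy as np

def triangle_algorithm(S, p, eps, R=None, max_iter=10**6):
    """Return ('in', p', alpha) with d(p', p) <= eps * R, or
    ('out', p', alpha) where p' in conv(S) is a p-witness.
    S: (n, m) array; p: (m,); p' = sum_i alpha[i] * S[i]."""
    n = S.shape[0]
    if R is None:  # diameter of S; O(n^2) here, fine for a sketch
        R = max(np.linalg.norm(S[i] - S[j]) for i in range(n) for j in range(n))
    j0 = int(np.argmin(np.linalg.norm(S - p, axis=1)))  # start near p
    alpha = np.zeros(n)
    alpha[j0] = 1.0
    p_prime = S[j0].astype(float)
    for _ in range(max_iter):  # O(1/eps^2) iterations by Theorem 2
        if np.linalg.norm(p_prime - p) <= eps * R:
            return 'in', p_prime, alpha
        # A p-pivot v_j satisfies d(p', v_j) >= d(p, v_j) (Definition 2).
        gaps = np.linalg.norm(S - p_prime, axis=1) - np.linalg.norm(S - p, axis=1)
        j = int(np.argmax(gaps))
        if gaps[j] < 0:                    # no pivot exists, so by
            return 'out', p_prime, alpha   # Theorem 1, p' is a p-witness
        u = S[j] - p_prime
        step = np.clip(np.dot(p - p_prime, u) / np.dot(u, u), 0.0, 1.0)  # (12)
        alpha *= 1.0 - step                # weight update, (14)
        alpha[j] += step
        p_prime = p_prime + step * u       # new iterate, (11)
    raise RuntimeError('iteration budget exhausted')
```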

5 All vertex triangle algorithm (AVTA)

Given \(S= \{v_i \in \mathbb {R} ^m: i=1, \dots , n\}\), let R be its diameter, i.e. \(R= \max \{d(v_i, v_j), v_i,v_j \in S\}\). Denote the set of vertices of conv(S) by

$$\begin{aligned} \overline{S} =\{\overline{v}_1, \dots , \overline{v}_K\}. \end{aligned}$$
(15)

A straightforward but naive way to compute \(\overline{S}\) is to test for each \(v_i\) whether it lies in \(conv(S {\setminus } \{v_i\})\), to within an \(\varepsilon \) precision. For each \(v_i\) this is an LP-feasibility problem. As usual, given rational inputs, there is an \(\varepsilon _0\) so that any \(\varepsilon \le \varepsilon _0\) gives the desired precision. Thus it suffices to call the convex hull membership oracle n times. However, this is inefficient. In what follows we describe AVTA, which leverages the Triangle Algorithm as a membership oracle. The advantage of the Triangle Algorithm is that its distance duality allows AVTA to capture extreme points strategically, leading to a more efficient complexity than the straightforward algorithm.

Definition 4

We say conv(S) is \(\varGamma _*\)-robust if

$$\begin{aligned} \varGamma _*= \min \{ d(\overline{v}_i, conv(\overline{S} {\setminus } \{\overline{v}_i\})): i=1, \dots , K \}. \end{aligned}$$
(16)

As an example, given a triangle with vertices \(v_1, v_2, v_3\), \(\varGamma _*\) is the minimum of the distances from each vertex to the line segment determined by the other two vertices. Thus if other points are placed inside the triangle, \(\varGamma _*\) will not be affected.

The following is immediate from Definition 4.

Proposition 2

Let \(\widehat{S}= \{\widehat{v}_1, \dots , \widehat{v}_N \}\) be a subset of \(\overline{S}\). Suppose conv(S) is \(\varGamma _*\)-robust. Given \(v \in S {\setminus } \widehat{S}\), if for some \(\gamma \le \gamma _* \equiv \varGamma _*/R\) we have \(d (v, conv(\widehat{S})) < \gamma R\), then \(v \not \in \overline{S}\).

Theorem 3

Let \(\widehat{S}= \{ \widehat{v}_1, \dots , \widehat{v}_N\}\) be a subset of \(\overline{S}\). Given \(\gamma \in (0,1)\), consider testing if a given \(v \in S {\setminus } \widehat{S}\) satisfies \(d(v, conv(\widehat{S})) \le \gamma R /2\). Suppose we are given \(p' \in conv( \widehat{S})\) for which \(\Vert p' \Vert ^2\) as well as \(p'^T \widehat{v}_i\), \(i=1, \dots , N\) are computed. Then the number of operations to check if \(d(v, conv(\widehat{S})) \le \gamma R/2\) satisfies

$$\begin{aligned} O \bigg (m K^2+ \frac{K}{\gamma ^2} \bigg ). \end{aligned}$$
(17)

Proof

Proof is immediate from Theorem 14 and the fact that \(N \le K\). \(\square \)

We now describe a modification of the Triangle Algorithm for computing all vertices of conv(S). We call this the All Vertex Triangle Algorithm, or simply AVTA. Assume conv(S) is \(\varGamma _*\)-robust, where \(\varGamma _*\) may or may not be available. However, assume we have a constant \(\gamma \in (0,1)\) known to satisfy \(\gamma \le \gamma _*=\varGamma _*/R\). AVTA works as follows. Given a working subset \(\widehat{S}\) of \(\overline{S}\), initially of cardinality \(N=1\) (see Proposition 3), a single vertex of S, it arbitrarily selects \(v \in S {\setminus } \widehat{S}\). It then tests via the Triangle Algorithm if \(d(v, conv(\widehat{S})) \le \gamma R/2\). If so, it discards v since by definition of \(\gamma \) it cannot belong to \(\overline{S}\) (see Proposition 2). Otherwise, it computes a v-witness \(p' \in conv(\widehat{S})\). It then sets \(c'=v-p'\) and maximizes \(c'^Tx\) where x ranges in \(conv(S {\setminus } \widehat{S})\). The maximum value coincides with the maximum of \(c'^Tv_i\) where \(v_i\) ranges in \(S {\setminus } \widehat{S}\). If the set of optimal solutions is denoted by \(S'\), then \(conv(S')\) is a face of conv(S). A vertex \(v'\) of \(conv(S')\) is a point in \(S'\) and is necessarily a vertex of conv(S). Such a vertex can be computed efficiently. Having computed a new vertex \(v'\) of conv(S), AVTA replaces \(\widehat{S}\) with \(\widehat{S} \cup \{v'\}\) and the process is repeated. However, if v coincides with \(v'\), AVTA selects a new point in \(S {\setminus } \widehat{S}\). Otherwise, AVTA continues to test if the same v (for which a witness was found) is within a distance of \(\gamma R/2\) of the convex hull of the augmented set \(\widehat{S}\), using the same witness \(p'\) as a warm start. In doing so each selected \(v \in S\) is either determined to be a vertex itself, or it continues to be tested until it lies within a distance of \(\gamma R/2\) of the growing set \(\widehat{S}\), at which point it is discarded before AVTA tests another point. We will describe AVTA more precisely. However, we first prove the necessary results. The following Lemma is trivial and we omit the proof:

Lemma 1

Let \(\widehat{S}= \{ \widehat{v}_1, \dots , \widehat{v}_N\}\) be a subset of \(\overline{S}\). For a given \(v \in S {\setminus } \widehat{S}\) suppose \(p' \in conv(\widehat{S})\) is a v-witness. Let \(c'=v-p'\). Then

$$\begin{aligned} \max \{c'^T x: x \in conv(S {\setminus } \widehat{S}) \}= \max \{c'^T v_i: v_i \in S {\setminus } \widehat{S} \}, \end{aligned}$$
(18)

thus solving this linear program gives a new vertex.

Corollary 1

Let \(c'=v-p'\) be as in Lemma 1, \(p' \in conv( \widehat{S})\) for which \(\Vert p' \Vert ^2\) as well as \(p'^T \widehat{v}_i\), \(i=1, \dots , N\) are computed. Then, \(\max \{c'^T x: x \in conv(S {\setminus } \widehat{S}) \}\) can be computed in O(nK) operations.

Proof

Since \(N \le K\), for each i, \(c'^Tv_i\) can be computed in O(K) operations. \(\square \)

The next theorem is trivial and we omit the proof.

Theorem 4

Let \(S'\) be the set of optimal solutions of \(\max \{c'^T x: x \in S {\setminus } \widehat{S} \}\). Let \(v' \in S'\) be a vertex of \(conv(S')\). Then \(v'\) is a vertex of conv(S), i.e. \(v' \in \overline{S} =\{ \overline{v}_1, \dots , \overline{v}_K\}\).

The following shows computing a single vertex of conv(S) is trivial.

Proposition 3

Given any v in S, let Farthest(v, S) return a point in S that is farthest from v. Then Farthest(v, S) is a vertex of conv(S), hence a member of \(\overline{S}\).

Proof

If Farthest(v, S) is not a vertex of conv(S) it can be written as a convex combination of two other points \(v_1, v_2 \in conv(S)\). But then Farthest(v, S) is not the farthest point from v in the triangle formed by \(v,v_1,v_2\), a contradiction.

\(\square \)

While Farthest(v, S) is a simple procedure for capturing a vertex, the set of farthest points taken over all points, in the worst case, could yield as few as two vertices. In AVTA, it suffices to start with one vertex using the above procedure, which takes O(nm) operations. Next we describe AVTA for computing all vertices of conv(S).

[AVTA pseudocode box]

Remark 2

Here we make remarks about the steps of AVTA. In Step 0 AVTA selects the first vertex. In Step 1 it randomly selects a v in \(S {\setminus } \widehat{S}\). In Step 2 AVTA checks if the point v selected in Step 1 is sufficiently close to the convex hull of the current set of vertices, \(\widehat{S}\). If so, in Step 3 v is discarded from further consideration. Otherwise, a v-witness \(p'\) is at hand. Step 4 then uses this witness to compute a direction, \(c'=v-p'\), where the maximization of \(c'^Tx\) gives a subset \(S'\) of S consisting of the optimal solutions. Then a vertex of \(conv(S')\) will necessarily be a vertex of conv(S). A vertex of \(conv(S')\) is selected by choosing an arbitrary \(v' \in S'\) and computing its farthest point in \(S'\). It may be the case that the vertex \(v'\) found in Step 4 coincides with v. Step 5 checks if \(v'=v\), in which case it selects a new v in the updated \(S {\setminus } \widehat{S}\) in Step 1 for consideration. Otherwise, when this new vertex \(v'\) is not v itself, in Step 5 AVTA sends v back to Step 2 to be reexamined, testing whether v is within \(\gamma R/2\) distance of the convex hull of the augmented \(\widehat{S}\).
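The steps just described can be summarized in a short sketch, reusing `triangle_algorithm` from the sketch in Sect. 4 as the membership oracle. It assumes a valid lower bound `gamma` on \(\varGamma _*/R\) is supplied; the tie-breaking tolerance and the absence of warm starts are simplifications of ours, so this illustrates Steps 0–5 rather than giving an optimized implementation:

```python
import numpy as np

def avta(S, gamma):
    """Return indices into S of the vertices of conv(S), assuming
    0 < gamma <= Gamma_*/R. A direct transcription of Steps 0-5."""
    n = S.shape[0]
    R = max(np.linalg.norm(S[i] - S[j]) for i in range(n) for j in range(n))
    # Step 0: Farthest(v, S) is a vertex (Proposition 3).
    hat = [int(np.argmax(np.linalg.norm(S - S[0], axis=1)))]
    candidates = [i for i in range(n) if i != hat[0]]
    while candidates:
        i = candidates[0]                  # Step 1: select a point v
        status, p_prime, _ = triangle_algorithm(S[hat], S[i], gamma / 2, R=R)
        if status == 'in':                 # Steps 2-3: v lies within
            candidates.pop(0)              # (gamma/2) R of conv(hat),
            continue                       # so v is not a vertex
        # Step 4: witness direction c' = v - p'; its maximizers over the
        # remaining points form the face S' of Lemma 1.
        c = S[i] - p_prime
        vals = S[candidates] @ c
        face = [candidates[k] for k in np.flatnonzero(vals >= vals.max() - 1e-12)]
        # A farthest-point step inside the face picks one of its vertices
        # (Proposition 3 applied to S'), hence a vertex of conv(S).
        v_new = max(face, key=lambda j: np.linalg.norm(S[j] - S[face[0]]))
        hat.append(v_new)
        candidates.remove(v_new)
        # Step 5: if v_new == i we move on; otherwise v stays in the
        # candidate list and is re-tested against the augmented hull.
    return hat
```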

Example 1

We consider an example of AVTA, see Fig. 2. In this example \(S=\{v_1, \dots , v_{11}\}\). Note that the set of vertices is \(\overline{S}=\{v_4,v_{10}, v_6, v_1, v_9, v_2, v_5, v_8\}\). Suppose the current working subset of vertices \(\overline{S}\) consists of \(\widehat{S}=\{v_1, v_9, v_2, v_5\}\) and \(v=v_3\) is randomly selected to be tested if it lies in \(conv(\widehat{S})\). A witness \(p' \in conv(\widehat{S})\) is computed, and with \(c'=v-p'\) the maximum of \(c'^Tx\) over \(conv(S {\setminus } \widehat{S})\) is attained at \(S'=\{v_4, v_7, v_{10}\}\). Subsequently one of the two points \(v_4\) or \(v_{10}\) will become the next vertex to be placed in \(\widehat{S}\).

Fig. 2

An example where \( \widehat{S} =\{v_1, v_9, v_2, v_5\}\), then \(v=v_3\) is randomly selected from \(S{ {\setminus }} \widehat{S}\) and is tested if it lies in \(conv(\widehat{S})\). A witness \(p'\) is found. Then using \(c'=v-p'\) the set \(S'=\{v_4, v_7, v_{10}\}\) is computed and one of the vertices of \(conv(S')\), i.e. \(v_4\) or \(v_{10}\), is selected for inclusion in \(\widehat{S}\)

The following theorem is one of the main results:

Theorem 5

Let \(S= \{v_1, \dots , v_n\} \subset \mathbb {R} ^m\). Let R be the diameter of S. Let \(\overline{S} =\{\overline{v}_1, \dots , \overline{v}_K\}\) be the set of vertices of conv(S).

  (1)

    Given any prescribed \(t \in (0,1)\) in

    $$\begin{aligned} O \bigg (nK^{(t)}(m+ \frac{1}{t^2})\bigg ) \end{aligned}$$
    (19)

    operations AVTA computes a subset \(\overline{S}^t\) of \(\overline{S}\) of size \(K^{(t)}\) so that the distance from each p in conv(S) to \(conv(\overline{S}^t)\) is at most tR.

  (2)

    Furthermore, suppose that conv(S) is \(\varGamma _*\)-robust and let \(\gamma _*=\varGamma _*/R\). Given \(\gamma \in (0,\gamma _*)\), the number of arithmetic operations AVTA takes to compute \(\overline{S}\) is:

    $$\begin{aligned} O \bigg (nK(m+ \frac{1}{\gamma ^{2}}) \bigg ). \end{aligned}$$
    (20)
  (3)

    If only K is known, the complexity of AVTA to compute \(\overline{S}\) is

    $$\begin{aligned} O\bigg ( \bigg (nK(m+ \frac{1}{\gamma _*^2} ) \bigg ) \log \frac{1}{\gamma _*} \bigg ). \end{aligned}$$
    (21)

Proof

(1) For each input \(t \in (0,1)\), AVTA computes a subset \(\overline{S}^t\) of \(\overline{S}\) with \(K^{(t)}\) elements that approximates conv(S) to within tR precision. Initially in AVTA, the subset \(\widehat{S}\) consists of a single element of \(\overline{S}\). It continues to grow until for every \(v \in S\) the distance between v and \(conv(\widehat{S})\) is at most tR. Since the number of vertices needed to give a tR approximation is at most \(K^{(t)}\), for each \(v \in S {\setminus } \widehat{S}\) the cost of Step 2 in AVTA is \(O(mK^{(t)^2}+ {K^{(t)}}/{t^{2}})\) (Theorem 14 in “Appendix”). Here, (ii) in Theorem 2 is applied together with the fact that \(R/{\varDelta }\le 1/{t}\). The inner products needed in Step 2 are \(\widehat{v}_i^T \widehat{v}_j\). However, these inner products need to be computed only once, and since there are at most \(K^{(t)}\) of the \(\widehat{v}_i\)’s, they can be computed at a cost of \(O(mK^{(t)^2})\) operations. We can store the values of the inner products in an array and reuse them as they arise in subsequent iterations. This kind of storage can be used for other inner products that may need to be computed in the course of the algorithm. When a selected v is within distance tR/2 of \(conv(\widehat{S})\), Step 3 eliminates it from further consideration since v is well approximated. If v is not eliminated, it either gives rise to a new vertex \(v' \in \overline{S}\), or v is a vertex itself. In either case, identifying a new vertex of \(\overline{S}\) once a witness is available requires the maximization of \(c'^Tv_i\) as \(v_i\) ranges over the current candidate set \(S {\setminus } \widehat{S}\). Since \(c'= v-p'\), \(p'=\sum _{j=1}^N \alpha _j \widehat{v}_j\), where \(N= |\widehat{S}|\), the evaluation of \(c'^T v_i\) requires the computation of \(v^T v_i\) and \(v_i^T \widehat{v}_j\), \(j=1, \dots , N\). This requires O(nm) operations. Since such a computation is required only once per detected vertex, overall the computation of all \(c'^Tv_i\) requires \(O((n-K^{(t)})m K^{(t)})=O(nm{K^{(t)}})\) operations. Together with Theorem 14 this implies that the overall complexity is \(O(m{K^{(t)^2}}+ nmK^{(t)} + n K^{(t)}/ t^2)\), which is the claimed complexity in (1). Next we prove that for each \(p \in conv(S)\), the distance from p to \(conv( \overline{S}^t)\) is at most tR. To begin with, for the ‘missing vertices’, i.e. all \(\overline{v}_i \in \overline{S} {\setminus } \overline{S}^t \), we have \(d(\overline{v}_i,conv(\overline{S}^t)) \le tR\). For any \(p \in conv(S)\), we have:

$$\begin{aligned} p= \sum _{i=1}^K \alpha _i \overline{v}_i, \quad \sum _{i=1}^K \alpha _i =1, \quad \alpha _i \ge 0. \end{aligned}$$
(22)

Thus

$$\begin{aligned} d(p,conv(\overline{S}^t)) \le \sum _{i=1}^K \alpha _i d(\overline{v}_i, conv(\overline{S}^t))\le \sum _{i=1}^K \alpha _i tR=tR. \end{aligned}$$
(23)

(2) Since a value \(\gamma \le \gamma _*\) is given, it suffices to apply the proof of (1) with \(t=\gamma \). Note that each time Step 3 eliminates a point v, v is within \(\gamma R/2\) of \(conv(\widehat{S})\), hence by the definition of \(\gamma _*\), v cannot be a vertex.

(3) When only K is known, we execute AVTA multiple times with different values of \(\gamma \), initialized at 0.5. If we compute K vertices with the current estimate of \(\gamma _*=\varGamma _*/R\), we stop. Otherwise, we halve \(\gamma \) and repeat the process. Eventually, in \(O(\log (\gamma _*^{-1}))\) calls to AVTA, we accumulate all K vertices in \(\overline{S}\). Note that each call of AVTA takes \(O \big (nK(m+ \frac{1}{\gamma _*^2} ) \big )\) operations.

\(\square \)

Remark 3

If neither K nor an estimate \(\gamma \) of \(\gamma _*=\varGamma _*/R\) is known, initially we select \(t=0.5\) and with this value of t compute a subset of vertices with \(K^{(t)}\) elements. We can then halve t and repeat the process. Intuitively, if for two consecutive values of t no more vertices are generated we can terminate the process, or decrease t by a factor of four. If \(\varGamma _*\) is not too small we will produce a reasonably good subset of \(\overline{S}\) within a reasonable number of calls to AVTA. In either case we are assured of an approximation of conv(S) according to (1) in Theorem 5.
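A minimal driver for this scheme, reusing `avta` from the sketch above, might look as follows; the two-stable-rounds stopping rule is the heuristic of this remark, while the branch for known K is the halving loop used in the proof of Theorem 5, part (3):

```python
def avta_adaptive(S, K=None, t0=0.5, max_rounds=30):
    """Halve the tolerance until K vertices are found (if K is given),
    or until two consecutive halvings add no new vertices (heuristic)."""
    t, counts, hat = t0, [], []
    for _ in range(max_rounds):
        hat = avta(S, t)       # t plays the role of gamma in the sketch
        counts.append(len(hat))
        if K is not None and len(hat) >= K:
            break              # all K vertices accumulated
        if K is None and len(counts) >= 3 and counts[-1] == counts[-3]:
            break              # two rounds without new vertices
        t /= 2
    return hat
```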

5.1 Application of AVTA in solving the convex hull membership

Suppose we wish to solve the convex hull membership problem: Test if a particular point p lies in conv(S), \(S=\{v_1, \dots , v_n\}\). This is equivalent to linear programming and thus can be solved with a variety of algorithms, including polynomial-time algorithms, the simplex method, Frank–Wolfe, or the Triangle Algorithm. Whichever algorithm we use, the number n plays a role in the complexity. Thus if we compute the set of vertices of conv(S), \(\overline{S}\), we can then test if p lies in \(conv(\overline{S})\), with K playing the role of n. This approach may seem inefficient, however depending upon the accuracy to which we wish to solve the problem and the size of \(\gamma _*\), it may result in a more efficient algorithm. The next theorem considers the application of Theorem 5 in solving the convex hull membership problem.

Theorem 6

Let \(S= \{v_1, \dots , v_n\} \subset \mathbb {R} ^m\). Let R be the diameter of S. Let \( \overline{S} =\{\overline{v}_1, \dots , \overline{v}_K\}\) be the set of vertices of conv(S). Suppose conv(S) is \(\varGamma _*\)-robust. Given any \(0 < \gamma \le \varGamma _*/R\), the number of operations to test if a given \(p \in \mathbb {R}^m\) admits an \(\varepsilon \)-approximate solution is

$$\begin{aligned} O \bigg (nmK+ \frac{nK}{\gamma ^2} + \frac{K}{\varepsilon ^2} \bigg ). \end{aligned}$$
(24)

Proof

Testing if p admits an \(\varepsilon \)-approximate solution can be achieved by first computing \(\overline{S}\), the vertices of conv(S), followed by testing if p admits an \(\varepsilon \)-approximate solution with respect to \(conv(\overline{S})\). From Theorems 2 and 5 it follows that the total complexity is as claimed. \(\square \)

Remark 4

It is easy to check that for some values of \(\varepsilon < \gamma \) the computation of \(\overline{S}\) followed by testing if p lies in \(conv(\overline{S})\) could be more efficient than solving the convex hull membership problem without computing \(\overline{S}\). This is especially true when \(K= o(n)\).
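In code, the two-stage test of Theorem 6 is simply the composition of the two sketches given earlier (with `gamma` and `eps` as in the theorem); note that when several points are to be tested against the same S, the vertex computation is paid only once:

```python
def membership_via_vertices(S, p, gamma, eps):
    """Compute the vertex set once, then run the membership oracle on
    the (typically much smaller) set of K vertices instead of all n."""
    vertex_idx = avta(S, gamma)        # O(nmK + nK / gamma^2)
    return triangle_algorithm(S[vertex_idx], p, eps)
```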

6 AVTA under input perturbation

As in the previous section, we assume \(S= \{v_1, \dots , v_n\} \subset \mathbb {R} ^m\), R the diameter of S, and \(\overline{S} =\{ \overline{v}_1, \dots , \overline{v}_K\}\) the set of vertices of conv(S). Assume conv(S) is \(\varGamma _*\)-robust.

As before we wish to compute \(\overline{S}\) or a reasonable subset of it. However, in practice the input may not be S but a perturbation of it. This changes the set of vertices, the robustness parameter and more. We wish to study perturbations under which we can recover the corresponding perturbation of \(\overline{S}\), and to extend AVTA to computing this perturbation.

Definition 5

For a given \(\varepsilon \in (0,1)\), the \(\varepsilon \)-perturbation of S is the set \(S_\varepsilon \) defined as

$$\begin{aligned} S_\varepsilon =\{v^{\varepsilon }_1, \dots , v^{\varepsilon }_n\}, \quad \Vert v_i - v^{\varepsilon }_i \Vert \le \varepsilon R. \end{aligned}$$
(25)

The \(\varepsilon \)-perturbation of \(\overline{S}\) is the set \(\overline{S}_\varepsilon \), denoted by

$$\begin{aligned} \overline{S}_\varepsilon = \{ \overline{v}_1^{\varepsilon }, \dots , \overline{v}_K^{\varepsilon }\}, \end{aligned}$$
(26)

where \(\overline{v}^{\varepsilon }_i\) is the perturbation of \(\overline{v}_i\).

In practice we may be given \(S_\varepsilon \) as opposed to S. The first question that arises is: What is the relationship between the vertices of S and those of \(S_\varepsilon \)? Without any assumptions, the vertices of \(conv(S_\varepsilon )\) could change drastically, even under small perturbations.

Example 2

Consider a triangle with three additional interior points, very close to its vertices. It may be the case that even under small perturbation all six points become vertices, or that the interior points become the new vertices while the vertices become the new interior points. Thus there is a need to make some assumptions before we can say anything about the nature of perturbed points.
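This swap is easy to reproduce numerically. The following small construction (Python with NumPy/SciPy; entirely our own illustration, not taken from the text) starts with a triangle and three interior points hugging its vertices, then applies a perturbation well below the diameter that exchanges the roles of the two groups:

```python
import numpy as np
from scipy.spatial import ConvexHull

tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
inner = 0.99 * tri + 0.01 * tri.mean(axis=0)  # interior points near vertices
S = np.vstack([tri, inner])
print(sorted(ConvexHull(S).vertices))         # [0, 1, 2]: the triangle

# Pull the true vertices slightly toward the centroid and push the interior
# points slightly away from it; every displacement is below 0.05 * diameter.
eps = 0.05
c = tri.mean(axis=0)
S_eps = np.vstack([tri + eps * (c - tri), inner + eps * (inner - c)])
print(sorted(ConvexHull(S_eps).vertices))     # [3, 4, 5]: roles swapped
```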

We would hope that for an appropriate range of values of \(\varepsilon \), \(\overline{S}_\varepsilon \) would at least be a subset of the set of vertices of \(conv(S_\varepsilon )\). First we need a definition.

Definition 6

We say conv(S) is \(\varSigma _*\)-weakly robust if

$$\begin{aligned} \varSigma _*= \min \{ d(v, conv(S {\setminus } \{ v\})): v \in \overline{S} \}.\end{aligned}$$
(27)

Example 3

Suppose S consists of the vertices \(v_1, v_2, v_3\) of a non-degenerate triangle, together with one additional point placed inside the triangle. Then clearly \(\varSigma _* < \varGamma _*\).

More generally we have

Proposition 4

Given \(S=\{v_1, \dots , v_n\}\), we have

$$\begin{aligned} \varSigma _* \le \varGamma _*. \end{aligned}$$
(28)
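These quantities are easy to evaluate on small instances. The following sketch (Python with NumPy/SciPy; names ours) computes \(\varGamma _*\), \(\varSigma _*\) and \(\rho _*\) for the configuration of Example 3, solving each distance-to-hull problem as a small quadratic program over the unit simplex:

```python
import numpy as np
from scipy.optimize import minimize

def dist_to_hull(p, P):
    """d(p, conv(P)) as min ||P^T x - p|| over the unit simplex."""
    k = P.shape[0]
    res = minimize(lambda x: np.sum((P.T @ x - p) ** 2),
                   np.full(k, 1.0 / k), method='SLSQP',
                   bounds=[(0.0, 1.0)] * k,
                   constraints=({'type': 'eq',
                                 'fun': lambda x: np.sum(x) - 1.0},))
    return np.sqrt(max(res.fun, 0.0))

V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # the vertices S-bar
S = np.vstack([V, [[0.5, 0.5]]])                    # plus an interior point

Gamma = min(dist_to_hull(V[i], np.delete(V, i, axis=0)) for i in range(3))
Sigma = min(dist_to_hull(V[i], np.delete(S, i, axis=0)) for i in range(3))
rho = min(np.linalg.norm(S[i] - S[j])
          for i in range(4) for j in range(i + 1, 4))
print(Gamma, Sigma, rho)  # ~1.414, ~0.707, ~0.707: Sigma_* < Gamma_*, as in (28)
```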

Other than the inequality in Proposition 4, \(\varSigma _*\) and \(\varGamma _*\) corresponding to the set S may seem unrelated, however in the following theorem we establish a relationship between the two that is useful in the analysis of AVTA for computing \(\overline{S}_\varepsilon \).

Theorem 7

Let S and \(\overline{S}\) be as before. Suppose conv(S) is \(\varGamma _*\)-robust, also \(\varSigma _*\)-weakly robust. Let \(\rho _*=\min \{d(v_i, v_j): v_i, v_j \in S, i \not =j\}\). We have

$$\begin{aligned} \varSigma _* \ge \frac{\rho _* }{R} \varGamma _* = \rho _* \gamma _*. \end{aligned}$$
(29)

Proof

For each vertex v of conv(S), let \(\varGamma _v\) be the distance from v to the convex hull of the remaining vertices. Specifically,

$$\begin{aligned} \varGamma _v= d(v, conv(\overline{S} {\setminus } \{v\})). \end{aligned}$$
(30)

Also let \(\varSigma _v\) be the distance from v to the convex hull of all other points in S. Specifically,

$$\begin{aligned} \varSigma _v= d(v, conv(S {\setminus } \{v\})). \end{aligned}$$
(31)

Clearly we have,

$$\begin{aligned} \varSigma _v \le \varGamma _v. \end{aligned}$$
(32)

Assume v is a vertex for which \(\varSigma _v < \varGamma _v\). If no such vertex exists then \(\varSigma _* =\varGamma _*\) (see Fig. 3). Let u be the closest point to v lying in the convex hull of the other vertices of S. Thus

$$\begin{aligned} \varGamma _v= d(v, u), \quad u \in conv(\overline{S}). \end{aligned}$$
(33)

Let \(H_u\) be the hyperplane orthogonal to the line segment vu, passing through u. By definition of u and Carathéodory’s theorem, u is a convex combination of vertices of conv(S) lying on \(H_u\). Thus for some subset T of \( \overline{S}\) lying on \(H_u\)

$$\begin{aligned} u = \sum _{ \overline{v}_i \in T \subset \overline{S}} \alpha _i \overline{v}_i, \quad \sum _{ \overline{v}_i \in T \subset \overline{S}} \alpha _i =1, \quad \alpha _i \ge 0. \end{aligned}$$
(34)

Figure 3 gives a depiction of this property for a simple example. In the example u is a convex combination of \(\overline{v}\) and \(\overline{v}'\), vertices of conv(S) lying in the intersection of \(H_u\) and conv(S). Consider one of these vertices, say \(\overline{v}\). Moving the hyperplane \(H_u\) parallel to itself toward v, it intersects the line segment uv at a unique point w that lies on a facet of \(conv(S {\setminus } \{v\})\). Such a w exists because \(\varSigma _v < \varGamma _v\). In other words, if \(H_w\) is the hyperplane parallel to \(H_u\) passing through w, then the region of conv(S) enclosed between the halfspace defined by \(H_w\) and v contains no point of S in its interior (see the shaded area in Fig. 3). This implies

$$\begin{aligned} \varSigma _v \ge d(v,w). \end{aligned}$$
(35)
Fig. 3

Given \(v \in \overline{S}\), u is its closest point in \(conv(\overline{S} {\setminus } \{v\})\). \(\overline{v}, \overline{v}'\in \overline{S}\) are vertices of conv(S) lying on \(H_u\), the hyperplane orthogonal to the line segment uv at u. u is a convex combination of these vertices. Moving \(H_u\) parallel to itself toward v, it intersects the line segment uv at a unique point w lying on a facet of \(conv(S {\setminus } \{v\})\). Thus the interior of the shaded region contains no point of S

Now consider the intersection of \(H_w\) and each ray connecting v to \(\overline{v}_i \in T\). Denote this intersection by \(y_i\). In the figure, the intersection of \(H_w\) and the ray connecting v to \(\overline{v}\) is denoted by y. By definition of w and Carathéodory’s theorem there must exist a point \(v_j \in S\) lying on \(H_w\). Furthermore, \(v_j\) can be written as a convex combination of all the \(y_i\)’s. Thus we may write

$$\begin{aligned} v_j = \sum _{\overline{v}_i \in T \subset \overline{S}} \beta _i y_i, \quad \sum _{\overline{v}_i \in T \subset \overline{S}} \beta _i=1, \quad \beta _i \ge 0. \end{aligned}$$
(36)

Since by definition of \(\rho _*\), \(d(v,v_j) \ge \rho _*\), at least one \(y_i\) must satisfy \(d(v, y_i) \ge \rho _*\). This implies we may assume \(\overline{v}\) was chosen so that the corresponding y satisfies

$$\begin{aligned} d(v, y) \ge \rho _*. \end{aligned}$$
(37)

From similarity of the triangles \(\triangle vu \overline{v}\) and \(\triangle vwy\) we may write

$$\begin{aligned} \frac{d(v,w)}{\varGamma _v} =\frac{d(v, y)}{d(v, \overline{v})}. \end{aligned}$$
(38)

From the definition of R as the diameter of S, \(d(v, \overline{v}) \le R\). From (38), (35) and (37) it follows that

$$\begin{aligned} \varSigma _v \ge d(v,w) \ge \frac{1}{R} d(v, y) {\varGamma _v} \ge \frac{1}{R} \rho _* \varGamma _*. \end{aligned}$$
(39)

This means we have

$$\begin{aligned} \varSigma _* \ge \frac{1}{R} \rho _* \varGamma _*. \end{aligned}$$
(40)

\(\square \)

In what follows we will derive complexity bounds for computing \(\overline{S}_\varepsilon \). These complexities will in particular depend on \(\varSigma _*\), or on any lower bound \(\sigma \) on \(\sigma _*=\varSigma _*/R\). Theorem 7 implies that we can choose \(\sigma =\rho _* \gamma _*/R\).

The following theorem describes a simple condition under which the vertices of conv(S) remain vertices of the perturbed convex hull.

Theorem 8

Let S be as before, R the diameter of S. Suppose conv(S) is \(\varSigma _*\)-weakly robust. Suppose \(S_\varepsilon \) is an \(\varepsilon \)-perturbation of S. Let \(\sigma \) be a positive number satisfying \(\sigma \le \sigma _*=\varSigma _*/R\). Assume \(\varepsilon < \sigma /2\). If \(v \in S\) is a vertex of conv(S) and \(v^\varepsilon \in S_\varepsilon \) its corresponding \(\varepsilon \)-perturbation, then \(v^{\varepsilon }\) is a vertex of \(conv(S_\varepsilon )\).

Proof

Suppose \(v^{\varepsilon }\) is not a vertex of \(conv(S_\varepsilon )\). Without loss of generality assume \(v=v_1\). Hence, \(v^{\varepsilon }=v^{\varepsilon }_1\). Thus \(v^{\varepsilon } \in conv(S_\varepsilon {\setminus } \{v^{\varepsilon }\})\). We may write

$$\begin{aligned} v^{\varepsilon }=\sum _{i=2}^n \alpha _i v^{\varepsilon }_i, \quad \sum _{i=2}^n \alpha _i=1, \quad \alpha _i \ge 0. \end{aligned}$$
(41)

Set

$$\begin{aligned} u =\sum _{i=2}^n \alpha _i v_i. \end{aligned}$$
(42)

On the one hand we have

$$\begin{aligned} u- v^{\varepsilon } = \sum _{i=2}^n \alpha _i (v_i- v^{\varepsilon }_i). \end{aligned}$$
(43)

Then by the triangle inequality

$$\begin{aligned} \Vert u- v^{\varepsilon } \Vert \le \sum _{i=2}^n \alpha _i \Vert v_i- v^{\varepsilon }_i \Vert \le \sum _{i=2}^n \alpha _i {\varepsilon } R ={\varepsilon } R. \end{aligned}$$
(44)

On the other hand, v is in \(\overline{S}\). Without loss of generality assume \(v= \overline{v}_1\). From this assumption and since by (42) \(u \in conv(\overline{S} {\setminus } \{ \overline{v}_1\})\) we have

$$\begin{aligned} u= \sum _{i=2}^K \gamma _i \overline{v}_i, \quad \sum _{i=2}^K \gamma _i =1, \quad \gamma _i \ge 0. \end{aligned}$$
(45)

Since conv(S) is \(\varSigma _*\)-weakly robust, \(u \in conv(S {\setminus } \{v\})\) and \(\sigma \le \sigma _*= \varSigma _*/R\), we have

$$\begin{aligned} \Vert u-v\Vert \ge \sigma R. \end{aligned}$$
(46)

However, from (44), the fact that \(\Vert v - v^{\varepsilon } \Vert \le \varepsilon R\) and the triangle inequality, we may write

$$\begin{aligned} \Vert u - v \Vert = \Vert u - v^{\varepsilon } + v^{\varepsilon } - v \Vert \le \Vert u - v^{\varepsilon } \Vert + \Vert v^{\varepsilon } - v \Vert \le \varepsilon R + \varepsilon R = 2 \varepsilon R. \end{aligned}$$
(47)

This contradicts the assumption that \(2\varepsilon < \sigma \). Hence \(v^{\varepsilon }\) is a vertex of \(conv(S_\varepsilon )\). \(\square \)

Remark 5

The theorem implies that if the input to AVTA is \(S_\varepsilon \) instead of S, AVTA will still return at least K vertices. However, the set of vertices of \(conv(S_\varepsilon )\) may have more than K elements, possibly all of \(S_\varepsilon \). Moreover, the weak-robustness parameter \(\varSigma _*\) will change. We thus need to revise AVTA if we wish to extract the subset \(\overline{S}_\varepsilon =\{ \overline{v}^{\varepsilon }_1, \dots , \overline{v}^{\varepsilon }_K\}\) from the set of vertices of \(conv(S_\varepsilon )\).

In what follows we will first show how, under a mild assumption on the relationship between \(\varSigma _*/R\) and \(\varepsilon \), AVTA can compute a subset \(\widehat{S}_\varepsilon \) of the vertices of \(conv(S_\varepsilon )\) containing \(\overline{S}_\varepsilon \) (Theorem 9). We then show how AVTA can efficiently extract from \(\widehat{S}_\varepsilon \) the desired set, namely \(\overline{S}_\varepsilon \). The next lemma establishes a lower bound on the weak-robustness of \(conv(S_\varepsilon )\). It also shows how spurious vertices of \(conv(S_\varepsilon )\) are situated with respect to the convex hull of the remaining vertices. This will be used in Theorem 9 in pruning such vertices.

Lemma 2

Suppose conv(S) is \(\varSigma _*\)-weakly robust. Suppose \(\varepsilon < \varSigma _*/2R\). Let \(v^\varepsilon \) be any point in \(\overline{S}_\varepsilon \). Let \(\widehat{S}_\varepsilon \) be any subset of the vertices of \(conv(S_\varepsilon )\) containing \(\overline{S}_\varepsilon \). Then,

$$\begin{aligned} d(v^\varepsilon , conv(\widehat{S}_\varepsilon {\setminus } \{v^\varepsilon \})) \ge \varSigma _*-2 \varepsilon R. \end{aligned}$$
(48)

Moreover let \(\widehat{v}^\varepsilon \) be any (spurious) point in \(\widehat{S}_\varepsilon {\setminus } \overline{S}_\varepsilon \). Then

$$\begin{aligned} d(\widehat{v}^\varepsilon , conv(\widehat{S}_\varepsilon {\setminus } \{\widehat{v}^\varepsilon \})) \le 2 \varepsilon R. \end{aligned}$$
(49)

Proof

By Theorem 8, \(\overline{S}_\varepsilon \) is a subset of vertices of \(conv(S_\varepsilon )\). Given \(v^{\varepsilon } \in \overline{S}_\varepsilon \), let v be the corresponding vertex in \(\overline{S}\). Given \(w^{\varepsilon }\) in \(conv(S_\varepsilon {\setminus } \{v^{\varepsilon }\})\), let w in \(conv(S {\setminus } \{v\})\) be the corresponding point, i.e. defined with respect to the same convex combination of corresponding vertices. Then

$$\begin{aligned} \Vert v - v^{\varepsilon } \Vert \le \varepsilon R, \quad \Vert w - w^{\varepsilon } \Vert \le \varepsilon R. \end{aligned}$$
(50)

From the above it is easy to show

$$\begin{aligned} |d(v, w)- d(v^{\varepsilon }, w^{\varepsilon }) | \le 2\varepsilon R. \end{aligned}$$
(51)

But this implies

$$\begin{aligned} d(v, w) - d(v^{\varepsilon }, w^{\varepsilon }) \le 2\varepsilon R. \end{aligned}$$
(52)

Equivalently,

$$\begin{aligned} d(v, w) - 2\varepsilon R \le d(v^{\varepsilon }, w^{\varepsilon }). \end{aligned}$$
(53)

But \(d(v,w) \ge \sigma _* R= \varSigma _*\). This proves (48).

To prove (49), let \(\widehat{v}\) be the point in S corresponding to \(\widehat{v}^\varepsilon \). We have

$$\begin{aligned} \widehat{v} = \sum _{i=1}^K \alpha _i \overline{v}_i, \quad \sum _{i=1}^K \alpha _i =1, \quad \alpha _i \ge 0. \end{aligned}$$
(54)

Define

$$\begin{aligned} \widehat{w} = \sum _{i=1}^K \alpha _i \overline{v}^\varepsilon _i, \quad \sum _{i=1}^K \alpha _i =1, \quad \alpha _i \ge 0. \end{aligned}$$
(55)

It is now easy to show

$$\begin{aligned} \Vert \widehat{v}^\varepsilon - \widehat{w} \Vert \le \Vert \widehat{v}^\varepsilon - \widehat{v} \Vert +\Vert \widehat{v} - \widehat{w} \Vert \le 2 \varepsilon R. \end{aligned}$$
(56)

This proves (49). \(\square \)

Theorem 9

Let \(S= \{v_1, \dots , v_n \} \subset \mathbb {R}^m\). Assume conv(S) is \(\varSigma _*\)-weakly robust. Suppose \(\varepsilon \le \varSigma _*/ 4R\).

(i) Given any \(t \in (0,1)\), AVTA can be modified to compute a subset \(\overline{S}_\varepsilon ^t\) of the set of vertices of \(conv( S_\varepsilon )\) of cardinality \(K^{(t)}_\varepsilon \) so that the distance from each point in \(conv(S_\varepsilon )\) to \(conv(\overline{S}^t_\varepsilon )\) is at most tR. In particular, the distance from each point in conv(S) to \(conv(\overline{S}^t_\varepsilon )\) is at most \((t+ \varepsilon ) R\). The complexity of the computation of \(\overline{S}_\varepsilon ^t\) is

$$\begin{aligned} O \left( nK^{(t)}_\varepsilon \left( m+ \frac{1}{t^2} \right) \right) . \end{aligned}$$
(57)

(ii) Given \(\sigma \) satisfying \(4\varepsilon \le \sigma \le \sigma _*=\varSigma _*/R\), AVTA can be modified to compute a subset \(\widehat{S}_\varepsilon \) of the set of vertices of \(conv(S_\varepsilon )\) containing \(\overline{S}_\varepsilon \), then compute from this subset \(\overline{S}_\varepsilon \) itself. If \(K_\varepsilon \) is the cardinality of \(\widehat{S}_\varepsilon \), the total number of operations satisfies

$$\begin{aligned} O \left( nK_\varepsilon \left( m+ \frac{1}{\sigma ^{2}} \right) \right) . \end{aligned}$$
(58)

(iii) Given \(\gamma \) satisfying \(4\varepsilon \le \gamma \rho _* \le \varGamma _* \rho _*/R= \gamma _* \rho _*\), AVTA can be modified to compute a subset \(\widehat{S}_\varepsilon \) of the set of vertices of \(conv(S_\varepsilon )\) containing \(\overline{S}_\varepsilon \), then compute from this subset \(\overline{S}_\varepsilon \) itself. If \(K_\varepsilon \) is the cardinality of \(\widehat{S}_\varepsilon \), the total number of operations satisfies

$$\begin{aligned} O \left( n K_\varepsilon \left( m+ \frac{1}{ (\rho _* \gamma )^{2}} \right) \right) . \end{aligned}$$
(59)

(iv) Given only K, where \(4 \varepsilon \le \varSigma _*/R\), the number of operations of AVTA to compute \(\overline{S}_\varepsilon \) is

$$\begin{aligned} O \left( nK_\varepsilon \left( m+ \frac{1}{\sigma _*^{2}}\right) \right) \log \left( \frac{1}{\sigma _*}\right) . \end{aligned}$$
(60)

Proof

By Theorem 8, \(\overline{S}_\varepsilon \) is a subset of vertices of \(conv(S_\varepsilon )\). Let \(\sigma _\circ = (\varSigma _* - 2 \varepsilon R)/R\). Since \(\varepsilon \le \varSigma _*/4R\), we have \(\sigma _\circ \ge \varSigma _*/2R\). By Lemma 2, for each \(v^\varepsilon \in \overline{S}_\varepsilon \), we have

$$\begin{aligned} d(v^\varepsilon , conv(S_\varepsilon {\setminus } \{ v^\varepsilon \})) \ge \sigma _\circ R \ge \varSigma _*/2. \end{aligned}$$
(61)

Now consider a modification of AVTA that replaces \(\gamma /2\) by \(\sigma /2\). The modified AVTA computes a subset \(\widehat{S}_\varepsilon \) of the vertices of \(conv(S_\varepsilon )\) that must necessarily contain \(\overline{S}_\varepsilon \). Analogous to Theorem 5, (2), the complexity of this part is as stated in part (ii) of the present theorem.

Now consider \(conv(\widehat{S}_\varepsilon )\) and suppose \(v^\varepsilon \) is one of its vertices within a distance of less than \(\sigma / 2\), say \(\sigma /4\), of the convex hull of the remaining vertices. Then by Lemma 2, \(v^\varepsilon \not \in \overline{S}_\varepsilon \). We can thus apply the Triangle Algorithm to remove any vertex of \(conv(\widehat{S}_\varepsilon )\) that is within a distance of less than \(\sigma /2\) of the convex hull of the other vertices of \(conv(\widehat{S}_\varepsilon )\). Again analogous to Theorem 5, the overall complexity of this step is bounded by

$$\begin{aligned} O \bigg (m K_\varepsilon ^2+ \frac{K_\varepsilon ^2}{\sigma _\circ ^{2}} \bigg )= O \bigg (m K_\varepsilon ^2+ \frac{K_\varepsilon ^2}{\sigma ^{2}} \bigg ). \end{aligned}$$
(62)

This is dominated by the complexity of the first part. This proves (ii). The proof of (iii) follows from Theorem 7, (29), since \(\gamma \rho _* \le \sigma _*\).

To prove (iv), we start with \(\sigma =1/2\) and run AVTA, then prune unwanted vertices as in the previous case. If we end up with \(\overline{S}_\varepsilon \), i.e. with K vertices, we are done. If not, we repeat the process with \(\sigma =1/4\) and so on. Since \(\sigma _* > 0\), after \(O(\log (1/\sigma _*))\) rounds we recover \(\overline{S}_\varepsilon \), which accounts for the logarithmic factor in (60); a sketch of this halving strategy follows the proof.

The proof of (i) is analogous to the proof of Theorem 8, part (3). \(\square \)
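
To make the halving strategy of part (iv) concrete, the following is a minimal sketch in Python. Here avta_with_pruning is a hypothetical placeholder standing in for the modified AVTA with parameter \(\sigma\) followed by the pruning step described above; it is not part of the paper's implementation.

```python
def recover_perturbed_vertices(S_eps, K):
    """Sketch of part (iv): halve sigma until the pruned candidate set
    has the known cardinality K, i.e. equals the perturbed vertex set."""
    sigma = 0.5
    while True:
        # avta_with_pruning is a hypothetical stand-in: run the modified AVTA
        # with parameter sigma, then prune candidates within sigma/2 of the
        # convex hull of the remaining ones.
        candidates = avta_with_pruning(S_eps, sigma)
        if len(candidates) == K:
            return candidates
        sigma /= 2  # tolerance too coarse; refine and retry
```

Since the loop terminates once \(\sigma \le \sigma _*\), it runs for \(O(\log (1/\sigma _*))\) rounds, matching the logarithmic factor in (60).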

Remark 6

Ideally, \(K_\varepsilon \) is within a constant multiple of K, in which case the complexities are analogous to those of Theorem 5. In the worst case \(\widehat{S}_\varepsilon = S_\varepsilon \), i.e. \(K_\varepsilon = n\). On the other hand, ignoring the size of \(K_\varepsilon \), if \(\sigma _\circ \ge (\sqrt{n/K})\, \varepsilon \), then the complexity of generating the vertices of \(conv(S_\varepsilon )\) is

$$\begin{aligned} O \bigg (nm K_\varepsilon + \frac{nK}{{\varepsilon }^2} \bigg ). \end{aligned}$$
(63)

7 Triangle Algorithm with Johnson–Lindenstrauss projections

Consider again \(S= \{v_1, \dots , v_n\} \subset \mathbb {R} ^m\). We wish to compute the subset \(\overline{S}=\{\overline{v}_1, \dots , \overline{v}_K\}\) of all vertices of conv(S). The Johnson–Lindenstrauss lemma (Johnson and Lindenstrauss 1984) allows embedding the n points of S into an \(m'\)-dimensional Euclidean space \(\mathbb {R}^{m'}\), \(m' < m\), via a randomized linear map so that the distances between every pair of points in S and those of their images in \(\mathbb {R}^{m'}\) remain approximately the same, with high probability. More specifically, there is a universal constant c such that if \(\varepsilon '\) satisfies

$$\begin{aligned} \frac{c \log n}{m} \le \varepsilon '^2 < 1, \end{aligned}$$
(64)

and \(m' < m\) is an integer satisfying

$$\begin{aligned} m' \approx \frac{c \log n}{\varepsilon '^2}, \end{aligned}$$
(65)

then there exists a randomized linear map \(L: \mathbb {R}^m \rightarrow \mathbb {R}^{m'}\) so that if \(u_i=L(v_i)\), and

$$\begin{aligned} U=L(S)=\{u_1, \dots , u_n\} \subset \mathbb {R}^{m'}, \end{aligned}$$
(66)

then for each \(i, j \in \{1, \dots , n\}\) we have

$$\begin{aligned} \mathrm{Pr} \bigg (d(v_i,v_j) (1- \varepsilon ') \le d(u_i,u_j) \le d(v_i,v_j) (1+ \varepsilon ') \bigg ) > 1 - \frac{2}{n}. \end{aligned}$$
(67)

The projection of each point takes \(O(m \log n)\) operations so that the overall number of operations to project all the n points is

$$\begin{aligned} O(nm \log n). \end{aligned}$$
(68)

In this section we consider computing \(\overline{S}\), the set of vertices of conv(S) by using the Johnson–Lindenstrauss projections and then computing the set of vertices of conv(U) via AVTA. Let \(\overline{U}\) denote the set of vertices of conv(U) and let its cardinality be \(K'\). First we state some properties of conv(U).
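
As an illustration of the projection step, here is a minimal sketch, assuming the points are stored as the columns of a NumPy array and using a Gaussian random map (one standard realization of the randomized linear map; the constant c below is a stand-in for the universal constant in (64)–(65)):

```python
import numpy as np

def jl_project(S, eps_prime, c=2.0, seed=0):
    """Project the columns of S (shape (m, n)) into dimension
    m' ~ c log(n) / eps_prime^2 with a Gaussian random linear map."""
    m, n = S.shape
    m_prime = int(np.ceil(c * np.log(n) / eps_prime ** 2))
    rng = np.random.default_rng(seed)
    # Entries N(0, 1/m'), so squared lengths are preserved in expectation.
    L = rng.normal(0.0, 1.0 / np.sqrt(m_prime), size=(m_prime, m))
    return L @ S  # U = L(S)
```

For this to be a dimension reduction one needs \(m' < m\), i.e. n and \(\varepsilon '\) in the range (64).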

Lemma 3

Given \(v \in S\), \(L: \mathbb {R}^m \rightarrow \mathbb {R}^{m'}\), a randomized linear map, suppose \(u=L(v)\) is a vertex of conv(U). Then v is a vertex of conv(S).

Proof

Suppose v is not a vertex of conv(S). Then \(v=\sum _{i=1}^n \alpha _i v_i\), \(\sum _{i=1}^n \alpha _i=1\), \(\alpha _i \ge 0\), \(i=1, \dots , n\), with some \(0<\alpha _j <1\). By linearity of L we have

$$\begin{aligned} u=L(v)= \sum _{i=1}^n \alpha _i L(v_i) = \sum _{i=1}^n \alpha _i u_i. \end{aligned}$$
(69)

This implies u is not a vertex of conv(U), a contradiction. \(\square \)

The next theorem gives an estimate of the robustness parameters of conv(U) in terms of those of conv(S).

Theorem 10

Suppose conv(S) is \(\varGamma _*\)-robust and \(\varSigma _*\)-weakly robust. Let \(U=L(S)\), L a randomized linear map, \(L: \mathbb {R}^m \rightarrow \mathbb {R}^{m'}\). Let \(m'\) and \(\varepsilon '\) be related as in (65). If conv(U) is \(\varGamma _*'\)-robust, \(\varSigma _*'\)-weakly robust, then with probability at least \((1-2/n)\), we have

$$\begin{aligned} \varGamma _*' \ge \varGamma _* (1- \varepsilon '), \quad \varSigma _*' \ge \varSigma _* (1- \varepsilon '). \end{aligned}$$
(70)

Proof

Suppose u is a vertex of conv(U) and \(\widehat{U}\) a subset of its vertices not containing u. Let v and \(\widehat{S}\) be the preimages of u and \(\widehat{U}\) under the linear map L. By Lemma 3, v and the elements of \(\widehat{S}\) are all vertices of conv(S). From (67) it is easy to argue that with probability at least \((1-2/n)\) we have

$$\begin{aligned} d(u, conv(\widehat{U})) \ge d(v, conv(\widehat{S}))(1- \varepsilon '). \end{aligned}$$
(71)

The claimed inequalities follow. \(\square \)

From Theorem 10 and Theorem 5 we can state the following:

Theorem 11

Given \(S= \{v_1, \dots , v_n\} \subset \mathbb {R} ^m\) let \(U = L(S) =\{ u_1, \dots , u_n\} \subset \mathbb {R}^{m'}\), L a randomized linear map, \(m'\), \(\varepsilon '\) as before. Let \(\overline{U}= \{ \overline{u}_1, \dots , \overline{u}_{K_{\varepsilon '}} \}\) be the set of vertices of conv(U). Suppose conv(S) is \(\varGamma _*\)-robust and conv(U) is \(\varGamma _*'\)-robust. Then with probability at least \((1-2/n)\),

(1) The number of arithmetic operations of AVTA to compute \(\overline{U}\) is

$$\begin{aligned} O \bigg (nK_{\varepsilon '} \Big (m'+ \frac{R^2}{\varGamma _*'^{2}} \Big ) \bigg )= O \bigg (\frac{n K_{\varepsilon '} \log n}{\varepsilon '^2}+ \frac{nK_{\varepsilon '}}{\gamma _*^{2}(1 -\varepsilon ')^2} \bigg ). \end{aligned}$$
(72)
(2) Given any prescribed \(t \in (0,1)\), AVTA in

$$\begin{aligned} O \bigg (nK_{\varepsilon '}^t \Big (m'+ \frac{1}{t^2} \Big ) \bigg )= O \bigg ( \frac{n K_{\varepsilon '}^t \log n}{\varepsilon '^2}+ \frac{nK_{\varepsilon '}^t}{t^2} \bigg ) \end{aligned}$$
(73)

operations can compute a subset \(\overline{U}^t\) of \(\overline{U}\) of size \(K_{\varepsilon '}^t\) so that the distance from each point in conv(U) to \(conv(\overline{U}^t)\) is at most t. \(\Box \)

Remark 7

The results in this section and the above theorem suggest a heuristic approach as an alternative to using AVTA directly to compute all the vertices of conv(S): compute \(U=L(S)\), the Johnson–Lindenstrauss projection of S under a randomized linear map L. Then apply AVTA to compute \(\overline{U}\), the set of all vertices of conv(U). This identifies \(|\overline{U}| \le K\) vertices of conv(S). Next, move up to the full dimension and continue with AVTA to recover the remaining vertices of conv(S). Alternatively, we can repeat randomized projections and compute the corresponding vertices (a sketch of this repeated-projection scheme is given in Sect. 8). We would have to remove duplicates, which is not difficult since each computed vertex is stored via its vector of convex combination coefficients. We would expect that when a sufficient number of projections is applied, all vertices of conv(S) can be recovered. In the remainder of the section, however, we analyze the probability that under a random projection the projection of a vertex of conv(S) is a vertex of the projection.

In what follows we first state a result on Johnson–Lindenstrauss random projections on the convex hull membership problem from Vu et al. (2017). Next we state an alternative result.

Proposition 5

(Vu et al. 2017, Proposition 3.3) Given \(S= \{v_1, \dots , v_n \} \subset \mathbb {R}^m\), \(p \in \mathbb {R}^m\) such that \(p \not \in conv(S)\), let \(d= \min \{d(p,x): x \in conv(S)\}\) and \(D= \max \{ d(p,v_i): i=1, \dots , n\}\). Let \(T: \mathbb {R}^m \rightarrow \mathbb {R}^k\) be a random linear map. Then

$$\begin{aligned} \mathrm{Prob} \bigg ( T(p) \not \in T(conv(S)) \bigg ) \ge 1- 2n^2 e^{-c(\varepsilon ^2 - \varepsilon ^3)k} \end{aligned}$$
(74)

for some constant c (independent of m, n, k, d, D) and \(\varepsilon < d^2/D^2\).

Remark 8

Note that \(k= O(\ln n/ \varepsilon ^2)= O ( \ln n \, D^4/d^4)\).

The following is an alternative to Proposition 5 based on the Distance Duality theorem (1) and generally gives a better estimate of \(\varepsilon \), hence a smaller k than Proposition 5.

Theorem 12

Given \(S= \{v_1, \dots , v_n \} \subset \mathbb {R}^m\), \(p \in \mathbb {R}^m\) such that \(p \not \in conv(S)\), let \(d= \min \{d(p,x): x \in conv(S)\}\), \(p_*= \mathrm{argmin} \{d(p,x): x \in conv(S)\}\) and \(D= \max \{ d(p,v_i): i=1, \dots , n\}\). Let

$$\begin{aligned} E= \min \bigg \{ \frac{d(p, v_i)}{d(p_*, v_i)} : i=1, \dots , n \bigg \}. \end{aligned}$$
(75)

Let \(T: \mathbb {R}^m \rightarrow \mathbb {R}^k\) be a random linear map. Then

$$\begin{aligned} \mathrm{Prob} \bigg (T(p) \not \in T(conv(S)) \bigg ) \ge 1- 2n^2 e^{-c\varepsilon ^2 k}, \end{aligned}$$
(76)

for some constant c (independent of m, n, k, d, D) and \(\varepsilon < (E-1)/(E+1)\). Furthermore, \({(E-1)}/{(E+1)} > {d^2}/{4D^2}\).

Proof

Since \(p_*\) is the closest point to p in conv(S), it is easy to show that it is a p-witness, i.e.

$$\begin{aligned} d(p_*, v_i) < d(p,v_i), \quad \forall i=1, \dots , n. \end{aligned}$$
(77)

Let \(\overline{p}=T(p)\), \(\overline{p}_*=T(p_*)\), and for \(i=1, \dots , n\), \(\overline{v}_i=T(v_i)\). We now consider the set of \(n+1\) points \(\{v_0=p, v_1, \dots , v_n\}\) and their random projections, and find a condition on \(\varepsilon \) under which \(\overline{p}_*\) is, probabilistically, a \(\overline{p}\)-witness with respect to T(conv(S)). By the Johnson–Lindenstrauss lemma we have,

$$\begin{aligned} \mathrm{Prob} \bigg ( (1- \varepsilon ) d(v_i, v_j) \le d(\overline{v}_i, \overline{v}_j) \le (1+ \varepsilon ) d(v_i, v_j) \bigg ) \ge 1- 2(n+1)^2 e^{-c \varepsilon ^2 k}, \end{aligned}$$
(78)

for some constant c (independent of m, n, k). From (78) and the definition of E, for each \(i=1, \dots , n\), with probability at least \(1- 2(n+1)^2 e^{-c \varepsilon ^2 k}\) we have,

$$\begin{aligned} d(\overline{p}_*, \overline{v}_i) \le (1+ \varepsilon ) d(p_*, v_i) \le \frac{(1+ \varepsilon )}{E} d(p, v_i) \le \frac{(1+ \varepsilon )}{(1- \varepsilon )} \frac{1}{E} d( \overline{p}, \overline{v}_i). \end{aligned}$$
(79)

Note that assuming \(n \ge 2\), \( 1< E < \infty \). We thus restrict \(\varepsilon \) to satisfy

$$\begin{aligned} \frac{(1+ \varepsilon )}{(1- \varepsilon )} \frac{1}{E} < 1. \end{aligned}$$
(80)

Equivalently,

$$\begin{aligned} \varepsilon < \frac{E-1}{E+1}. \end{aligned}$$
(81)

Thus with \(\varepsilon \) satisfying the above, \(\overline{p}_*\) is a \(\overline{p}\)-witness with high probability.

Next we find a lower bound on the right-hand side of the above. Since E is finite, \(E=d(p, v_j)/d(p_*, v_j)\) for some j, i.e. \(p_* \not = v_j\). Consider the triangle with vertices p, \(v_j\) and \(p_*\). Since \(p_*\) is the point of conv(S) closest to p, the angle of this triangle at \(p_*\) is at least \(\pi /2\); hence with \(d(p,p_*)\) and \(d(p,v_j)\) fixed, the maximum value of \(d(p_*, v_j)\) is \(\sqrt{d^2(p,v_j) - d^2(p,p_*)}\). Using this we may write

$$\begin{aligned} E=\frac{d(p, v_j)}{d(p_*, v_j)} \ge \frac{d(p,v_j)}{\sqrt{d^2(p,v_j) - d^2(p,p_*)}} = \frac{1}{\sqrt{1- d^2(p,p_*)/d^2(p,v_j)}}. \end{aligned}$$
(82)

But \(d(p,p_*)=d\) and \(d(p,v_j) \le D\). Thus

$$\begin{aligned} E \ge \frac{1}{\sqrt{1- d^2/D^2}} = \frac{D}{\sqrt{D^2- d^2}}. \end{aligned}$$
(83)

The function \((x-1)/(x+1)\) is monotonically increasing. Thus from (83) we have

$$\begin{aligned} \frac{E-1}{E+1} \ge \frac{D -\sqrt{D^2- d^2}}{D +\sqrt{D^2- d^2}} = \frac{d^2}{ (D +\sqrt{D^2- d^2})^2} \ge \frac{d^2}{4D^2}. \end{aligned}$$
(84)

\(\square \)

Remark 9

We would expect \((E-1)/(E+1)\) to generally be larger than \(d^2/4D^2\). Thus Theorem 12 generally gives a better estimate of \(\varepsilon \) and k than Proposition 5. An additional advantage of Theorem 12 is that it shows the applicability of the Triangle Algorithm to solving the convex hull membership problem using random projections.

We now state a corollary of the theorem on computation of all vertices of conv(S).

Corollary 2

Given \(S= \{v_1, \dots , v_n \} \subset \mathbb {R}^m\), suppose conv(S) is \(\varGamma _*\)-robust. Let R be the diameter of S. Suppose \(v_j\) is a vertex of conv(S). Let \(T: \mathbb {R}^m \rightarrow \mathbb {R}^k\) be a random linear map. Then the probability that \(T(v_j)\) is a vertex of T(conv(S)) is at least \(1- 2n^2 e^{-c\varepsilon ^2k}\), for some constant c (independent of m, n, k) and \(\varepsilon < \gamma _*^2/4\).

Proof

We apply the previous theorem with \(v_j\) playing the role of p, considering the probability that under a random projection \(T(v_j)\) lies in the projection of the convex hull of the remaining points. Note that \(d(v_j, conv(S {\setminus } \{v_j\})) \ge \varGamma _*\) and \(\max \{d(v_j, v_i): v_i \in S {\setminus } \{v_j\}\} \le R\). Thus we can replace d/D in (84) in the previous theorem by \(\gamma _*=\varGamma _*/R\), and write \((E-1)/(E+1) \ge {\gamma _*^2}/{4}\). This gives the upper bound on \(\varepsilon \). \(\square \)

7.1 AVTA under perturbation and Johnson–Lindenstrauss projection

Let \(S_\varepsilon \) be as before, let \(U_\varepsilon \subset \mathbb {R}^{m'}\) be the corresponding perturbation of U, and let \(\overline{U}_\varepsilon \) be the perturbation of \(\overline{U}\). Based on the results in this section and the previous complexity bounds we have:

Theorem 13

Let \(S= \{v_1, \dots , v_n \} \subset \mathbb {R}^m\). Assume conv(S) is \(\varSigma _*\)-weakly robust. Suppose \(\varepsilon < \varSigma _*/ 4R\). Let \(\sigma _\circ = (\varSigma _* - 2 \varepsilon R)/R= \sigma _* - 2 \varepsilon \). Then with probability at least \((1-2/n)\),

(i) AVTA can be modified to compute a subset \( \widehat{U}_\varepsilon \) of \(U_\varepsilon \), of cardinality \(K_{\varepsilon \varepsilon '}\), containing \(\overline{U}_\varepsilon \). AVTA can then compute from this subset \(\overline{U}_\varepsilon \) itself, where the total number of operations satisfies

$$\begin{aligned} O \left( nm' K_{\varepsilon \varepsilon '}+ \frac{n K_{\varepsilon \varepsilon '}}{\sigma _\circ ^{2}(1 - \varepsilon ')^2} \right) = O \left( \frac{n K_{\varepsilon \varepsilon '} \log n }{\varepsilon '^2}+ \frac{n K_{\varepsilon \varepsilon '}}{\sigma _\circ ^{2}(1 - \varepsilon ')^2} \right) . \end{aligned}$$
(85)
(ii) Given any prescribed \(t \in (0,1)\), in

$$\begin{aligned} O \left( \frac{n K^{(t)}_{\varepsilon \varepsilon '} \log n}{\varepsilon '^2}+ \frac{nK^{(t)}_{\varepsilon \varepsilon '}}{\sigma _\circ ^2 (1 - \varepsilon ')^2}\right) \end{aligned}$$
(86)

operations the modified AVTA can compute a subset \(\overline{U}^t_\varepsilon \) of \(\overline{U}_\varepsilon \) of size \(K^{(t)}_{\varepsilon \varepsilon '}\) so that the distance from each point in \(conv(U_\varepsilon )\) to \(conv(\overline{U}^t_\varepsilon )\) is at most t.

8 Applications

While the modified AVTA algorithm comes with theoretical guarantees, in certain cases the algorithm might output many more vertices, \(K_\varepsilon \), than desired. Here we present a practical implementation that always outputs exactly K vertices, provided K is known. When K is unknown, our experiments in the next section reveal that the algorithm can automatically detect a slightly larger set that contains a good approximation to the K vertices of interest. Notice that we want a fast way to detect good approximations to the original vertices of the set S and prune out spurious points, i.e., additional vertices of the set \(S_\varepsilon \). The key insight on top of the AVTA algorithm is the following: If the perturbed set is randomly projected onto a lower dimensional space, it is more likely for an original vertex to still be a vertex than for a spurious vertex. Using this insight the algorithm outlined below runs the modified AVTA algorithm over several random projections and outputs the set of points that appear as vertices in many random projections.

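A minimal sketch of this outline is given below. Vertex detection on each projected set is delegated to SciPy's ConvexHull purely as a stand-in for AVTA, and the projection dimension and number of projections are illustrative parameters, not values from the paper.

```python
import numpy as np
from collections import Counter
from scipy.spatial import ConvexHull  # stand-in for AVTA on the projected set

def vote_vertices(S_eps, K, m_prime=5, n_proj=20, seed=0):
    """Detect vertices over several random projections and keep the K points
    that appear as vertices most often. Columns of S_eps are the points."""
    m, n = S_eps.shape
    rng = np.random.default_rng(seed)
    votes = Counter()
    for _ in range(n_proj):
        L = rng.normal(size=(m_prime, m)) / np.sqrt(m_prime)
        U = (L @ S_eps).T                 # projected points, one per row
        votes.update(ConvexHull(U).vertices.tolist())
    return [i for i, _ in votes.most_common(K)]
```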

We now show how AVTA can be used to solve various problems in computational geometry and machine learning.

Application of AVTA in linear programming Consider the linear programming feasibility problem of testing whether \( P=\{x \in \mathbb {R}^n: Ax=b, x \ge 0\}\) is nonempty, where A is \(m \times n\) and \(b \in \mathbb {R}^m\). Suppose n is much larger than m. If we can reduce the number of columns of A, the problem becomes more efficient to solve, no matter what algorithm we use.

Proposition 6

Given \( P=\{x \in \mathbb {R}^n: Ax=b, x \ge 0\}\), let conv(A) denote the convex hull of the columns of A. Let \(A'\) denote the \(m \times n'\) submatrix of A whose columns form the set of all vertices of conv(A). Let

$$\begin{aligned} P'= \{x' \in \mathbb {R}^{n'}: A'x'=b, x' \ge 0\}. \end{aligned}$$
(87)

Then P is feasible if and only if \(P'\) is feasible.

Proof

Clearly, if \(P'\) is feasible then P is feasible. Assume P is feasible. Thus for some \(x \in \mathbb {R}^n\), \(x \ge 0\), \(Ax=b\). Denote the columns of A by \(a^{(i)}\). Then each \(a^{(i)}\) is a convex combination of columns of \(A'\). That is, for each \(i=1, \dots , n\), there exists

$$\begin{aligned} \alpha ^{(i)} \in S_{n'}= \{s \in \mathbb {R}^{n'}: \sum _{j=1}^{n'} s_j=1, s \ge 0 \}, \end{aligned}$$
(88)

where

$$\begin{aligned} a^{(i)} = A' \alpha ^{(i)}. \end{aligned}$$
(89)

Thus

$$\begin{aligned} Ax= \sum _{i=1}^n x_i a^{(i)} = \sum _{i=1}^n x_i A' \alpha ^{(i)} = A' \sum _{i=1}^n x_i \alpha ^{(i)}. \end{aligned}$$
(90)

Letting

$$\begin{aligned} x'= \sum _{i=1}^n x_i \alpha ^{(i)}, \end{aligned}$$
(91)

\(A'x'=b\), \(x' \ge 0\). \(\square \)

Proposition 7

Assume \( P=\{x \in \mathbb {R}^n: Ax=b, x \ge 0\}\) is nonempty. Consider the linear program \(\min \{c^Tx : x \in P\}\). Let B be the \((m+1) \times n\) matrix whose first row is \(c^T\) and the remaining rows are A. Let \(B'\) be the \((m+1) \times n'\) matrix whose columns form the vertices of the convex hull of the columns of B. Let \(c'^T\) be the first row of \(B'\) and \(A'\) the remaining \(m \times n'\) submatrix of \(B'\). Then

$$\begin{aligned} \min \{c^Tx: Ax=b, x \ge 0\} = \min \{c'^Tx': A'x'=b, x' \ge 0\}. \end{aligned}$$
(92)

Proof

Consider any feasible solution \(x_0\) of the original LP. Then by Proposition 6 applied to B, the set \(\{x': c'^Tx' =c^Tx_0, A'x'=b, x' \ge 0\}\) is nonempty. This implies the original LP has a finite optimal value if and only if the restricted problem does, and the optimal objective values of the two problems coincide. \(\square \)

The above propositions imply that AVTA has potential applications in reducing the size of LP feasibility or optimization problems, whether we solve the problem via the simplex method or other methods.
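
As a sketch of how Proposition 6 reduces a large feasibility problem, assume the vertex columns of conv(A) have already been identified (e.g. by AVTA; vertex_indices below is that hypothetical output):

```python
import numpy as np
from scipy.optimize import linprog

def reduced_feasibility(A, b, vertex_indices):
    """Test feasibility of {x : Ax = b, x >= 0} on the reduced system
    A'x' = b, x' >= 0, where A' keeps only the vertex columns of conv(A)."""
    A_red = A[:, vertex_indices]
    n_red = A_red.shape[1]
    res = linprog(c=np.zeros(n_red), A_eq=A_red, b_eq=b,
                  bounds=[(0, None)] * n_red, method="highs")
    return res.status == 0  # 0: optimum found, i.e. feasible; 2: infeasible
```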

AVTA for topic modeling in the presence of anchor words  Arora et al. (2013) provide a practical algorithm for topic modeling with provable guarantees. Their algorithm works under the assumption that the topic-word matrix is separable. In particular, they assume that corresponding to each topic i, there exists an anchor word \(w_i\) that has a nonzero probability of appearing only under topic i. Under this assumption, the algorithm of Arora et al. (2013) consists of two stages: (a) find the anchor words, and (b) use the anchor words to learn the topic-word matrix. The problem of finding anchor words corresponds to finding the vertices of the convex hull of the word-word covariance matrix. They propose an algorithm named fast anchor words in order to find the vertices. Since AVTA works in a more general setting, we can instead use AVTA to find the anchor words. Additionally, the fast anchor words algorithm needs the number of anchor words as an input. On the other hand, from the statements of Theorems 5 and 9 it is easy to see that AVTA can work in a variety of settings when other properties of the data, such as robustness, are known. We argue that robustness is a parameter that can be tuned more easily than trying different values of the number of anchor words. In fact, one can artificially add random noise to the data and make it robust up to a certain value. One can then run AVTA with the lower bound on robustness as input and let the algorithm automatically discover the number of anchor words. This is much more desirable in practical settings. Our first implementation, named AVTA+RecoverL2, uses AVTA to detect anchor words and then uses the anchor words to learn the topic-word matrix using the approach from Arora et al. (2013). AVTA is also theoretically superior to fast anchor words and achieves slightly better run times in the regime where the number of topics is \(o(\log n)\), where n is the number of words in the vocabulary. This is usually the case in most practical scenarios.

AVTA for topic modeling in the absence of anchor words The presence of anchor words is a strong assumption that often does not hold in practice. Recently, Bansal et al. (2014) designed a new practical algorithm for topic models under the presence of catch words. Catch words for topic i correspond to a set \(S_i\) of words whose total probability of appearing under topic i is significantly higher than under any other topic. Their algorithm, called TSVD, achieves a much better reconstruction of the topic-word matrix in terms of the \(\ell _1\) error. They also assume that for each topic i, there are a few dominant documents that mostly contain words from topic i. The TSVD algorithm works in two stages. In stage 1, the (thresholded) word-document data matrix is projected onto a K-SVD space to compute a different embedding of the documents. Then, the documents are clustered into K clusters. Under the assumptions mentioned above, one can show that the dominant documents for each topic will be clustered correctly. In stage 2, a simple post-processing algorithm approximates the topic-word matrix from the clustering.

We improve on TSVD by asking the following question: is K-SVD the right representation of the data? Our key insight is that if dominant documents are present in each topic, it is easy to show that most other documents will be approximated by a convex combination of the dominant documents. Furthermore, the coefficients in these convex combinations provide a much more faithful low dimensional embedding of the data. Using this insight, we propose a new algorithm that runs AVTA on the data matrix to detect vertices and to approximate each point by a convex combination of the vertices. We then use the coefficient matrix as the new representation of the data to be clustered. Once the clustering is obtained, the same post-processing step from Bansal et al. (2014) can be used to recover the topic-word matrix. Our results show that the embedding produced by AVTA leads to a much better reconstruction error than that produced by TSVD. Furthermore, K-SVD is an expensive procedure and very sensitive to the presence of outliers in the data. In contrast, our new algorithm, called AVTA+CatchWord, is much more stable to noise in the data.

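A minimal sketch of the embedding-and-clustering step just described follows. The nonnegative least-squares fit is a simple stand-in for the convex combination coefficients AVTA maintains, and the vertex matrix is assumed to have been computed already:

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans

def catchword_embedding(X, V):
    """X: (m, n) word-document matrix; V: (m, K) matrix of detected vertices.
    Represent each document by its (approximate) convex combination
    coefficients over the vertices, then cluster those coefficient vectors."""
    K, n = V.shape[1], X.shape[1]
    coeffs = np.zeros((K, n))
    for j in range(n):
        coeffs[:, j], _ = nnls(V, X[:, j])  # alpha >= 0; approx. sums to 1
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(coeffs.T)
    return coeffs, labels
```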

AVTA for NMF The work of Arora et al. (2012a) showed that convex hull detection can be used to solve the non-negative matrix factorization problem under the separability assumption. We show that using the more general AVTA algorithm to solve the convex hull problem results in comparable performance guarantees.

9 Applications and experiments

Resources: https://github.com/yikaizhang/AVTA.

9.1 Feasibility problem

In this section, we present experimental results which empirically show that when the problem is 'overcomplete', AVTA can serve as a 'shortcut' solution. In other words, we are given as data an \(m\times n\) matrix A, where the convex hull of the columns of A, denoted by conv(A), has K vertices with \(K \ll n\). We apply AVTA to solve two classical problems which appear in many applications.

Convex hull membership problem In the experiments, the vertices of the convex hull are generated from the Gaussian distribution, i.e. \(v_i \sim \mathcal {N}(0,\mathcal {I}_m), i \in [K]\). Having generated the vertices, the 'redundant' points \(d_j \in conv(S), j\in [n-K]\), are produced as random convex combinations \(d_j=\sum _{i=1}^{K} \alpha _i v_i\), where the \(\alpha _i\) are standard uniform random variables scaled so that \(\sum _{i=1}^{K} \alpha _i=1\). Specifically, we fix \(K=100\) and \(m=50\) and vary n from 5000 to 500,000. We compare the efficiency of four algorithms on the convex hull membership query problem: the simplex method (Chvatal 1983), the Frank–Wolfe algorithm (FW) (Jaggi 2013), the Triangle Algorithm (TA) (Kalantari 2015), and our algorithm.
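
A sketch of this data-generation procedure (parameter defaults mirror the values above):

```python
import numpy as np

def membership_instance(K=100, m=50, n=5000, seed=0):
    """K Gaussian vertices plus n - K redundant points drawn as random
    convex combinations of the vertices; columns are the points of S."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(m, K))            # v_i ~ N(0, I_m)
    W = rng.uniform(size=(K, n - K))
    W /= W.sum(axis=0, keepdims=True)      # scale so each column sums to 1
    return np.hstack([V, V @ W])           # vertices followed by the d_j's
```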

Table 2 Running time of convex hull membership (s)

Results on convex hull membership query Table 2 shows that when \(n \gg K\), AVTA is more efficient than the other algorithms at solving the convex hull membership problem. This result supports the output-sensitivity property of AVTA.

Linear programming feasibility The linear programming feasibility problem is to find a feasible solution of:

$$\begin{aligned} A\alpha =p, \quad \alpha \ge 0. \end{aligned}$$
(93)

In other words, to test whether \(p \in cone(A)\), the cone generated by the columns \(A_j\) of A. In the case when A is overcomplete, any feasible p can be represented using only the generators of cone(A), the set \(\bar{A} \subset A\). By scaling A so that the columns of \(AD\; (D_{ii}= \frac{b}{a\cdot A_i})\) lie in an \((m-1)\)-dimensional hyperplane \(\langle a, \alpha \rangle =b\), one can find the generators of cone(A) by finding the vertices of the convex hull of the projected points. This can be done efficiently by AVTA. If we have a linear system A and a series of query points p, it is sufficient to run AVTA once for dimension reduction and then solve each subproblem \(\bar{A}\alpha '=p, \alpha ' \ge 0\) using the simplex method; the scaling step is sketched below.
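
A minimal sketch of the scaling step (the hyperplane normal a must satisfy \(a \cdot A_i > 0\) for all columns, which holds e.g. for a positive a when A is entrywise positive, as in the experiment below):

```python
import numpy as np

def scale_to_hyperplane(A, a, b=1.0):
    """Scale the columns of A by D_ii = b / (a . A_i) so that the columns of
    A D lie on the hyperplane <a, x> = b; assumes a . A_i > 0 for all i."""
    scales = b / (a @ A)   # the diagonal entries D_ii
    return A * scales      # columns of A D
```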

We compare the running time of the simplex method with that of AVTA+simplex. The generator matrix \(\bar{A}\) is an entrywise independent uniform(0, 1) random matrix, and the 'overcomplete' part \(\bar{A}^c=A {\setminus } \bar{A}\) is generated as \(\bar{A}^c=\bar{A}B\), where \(B\in \mathbb {R}^{ K \times (n-K)}\) is an entrywise independent uniform(0, 10) random matrix. We set the number of generators \(K=100\), the dimension \(m=50\), and the number of 'redundant' columns \(n=50{,}000\). We set half of the query points to be feasible and the rest infeasible. The feasible points p are generated as \(p=Ax\), where \(x\in \mathbb {R}^{n}\) is an entrywise independent uniform(0, 1) random vector, and the infeasible points are generated in the same way as the generators.

Fig. 4 (a) Running time for algorithms to find a feasible solution. (b) Running time for algorithms to find all vertices

Table 3 Running time of linear programming feasibility (s)

It can be observed from Fig. 4a and Table 3 that the running time of AVTA+Simplex shows no obvious increase, while that of Simplex increases drastically. This suggests potential applications of AVTA in linear programming feasibility problems.

9.2 Computing all vertices

Compute vertices of convex hull In this section, we compare the efficiency of AVTA with Quickhull (Barber et al. 1996), another popular algorithm for finding vertices. We generate vertices according to a Gaussian distribution \(\mathcal {N}(0,10) ^m\). Having generated K such points, n interior points are generated as convex combinations of the vertices, where the weights are i.i.d. uniform random variables scaled to sum to one.

Experiment and results In the experiment, we set \(K=100\), \(n=500\) and vary m from 2 to 12.

The computational results are shown in Table 4. In high dimensions (\(m \ge 9\)), when conv(S) is \(\gamma \)-robust for some \(\gamma >0\), AVTA successfully finds all vertices of the convex hull efficiently, while Quickhull is hampered by the explosion of its complexity with the dimension m.

Table 4 Running time (s)

Compute vertices of simplex in high dimension The Fast Anchor Words algorithm can be used to detect the vertices of a simplex. In this section, we compare the efficiency of AVTA with Fast Anchor Words when the convex hull is a simplex with \(K=50\) and \(m=100\). The number of points n varies from 100 to 100,000.

Results on running time in the simplex case

The running time comparison between AVTA and Fast Anchor Words in the simplex case is presented in Fig. 5c. In the regime \(n\ge 30{,}000\), AVTA has a lower running time.

Compute vertices with perturbation In this section, we compare the robustness of AVTA with multiple random projections, presented in Sect. 8, with that of Fast Anchor Words (Arora et al. 2013). Instead of the actual set of points S, the algorithm is given as input a perturbed set \(S_\circ \), i.e. S corrupted by some noise. Having fixed \(K=100\), \(n=500\), \(m=100\), we choose a Gaussian perturbation from \(\mathcal {N}(0,\tau ) ^m\), where \(\tau \) varies from 0.3 to 3. For the case of a general convex hull, we present a failure of Fast Anchor Words at computing the vertices. The data is generated by setting \(\tau =0.3\), \(m=50\), \(n=500\) and letting K vary from 10 to 100. We carry out an error analysis and evaluate the output of the algorithms by measuring the \(\ell _2\) distance between the true vertices and the convex hull of the output vertices of the two algorithms. More precisely, given a true vertex \(v_i \in S\) and the output \(\widehat{S}\) of an algorithm, the error in recovering \(v_i\) is defined to be \(\min \nolimits _{u\in conv(\widehat{S})} \; \Vert u-v_i\Vert _2\). We add up all the errors to get the total accumulated error.
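
A sketch of this error measure; projecting a point onto \(conv(\widehat{S})\) is done here with a sum-to-one-augmented nonnegative least squares, a standard trick rather than the paper's solver:

```python
import numpy as np
from scipy.optimize import nnls

def recovery_error(V_true, S_hat, w=1e3):
    """Total accumulated error: sum over true vertices v_i (columns of V_true)
    of min_{u in conv(S_hat)} ||u - v_i||_2. The heavily weighted row of ones
    softly enforces the sum-to-one constraint on the coefficients."""
    M = np.vstack([S_hat, w * np.ones(S_hat.shape[1])])
    total = 0.0
    for v in V_true.T:
        alpha, _ = nnls(M, np.append(v, w))
        total += np.linalg.norm(S_hat @ alpha - v)
    return total
```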

Results on computing perturbed vertices

The recovery errors in the robustness comparison are shown in Table 5. AVTA with multiple random projections achieves a better recovery error in the simplex case.

It can also be observed from Fig. 5b that in the general case, as the number of vertices exceeds the dimension, Fast Anchor Words fails to recover more vertices and its error explodes.

Fig. 5 (a) Recovery error, computing vertices (simplex case). (b) Recovery error, computing vertices (general convex hull). (c) Running time (s) of computing vertices of a simplex. (d) \(\ell _1\) error in the semi-synthetic dataset. (e) \(\ell _1\) error in the perturbed semi-synthetic dataset. (f) Range of the \(\ell _1\) error over 10 runs on the noisy semi-synthetic dataset

9.3 Topic modeling

We compare our algorithms with the Fast Anchor + Recoverl2 algorithm of Arora et al. (2013) and the TSVD algorithm of Bansal et al. (2014) on two types of data sets: semi-synthetic data and real world data. We next describe our methodology and empirical results in detail.

Semi synthetic data For the semi-synthetic data set, we use a methodology similar to that of Arora et al. (2013). We first train the model on a real data set using Gibbs sampling with 1000 iterations. We choose 50 as the number of topics, following Bansal et al. (2014). Given the parameters learned from the data set, we generate documents with \(\alpha \) set to 0.01. The average document length is 1000. The reconstruction error is measured by the \(\ell _1\) distance of bipartite-matched pairs between the true word-topic distribution and the recovered word-topic distribution (Arora et al. 2013). We then average the errors to compute the final mean error.
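
A sketch of this matched \(\ell _1\) metric, using an optimal bipartite matching between true and recovered topic columns (our reading of the metric; the matching is computed with the Hungarian algorithm from SciPy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_l1_error(A_true, A_hat):
    """Mean l1 error between columns of the true and the recovered
    word-topic matrices after an optimal matching of topics."""
    # cost[i, j] = l1 distance between true topic i and recovered topic j
    cost = np.abs(A_true[:, :, None] - A_hat[:, None, :]).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```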

Real data We use the NIPS data set with 1500 documents and a pruned vocabulary of 2000 words, and the NYTimes corpus with 30,000 subsampled documents and a pruned vocabulary of 5000 words. For the real world data sets, as in prior works (Arora et al. 2013; Bansal et al. 2014), we evaluate the coherence to measure topic quality (Yao et al. 2009). Given a set of words \(\mathcal {W}\) associated with a learned topic, the coherence is computed as \(Coherence(\mathcal {W})= \sum _{w_1,w_2 \in \mathcal {W}} \log \frac{D(w_1,w_2)+\epsilon }{D(w_2)}\), where \(D(w_1)\) and \(D(w_1,w_2)\) are the number of documents in which \(w_1\) appears and in which \(w_1, w_2\) appear together, respectively (Arora et al. 2013), and \(\epsilon \) is set to 0.01 to avoid pairs \(w_1,w_2\) that never co-occur (Stevens et al. 2012). The total coherence is the sum of the coherence of each topic. In the NIPS dataset, 1000 out of the 1500 documents were selected as the training set to learn the word-topic distributions; the rest were used as the testing set.
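
A direct transcription of this coherence score (summing over ordered pairs of distinct words, as the formula is stated; docs is assumed to be a list of sets of word ids):

```python
import numpy as np

def coherence(topic_words, docs, eps=0.01):
    """Coherence(W) = sum over pairs (w1, w2) of log((D(w1,w2) + eps) / D(w2)),
    where D counts the documents containing the given word(s)."""
    score = 0.0
    for w1 in topic_words:
        for w2 in topic_words:
            if w1 == w2:
                continue
            d2 = sum(1 for d in docs if w2 in d)
            d12 = sum(1 for d in docs if w1 in d and w2 in d)
            if d2 > 0:
                score += np.log((d12 + eps) / d2)
    return score
```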

Implementation details We compare four algorithms: AVTA+CatchWord, TSVD, Fast Anchor + Recoverl2, and AVTA+Recoverl2. We implement our own version of Fast Anchor + Recoverl2 as described in Arora et al. (2013). TSVD is implemented using the code provided by the authors of Bansal et al. (2014). AVTA+Recoverl2 corresponds to using AVTA to detect anchor words from the word-word covariance matrix and then using the Recoverl2 procedure from Arora et al. (2013) to get the topic-word matrix. AVTA+CatchWord corresponds to finding the low dimensional embedding of each document in terms of the coefficient vector of its representation in the convex hull of the vertices. The next step is to cluster these points. In practice, one could use Lloyd's algorithm for this step, which can be sensitive to initialization. To remedy this, we use an initialization heuristic similar to that of Bansal et al. (2014). We repeat AVTA 3 times and pick the set of vertices with the highest quality, where quality is measured by the sum of the distances of each vertex to the convex hull of the other vertices. We set the number of output vertices to \(K=50\), the same as the number of topics, i.e. each vertex corresponds to a topic. We found that initializing by simply assigning clusters using neighborhoods of highest degree vertices works effectively. As a final step, we use the post-processing step from Bansal et al. (2014) to recover the topic-word matrix from the clustering.

Table 5 Recovery error (simplex)

Robustness We also generate a perturbed version of the semi synthetic data. We generate a random matrix with i.i.d. entries uniformly distributed at different scales varying from 0.005 to 0.05, and test all the algorithms on the document-word matrix with this noise matrix added.

Results on semi synthetic data Figure 5d and e show the \(\ell _1\) reconstruction error of all four algorithms under both the clean and noisy versions of the semi synthetic data set. For topic i, let \(A_i\) be the ground truth topic vector and \(\widehat{A}_i\) be the topic vector recovered by the algorithm. Then the \(\ell _1\) error is defined as \(\frac{1}{K} \sum _{i=1}^K \Vert A_i - \widehat{A}_i\Vert _1\). The plots show that AVTA+CatchWord is consistently better than both TSVD and Fast Anchor + Recoverl2 and produces significantly more accurate topic vectors. To further test the robustness of our approach, we plot in Fig. 5f the range of the \(\ell _1\) error obtained across multiple runs of the algorithms on the same data set. The range is defined as the difference between the maximum and the minimum error recovered by the algorithm across different runs. We see that AVTA+CatchWord produces solutions that are much more stable to the effect of the noise than the other algorithms. Table 6 shows the running time of the four algorithms. As can be seen, when using AVTA to learn topic models via the anchor words approach, our algorithm has a running time comparable to Fast Anchor + Recoverl2. In CatchWord based learning, computing vertices is expensive compared to the K-SVD step of TSVD, so AVTA has a longer running time.

Table 6 Running time of algorithms on semi synthetic data (secs)

Results on real data Table 7 shows the topic coherence obtained by the algorithms. One can see that in both approaches, either via anchor words or via clustering, AVTA based algorithms perform comparably to state-of-the-art methods. The running time is presented in Table 8. AVTA+CatchWord has a lower running time in the real data experiments. Per our observation, the convex hull of the word-document vectors in real data sets has more vertices than K, the number of topics. AVTA catches K vertices efficiently due to its small number of iterations in the line search for \(\gamma \). In the semi-synthetic data set, the number of 'robust' vertices is approximately the same as the number of topics K, so AVTA needs to find almost all vertices. To catch enough vertices, AVTA needs several iterations of decreasing \(\gamma \), which is computationally expensive.

Table 7 Topic coherence on real data
Table 8 Running time on real data experiments (s)
Fig. 6 (a) An example of swimmer images. (b) An example of spurious actions in swimmer images. (c) Output of NMF + RecoverL2. (d) Output of AVTA + RecoverL2

9.4 Non-negative matrix factorization

AVTA for NMF For our experiments on NMF we use the Swimmer data set (Donoho and Stodden 2003), which consists of 256 swimmer figures, each a \(32 \times 32\) binary pixel image. One can interpret each image as a document and each pixel as a word (Ding et al. 2013). All swimmers consist of 4 limbs, with each limb having 4 different possible poses. One can then consider the different poses of the limbs as the true underlying topics (Donoho and Stodden 2003). We compare the algorithm proposed in Arora et al. (2012a) with AVTA+Recoverl2 on the swimmer data set. We construct a noisy version by adding spurious poses to the original swimmer data set. Let \(\varOmega (A)\) be a function that outputs a randomly chosen \(32 \times 8\) block of an image. We generate a 'spurious pose' of size \(32\times 8\) as \(\varOmega (M_i)\), where \(M_i\) is a randomly chosen swimmer image. Then we take another randomly chosen image \(M_j\) and compute the corrupted image as \(M'_j= M_j + c \cdot \varOmega (M_i)\), where we simply set \(c=0.1\). An illustration of the noisy data set is shown in Fig. 6b, and a sketch of the corruption procedure is given below. Since the true underlying topics are known, we plot the output of the algorithms and compare it with the ground truth.
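
A sketch of the corruption procedure just described (taking the 32×8 block to be a band of 8 columns is our assumption):

```python
import numpy as np

def corrupt_swimmer(images, c=0.1, seed=0):
    """images: (N, 32, 32) array. Adds c * Omega(M_i), a random 32x8 block of
    one randomly chosen image, to another randomly chosen image M_j."""
    rng = np.random.default_rng(seed)
    out = images.astype(float).copy()
    i, j = rng.choice(len(images), size=2, replace=False)
    col = rng.integers(0, 32 - 8 + 1)        # left edge of the 32x8 block
    block = np.zeros((32, 32))
    block[:, col:col + 8] = images[i][:, col:col + 8]   # Omega(M_i)
    out[j] += c * block                      # M'_j = M_j + c * Omega(M_i)
    return out
```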

Results on NMF We compare the performance of AVTA on these data sets with that of the Separable NMF algorithm proposed in Arora et al. (2012a). Figure 6c and d show the output of the Separable NMF algorithm and that of our algorithm, respectively, on the noisy data set. Our approach produces competitive results compared to the Separable NMF algorithm.

10 Conclusion

In this work we have presented a fast and robust algorithm for computing the vertices of the convex hull of a set of points. Our algorithm efficiently computes the vertices of convex hulls in high dimensions and, even in the special case of the simplex, is competitive with state-of-the-art approaches in terms of running time (Arora et al. 2013). Furthermore, our algorithm leads to an improved algorithm for topic modeling that is more robust and produces better approximations to the topic-word matrix. It would be interesting to provide theoretical claims supporting this observation in the context of specific applications. Furthermore, we believe that our algorithm will have more applications in machine learning problems beyond the ones investigated here, as well as applications in computational geometry and linear programming.