1 Introduction

A packing linear program (LP) takes the form \(\max \{c^T x \,:\, Ax \le b\}\) where \(c \in \mathbb {R}_{\ge 0}^n\), \(b \in \mathbb {R}_{\ge 0}^m\), and \(A \in \mathbb {R}_{\ge 0}^{m \times n}\). A covering LP can be written as \(\min \{b^T y\,:\, A^T y \ge c\},\) with the same requirements on A, b, and c. We denote by N the number of non-zero elements in the matrix A. We assume without loss of generality that the two programs are in their standard forms:

$$\begin{aligned} \text {Packing LP:}&\quad \max _{x \in \mathbb {R}_{\ge 0}^n } \{ {\mathbbm {1}}^{T} x \,:\, Ax \le {\mathbbm {1}}\} , \end{aligned}$$
(1.1)
$$\begin{aligned} \text {Covering LP:}&\quad \min _{y \in \mathbb {R}_{\ge 0}^m } \{ {\mathbbm {1}}^{T} y \,:\, A^T y \ge {\mathbbm {1}}\} . \end{aligned}$$
(1.2)

The two programs are dual to each other, so we denote by \(\mathsf {OPT}\ge 0\) their shared optimum. We say x is a \((1-\varepsilon )\)-approximation for the packing LP if \(Ax \le {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\), and y a \((1+\varepsilon )\)-approximation for the covering LP if \(A^T y \ge {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} y \le (1+\varepsilon )\mathsf {OPT}\).
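To make these definitions concrete, here is a minimal numeric sketch on a hypothetical \(2\times 2\) instance (the matrix and its optimum below are ours, for illustration only; the optimum can be verified by checking that both constraints are tight at \(x^*\)):

```python
# Feasibility and (1 - eps)-approximation check for a tiny packing LP
# max 1^T x  s.t.  Ax <= 1, x >= 0, on a hypothetical 2x2 instance.
A = [[1.0, 2.0],
     [3.0, 1.0]]

def is_feasible(A, x, tol=1e-9):
    """Return True iff x >= 0 and Ax <= 1 (entrywise, up to tol)."""
    if any(xi < -tol for xi in x):
        return False
    return all(sum(a * xi for a, xi in zip(row, x)) <= 1 + tol for row in A)

# The optimum of this instance is x* = (1/5, 2/5), where both constraints
# are tight, so OPT = 3/5.
x_star, OPT = [0.2, 0.4], 0.6
assert is_feasible(A, x_star)

# Scaling x* down by (1 - eps) preserves feasibility and loses exactly an
# eps fraction of the objective, so it is a (1 - eps)-approximation.
eps = 0.1
x = [(1 - eps) * xi for xi in x_star]
assert is_feasible(A, x)
assert sum(x) >= (1 - eps) * OPT - 1e-12
```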

In this paper, we study first-order iterative methods for solving packing and covering linear programs (PC-LPs) efficiently. Of course, it is possible to adapt the Interior Point or Ellipsoid methods to obtain approximate solvers whose number of iterations depends only logarithmically on \(1/\varepsilon \). However, the computational cost of such algorithms is typically high, as each iteration requires solving a linear system, making them unsuitable for large-scale applications.

To address this issue, researchers have developed iterative approximate PC-LP solvers that achieve a better dependence on the problem size (e.g., nearly linear in N) at the cost of having a \(\textsf {poly}(1/\varepsilon )\) dependence on the approximation parameter \(\varepsilon \). Such iterative solvers have been widely applied in approximation algorithms (e.g., MinSetCover [24], MaxSet, MaxDiCut, Max-k-CSP [32], bipartite matching), probabilistically checkable proofs [32], zero-sum matrix games [29], scheduling [31], graph embedding [31], flow controls [10, 11], auction mechanisms [37], wireless sensor networks [14], and many other areas. In addition, techniques developed in this line of research have inspired important results on other fundamental algorithmic problems, such as the design of fast algorithms for multi-commodity flow problems [9, 18, 19, 25, 31] and the equivalence between QIP and PSPACE [21].

Table 1 Comparisons among iterative approximate solvers for packing and covering LPs

Previous iterative approximate solvers can be divided into two classes: width-dependent and width-independent solvers (see also Table 1).

Width-dependent solvers Based on multiplicative weight update ideas (a.k.a. exponentiated gradient updates), researchers have obtained solvers for PC-LPs whose running times are at least N multiplied by \(\rho \mathsf {OPT}\in [1,\infty )\), where \(\rho \) is the width of the program, i.e., the largest entry of the matrix A. For instance, PC-LPs can be solved in \(O(\frac{N \rho ^2 \mathsf {OPT}^2 \log m }{\varepsilon ^2})\) time [31], or in \(O(\frac{N \rho \mathsf {OPT}\log m }{\varepsilon ^2})\) time using a more refined analysis [7]. These algorithms only require “oracle access” to the matrix A. When A is given explicitly, as in this paper, the running time can be reduced to \(O(\frac{N \rho \mathsf {OPT}\log m}{\varepsilon })\) by deploying Nesterov’s accelerated gradient method [29] or Nemirovski’s mirror prox method [26].

Since the factor \(\rho \mathsf {OPT}\) can be arbitrarily large, width-dependent algorithms run in pseudo-polynomial, rather than polynomial, time.

Width-independent, but super linear-time solvers Researchers have also tried to appropriately scale the matrix so as to avoid the width penalty in the above methods. For instance, Bienstock and Iyengar [13] built on Nesterov’s method [29] and obtained a running time of \(O(\varepsilon ^{-1} N \sqrt{Kn\log m})\), where K is the maximum number of non-zeros per row of A. This is \(O(\varepsilon ^{-1} N n \sqrt{\log m})\) in the worst case. The results of [15, 27] improved this complexity (for packing LP only) to \(\widetilde{O}(\varepsilon ^{-1} N \sqrt{n})\), at the cost of an \(\widetilde{O}(N n)\)-time preprocessing stage.

Width-independent, nearly linear-time solvers Perhaps the most desirable complexity is a running time that is both independent of the width parameter \(\rho \) and nearly linear in N. This line of research was initiated by a seminal paper of Luby and Nisan [24], who gave an algorithm running in \(O\big (\frac{ N \log ^2 N }{\varepsilon ^4}\big )\) time with no dependence on the width \(\rho \). This was the first nearly linear-time approximate solver for PC-LPs, and also the first to run in parallel with nearly linear work and polylogarithmic depth.

The parallel algorithm of Luby and Nisan was extended by [4, 8, 10, 33, 35]. Most notably, the algorithm of Wang et al. [33] runs in \(O(\frac{\log ^2 N \log (1/\varepsilon )}{\varepsilon ^2})\) iterations, each costing a matrix-vector multiplication that can be implemented in O(N) total work.

The ideas of Luby and Nisan also led to sequential width-independent, nearly linear-time PC-LP solvers [10, 11, 23, 35, 36]. Most notably, the algorithm of Koufogiannakis and Young [23] runs in time \(O\big (N + \frac{ \log N }{\varepsilon ^2} \times (n + m)\big )\).

Despite the amount of work in this area, the \(O(1/\varepsilon ^2)\) convergence rate was established in 1997 [10, 11] and has not been improved since. On a separate note, Klein and Young [22] showed that all Dantzig–Wolfe type algorithms must suffer an \(\varOmega (1/\varepsilon ^2)\) convergence rate. This lack of progress constitutes a significant limitation, as the \(\varepsilon ^{-2}\) dependence (also known as \(1/\sqrt{T}\) convergence) on the approximation parameter \(\varepsilon \) is particularly poor.

1.1 Our results

Packing LP We present an algorithm \(PacLPSolver\) that runs in \(O(\frac{\log (nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon } N) \) total time. This gives the first width-independent, nearly linear-time solver for packing LP with \(\varepsilon ^{-1}\) convergence (i.e., a 1/T convergence). In contrast, before our work no nearly linear-time algorithm had achieved any convergence rate faster than \(\varepsilon ^{-2}\).

Interestingly, maximum (weighted) bipartite matching is a special case of a packing LP. As a consequence, our \(PacLPSolver\) algorithm finds an approximate maximum bipartite matching in time \(\widetilde{O}(m \varepsilon ^{-1})\). This new matching algorithm, which arises purely from convex-optimization arguments, matches the running time of the best known combinatorial algorithm for maximum weighted bipartite matching [16].

Covering LP A symmetric design of \(PacLPSolver\) gives rise to an algorithm \(CovLPSolver^{\mathsf {wb}}\) with the same running time \(O(\frac{\log (nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon } N)\), but solving only well-behaved covering LP instances. At a high level, we say an instance is well-behaved if the constraint \(A^T y \ge {\mathbbm {1}}\) is “never redundant”: for instance, if the optimal solution \(y^*\) satisfies \(C \cdot {\mathbbm {1}}\ge A^T y^* \ge {\mathbbm {1}}\) for some constant \(C>1\), then the covering LP is well-behaved. For general covering LPs without the well-behavedness assumption, we propose a different algorithm \(CovLPSolver\) that runs in time \(O(\frac{\log (nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon ^{1.5}} N)\). Again, we emphasize that before our work no nearly linear-time covering LP solver had achieved a convergence rate faster than \(\varepsilon ^{-2}\) (or equivalently, \(O(1/\sqrt{T})\)).

Remark After the first version of this paper appeared on arXiv in 2014, Wang, Rao and Mahoney [34] showed that all covering LPs can be converted into well-behaved ones by blowing up the problem size logarithmically. In other words, they obtained a nearly linear-time covering LP solver with \(\varepsilon ^{-1}\) convergence by a reduction to \(CovLPSolver^{\mathsf {wb}}\). Nevertheless, our \(CovLPSolver\), being a direct method, may still be of practical and theoretical interest.

1.2 Main challenge and our approach

Width-independence versus acceleration Previous solvers for PC-LPs are based on standard techniques in non-smooth optimization. They first implicitly or explicitly smooth the objective, often via the entropy regularizer. Then, they minimize the resulting convex objective either via variants of full-gradient methods, yielding parallel algorithms, or via variants of coordinate-gradient methods, yielding sequential algorithms. The main challenge in previous work was to show that the width dependence can sometimes be completely removed for PC-LPs, if the underlying minimization method is designed cleverly.

Of course, the slower the convergence rate, the easier it is to design nearly linear-time solvers. The \(\varepsilon ^{-4}\)-convergence solver of Awerbuch and Khandekar [8] and the \(\varepsilon ^{-3}\)-convergence solver of [4] are arguably the simplest nearly linear-time solvers known to date.

In this paper, we achieve the \(\varepsilon ^{-1}\) convergence that is typical for accelerated gradient descent over smoothed objectives [29], but without paying the width or any additional super-logarithmic factors. The challenge in this approach is to preserve width-independence and the accelerated rate at the same time. We stress here that our algorithm is not an instance of any known variant of accelerated gradient descent. Moreover, combining width-independence with Nesterov’s acceleration requires significant effort, as witnessed by the lack of progress on this problem over the last 15 years.

Our high-level approach Our approach is based on an improved convex formulation f(x) of the PC-LP objective, together with our linear-coupling framework for designing efficient first-order methods [5] for minimizing f(x).

The improved formulation shows that our smoothed objective f(x) satisfies either the classical condition of Lipschitz smoothness or a different condition based on multiplicative change. This formulation also clarifies why width-independent algorithms exist in the first place. See Lemma 2.7 and the related discussion for more details.

The linear-coupling framework in our previous work [5] provides a different interpretation of Nesterov’s acceleration for smooth optimization [28]. In a nutshell, this framework allows us to construct accelerated algorithms by coupling the executions of a gradient descent algorithm, yielding iterates \(\{\mathsf {y}_k\}\), and a mirror descent algorithm, yielding iterates \(\{\mathsf {z}_k\}\). The name “linear coupling” stems from the fact that, at iteration \(k+1,\) the gradient of the objective is queried at a point \(\mathsf {x}_{k+1}\) which is a linear combination of the gradient and mirror iterates, i.e., \(\mathsf {x}_{k+1} = \tau \cdot \mathsf {z}_k + (1-\tau ) \cdot \mathsf {y}_k\).

In this paper, we apply linear coupling in a very non-trivial manner. We design a gradient step and a mirror descent step, each very specific to the underlying PC-LP problem. We also perform a coupling step \(\mathsf {x}_{k+1} = \tau \cdot \mathsf {z}_k + (1-\tau ) \cdot \mathsf {y}_k\), but need a different analysis to preserve width-independence. None of these components appeared in [5].

Arithmetic precision Throughout this paper, we assume exact arithmetic operations for presenting the cleanest proofs. If the updates are calculated within precision \(\frac{1}{\textsf {poly}(\varepsilon ^{-1},n,m)}\), or equivalently when word size \(O(\log (\varepsilon ^{-1}+n+m))\) is used, our results still hold.

Roadmap We relax the packing LP in Sect. 2, and provide our packing LP solver in Sect. 3. We relax the covering LP in Sect. 4, and provide our covering LP solver in the well-behaved case in Sect. 5. In Sect. 6, we provide our full covering LP solver.

2 Relaxation of the packing linear program

To solve the packing LP, we minimize a relaxed version of the original program, in which the hard constraint \(Ax \le {\mathbbm {1}}\) is replaced by an exponential penalty function arising from entropy regularization.

Notations Recall that the packing LP in its standard form is \( \max _{x \ge 0} \{ {\mathbbm {1}}^{T} x \,:\, Ax \le {\mathbbm {1}}\}\). Let us denote by \(\mathsf {OPT}\) the optimal value of this linear program, and \(x^*\) any optimal solution. We say that x is a \((1-\varepsilon )\)-approximation for the packing LP if \(Ax \le {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\).

Throughout this paper, we use the indices \(i\in [n]\) to denote the columns of A, and the indices \(j\in [m]\) to denote the rows of A. We let \(A_{:i}\) be the i-th column vector of A, and \(A_{j :}\) the j-th row vector of A. Given any vector x, we denote by \(\Vert x\Vert _A = \sqrt{\sum _{i\in [n]} x_i^2 \cdot \Vert A_{:i}\Vert _{\infty }}\) the A-norm of x. By simple scaling, we can assume without loss of generality that

$$\begin{aligned} \min _{i\in [n]}\{\Vert A_{:i}\Vert _{\infty }\}=1 . \end{aligned}$$
(2.1)

We restrict the domain of x and the range of \(\mathsf {OPT}\) as follows.

Fact 2.1

Define the bounding box \(\varDelta _{\mathsf {box}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ x \in \mathbb {R}^n \,:\, x_i \in \big [0, \frac{1}{\Vert A_{:i}\Vert _{\infty }}\big ] \}\). Under assumption (2.1), we have \(\mathsf {OPT}\in [1,n]\) and \(\{x : x\ge 0 \wedge Ax \le {\mathbbm {1}}\} \subseteq \varDelta _{\mathsf {box}}.\)

Proof

Suppose that \(i^*\) is the column achieving the smallest infinity norm \(\Vert A_{:i}\Vert _{\infty }\) over all columns. Letting x be the vector with \(x_i=1\) for \(i=i^*\) and \(x_i=0\) for \(i\ne i^*\), we claim that x is a feasible solution for the packing LP (1.1), simply because \(\Vert A_{:i^*}\Vert _{\infty } = 1\) according to (2.1). This feasible solution x yields an objective value \({\mathbbm {1}}^{T} x = 1\), proving that \(\mathsf {OPT}\ge 1\). On the other hand, for any solution \(x \ge 0\) satisfying \(Ax \le {\mathbbm {1}}\), we must have \(x_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) for each i. Therefore, using \(\Vert A_{:i}\Vert _{\infty }\ge 1\) from (2.1), we have \({\mathbbm {1}}^{T} x \le \sum _i \frac{1}{\Vert A_{:i}\Vert _{\infty }} \le n\), proving that \(\mathsf {OPT}\le n\).

The inclusion \(\{x : x\ge 0 \wedge Ax \le {\mathbbm {1}}\} \subseteq \varDelta _{\mathsf {box}}\) is obvious, since the constraints \(x\ge 0\) and \(Ax \le {\mathbbm {1}}\) together imply \(x_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) for every \(i\in [n]\). \(\square \)

This bounding-box constraint allows us to focus only on searching x in \(\varDelta _{\mathsf {box}}\).

Our regularized objective We now introduce the smoothed objective \(f_\mu (x)\) that we minimize over \(\varDelta _{\mathsf {box}}\) in order to approximately solve packing LP. At a high level, this objective \(f_\mu (x)\) turns each row of the hard, non-smooth LP constraint \(Ax \le {\mathbbm {1}}\) into an exponential penalty function so that we only need to require \(x \in \varDelta _{\mathsf {box}}\) throughout the algorithm.

Formally, the packing LP can be written as the following minimization problem by introducing the Lagrangian variable \(y \in \mathbb {R}^m\):

$$\begin{aligned}&\min _{x \in \varDelta _{\mathsf {box}}} \big \{ - {\mathbbm {1}}^{T} x + \max _{y \ge 0} \{y^T A x - {\mathbbm {1}}^{T}y \} \big \} . \end{aligned}$$
(2.2)

The problem can now be smoothed by introducing a concave regularizer over \(y \ge 0\). We take this regularizer to be the generalized entropy \(H(y) = \sum _{j=1}^m (- y_j \log y_j + y_j)\) over the nonnegative orthant \(y\ge 0\), and minimize the following smoothed objective \(f_\mu (x)\) over \(x \in \varDelta _{\mathsf {box}}\):

$$\begin{aligned} f_\mu (x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}- {\mathbbm {1}}^{T} x + \max _{y \ge 0} \{y^T A x - {\mathbbm {1}}^{T}y + \boxed {\mu \cdot H(y)} \} . \end{aligned}$$
(2.3)

Above, \(\mu > 0\) is some smoothing parameter to be chosen later. By explicitly computing the maximization over \(y\ge 0\), \(f_\mu (x)\) can be rewritten as

Fact 2.2

\( f_\mu (x) = \mu \sum _{j=1}^m e^{\frac{1}{\mu } ((Ax)_j - 1)} - {\mathbbm {1}}^{T} x .\)

We study the minimization of \(f_\mu (x)\) over \(x \in \varDelta _{\mathsf {box}}\). Intuitively, \(f_\mu (x)\) captures the original packing LP (1.1) as follows. Firstly, since we want to maximize \({\mathbbm {1}}^{T} x\), the negative term \(-{\mathbbm {1}}^{T} x\) shows up in \(f_\mu (x)\). Secondly, if a packing constraint \(j\in [m]\) is violated by \(\varepsilon \), that is, \((Ax)_j \ge 1+\varepsilon \), the exponential in \(f_\mu (x)\) introduces a penalty of at least \(\mu e^{\varepsilon /\mu }\); this is a large penalty whenever \(\mu \le O(\varepsilon / \log (n/\varepsilon ))\).
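The closed form in Fact 2.2 is easy to evaluate directly. The sketch below (with a hypothetical matrix of ours, normalized so that (2.1) holds, and \(\mu \) set as in Proposition 2.4) illustrates the two regimes just described: a strictly feasible point pays an exponentially small penalty, while violating a constraint by \(\varepsilon \) makes the penalty term dominate.

```python
import math

def f_mu(A, x, mu):
    """Fact 2.2: f_mu(x) = mu * sum_j exp(((Ax)_j - 1)/mu) - 1^T x."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return mu * sum(math.exp((axj - 1.0) / mu) for axj in Ax) - sum(x)

# Hypothetical 2x2 instance; every column has infinity norm exactly 1,
# so assumption (2.1) holds.
A = [[1.0, 0.5],
     [0.25, 1.0]]
n = m = 2
eps = 0.1
mu = eps / (4 * math.log(n * m / eps))  # the choice in Proposition 2.4

x_feas = [0.5, 0.5]        # Ax = (0.75, 0.625): strictly feasible
x_viol = [1.0 + eps, 0.0]  # (Ax)_1 = 1 + eps: violated by eps

assert f_mu(A, x_feas, mu) < -0.99   # essentially -1^T x: tiny penalty
assert f_mu(A, x_viol, mu) > 1.0     # penalty mu * e^(eps/mu) dominates
```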

Remark 2.3

The use of the exponential penalty function traces back at least to [31] in 1991 (implicitly) and to [20] in 1994 (explicitly). The way most previous results minimize \(f_\mu (x)\) is by taking a logarithm \(g(x) = \log \big ( \sum _{j=1}^m e^{((Ax)_j - 1)/\mu } \big )\), explicitly or implicitly arguing that g(x) is Lipschitz smooth (i.e., \(\Vert \nabla ^2 g(x)\Vert \) is bounded), and then applying gradient descent. Unfortunately, the Lipschitz smoothness parameter of g(x) depends on the width of the LP, and thus first-order iterative approaches based on directly minimizing g(x) are mostly width-dependent [7, 26, 29, 31]. One can also reduce the width parameter of g(x), which yields super linear-time solvers [13, 15, 27].

In this paper, we directly perform gradient descent and mirror descent on \(f_\mu (x)\), without taking the logarithm. Note that traditional accelerated gradient methods [28, 29] cannot be applied directly to minimize \(f_\mu \), because \(f_\mu \) is not Lipschitz smooth.

Our \(f_\mu (x)\) incurs a regularization error. The next proposition bounds this error, following a treatment similar to that of [4].

Proposition 2.4

Let \(\mu = \frac{\varepsilon }{4\log (nm/\varepsilon )}\) and recall \(x^*\) is an optimal solution for packing LP.

(a) \(f_\mu (u^*) \le - (1-\varepsilon )\mathsf {OPT}\) for \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1-\varepsilon /2) x^* \in \varDelta _{\mathsf {box}}\).

(b) \(f_\mu (x) \ge -(1+\varepsilon )\mathsf {OPT}\) for every \(x \in \varDelta _{\mathsf {box}}\).

(c) If \(x\in \varDelta _{\mathsf {box}}\) satisfies \(f_\mu (x) \le -(1-\theta )\mathsf {OPT}\) for some \(\theta \in [0,1]\), then \(\frac{1}{1+\varepsilon } x\) is a \(\frac{1-\theta }{1+\varepsilon }\)-approximate solution to the packing LP.
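These bounds can also be verified numerically. Below is a minimal sketch on a hypothetical instance of ours satisfying (2.1), whose optimum \(x^* = (4/7, 6/7)\) with \(\mathsf {OPT}= 10/7\) can be checked by hand (both rows are tight at \(x^*\)):

```python
import math

A = [[1.0, 0.5],
     [0.25, 1.0]]              # hypothetical; both columns have inf-norm 1
x_star, OPT = [4/7, 6/7], 10/7  # optimum of this instance (both rows tight)
n = m = 2
eps = 0.1
mu = eps / (4 * math.log(n * m / eps))

def f_mu(x):
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return mu * sum(math.exp((axj - 1.0) / mu) for axj in Ax) - sum(x)

# (a): u* = (1 - eps/2) x* satisfies f_mu(u*) <= -(1 - eps) OPT.
u = [(1 - eps / 2) * xi for xi in x_star]
assert f_mu(u) <= -(1 - eps) * OPT

# (b): f_mu stays above -(1 + eps) OPT at sample points of the box
# (here the box is [0,1]^2 since all column norms equal 1).
for x in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]):
    assert f_mu(x) >= -(1 + eps) * OPT

# (c): scaling u by 1/(1 + eps) gives a feasible, near-optimal point.
x_prime = [xi / (1 + eps) for xi in u]
Ax = [sum(a * xi for a, xi in zip(row, x_prime)) for row in A]
assert all(axj <= 1 + 1e-9 for axj in Ax)
```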

Remark 2.5

Our box constraint \(x \in \varDelta _{\mathsf {box}}\) is almost redundant for minimizing \(f_\mu (x)\): whenever \(x\ge 0\) and \(f_\mu (x) \le 0\), one should automatically have \(x_i \le \frac{1+\varepsilon }{\Vert A_{:i}\Vert _{\infty }}\). However, this constraint shall be used to make sure that our updates are always inside \(\varDelta _{\mathsf {box}}\).

Proof of Proposition 2.4

(a) We have \({\mathbbm {1}}^{T} u^* = (1-\varepsilon /2)\mathsf {OPT}\) by the definitions of \(u^*\) and \(\mathsf {OPT}\). From the feasibility \(A x^* \le {\mathbbm {1}}\) in the packing LP, we have \(A u^* - {\mathbbm {1}}\le -\varepsilon /2 \cdot {\mathbbm {1}}\), and can compute \(f_\mu (u^*)\) as follows:

$$\begin{aligned} f_\mu (u^*)&= \mu \sum _j e^{\frac{1}{\mu } ((A u^*)_j - 1)} - {\mathbbm {1}}^{T} u^*\\&\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} - (1-\varepsilon /2)\mathsf {OPT}\\&\le \frac{\mu m}{(nm)^2} - (1-\varepsilon /2)\mathsf {OPT}\le -(1-\varepsilon )\mathsf {OPT}. \end{aligned}$$
(b) Suppose towards contradiction that \(f_\mu (x) < -(1+\varepsilon )\mathsf {OPT}\). Since \(f_\mu (x) > -{\mathbbm {1}}^{T} x\), it must satisfy \({\mathbbm {1}}^{T} x > (1+\varepsilon )\mathsf {OPT}\). Suppose that \({\mathbbm {1}}^{T} x = (1+v)\mathsf {OPT}\) for some \(v>\varepsilon \). By the definition of \(\mathsf {OPT}\), the inequality \(A x < (1+v) {\mathbbm {1}}\) must fail (otherwise, scaling x down by \(\max _j (Ax)_j < 1+v\) would yield a feasible solution of value greater than \(\mathsf {OPT}\)), and therefore there exists some \(j \in [m]\) satisfying \((Ax)_j \ge 1+v\). In such a case, the objective

$$\begin{aligned} f_\mu (x)\ge & {} \mu e^{v/\mu } - (1+v)\mathsf {OPT}= \frac{\varepsilon }{4\log (nm/\varepsilon )} \left( \left( \frac{nm}{\varepsilon }\right) ^4\right) ^{v/\varepsilon } - (1+v)\mathsf {OPT}\\\ge & {} \left( \left( \left( \frac{nm}{\varepsilon }\right) ^2\right) ^{v/\varepsilon } - (1+v) \right) \mathsf {OPT}> 0 \end{aligned}$$

giving a contradiction to the assumption that \(f_\mu (x)<0\).

(c) Note that x satisfies \(f_\mu (x)\le -(1-\theta )\mathsf {OPT}\le 0\), and we first show \(Ax \le (1+\varepsilon ){\mathbbm {1}}\). Let us assume that \(v = \max _j ((Ax)_j - 1) \ge 0\), because otherwise we would already have \(Ax \le {\mathbbm {1}}\). Under this definition, we have \(Ax \le (1+v) {\mathbbm {1}}\) and therefore \({\mathbbm {1}}^{T} x \le (1+v)\mathsf {OPT}\) by the definition of \(\mathsf {OPT}\). We compute \(f_\mu (x)\) as follows.

$$\begin{aligned}&f_\mu (x) \ge \mu e^{\frac{v}{\mu }} - (1+v)\mathsf {OPT}\ge \mu \left( \left( \frac{nm}{\varepsilon }\right) ^4\right) ^{v/\varepsilon } - (1+v)n \\&\quad = \frac{\varepsilon }{4\log (nm/\varepsilon )} \left( \left( \frac{nm}{\varepsilon }\right) ^4\right) ^{v/\varepsilon } - (1+v)n . \end{aligned}$$

The above quantity is positive whenever \(v \ge \varepsilon \), and therefore, to satisfy \(f_\mu (x)\le 0\) we must have \(v\le \varepsilon \), which is equivalent to \(Ax \le (1+\varepsilon ){\mathbbm {1}}\). Next, because \(-{\mathbbm {1}}^{T} x \le f_\mu (x) \le -(1-\theta )\mathsf {OPT}\), we know \({\mathbbm {1}}^{T} x \ge (1-\theta )\mathsf {OPT}\). Letting \(x' = \frac{1}{1+\varepsilon } x\), we have both that \(x'\) is feasible (i.e., \(Ax' \le {\mathbbm {1}}\)) and that \(x'\) has an objective \({\mathbbm {1}}^{T} x'\) at least \( \frac{1-\theta }{1+\varepsilon }\mathsf {OPT}\).

\(\square \)

Some non-standard smoothness properties The gradient and Hessian of \(f_\mu (x)\) can be written in the following closed forms:

Fact 2.6

\(\nabla f_\mu (x) = A^T p(x) - {\mathbbm {1}}\) and \(\nabla ^2 f_\mu (x) = \frac{1}{\mu } A^T \mathrm {diag}\{p(x)\}A\), where \(p_j(x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}e^{\frac{1}{\mu } ((Ax)_j - 1)}\).
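The closed-form gradient of Fact 2.6 can be checked numerically against central finite differences. A self-contained sketch on a hypothetical small instance (matrix, point, and \(\mu \) chosen by us for illustration):

```python
import math

def f_mu(A, x, mu):
    """Fact 2.2: f_mu(x) = mu * sum_j exp(((Ax)_j - 1)/mu) - 1^T x."""
    Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return mu * sum(math.exp((axj - 1.0) / mu) for axj in Ax) - sum(x)

def grad_f_mu(A, x, mu):
    """Fact 2.6: nabla f_mu(x) = A^T p(x) - 1,
    with p_j(x) = exp(((Ax)_j - 1)/mu)."""
    m, n = len(A), len(A[0])
    Ax = [sum(A[j][i] * x[i] for i in range(n)) for j in range(m)]
    p = [math.exp((Ax[j] - 1.0) / mu) for j in range(m)]
    return [sum(A[j][i] * p[j] for j in range(m)) - 1.0 for i in range(n)]

# Sanity check against a central finite difference.
A = [[1.0, 0.5], [0.25, 1.0]]
x, mu, h = [0.6, 0.7], 0.05, 1e-6
g = grad_f_mu(A, x, mu)
for i in range(2):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    fd = (f_mu(A, xp, mu) - f_mu(A, xm, mu)) / (2 * h)
    assert abs(g[i] - fd) < 1e-4
```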

By staring at these closed forms, we note that \(f_\mu (x)\) is not Lipschitz-smooth: for instance, each \(\nabla ^2_{ii} f_\mu (x)\) can go to infinity so the spectral norm of \(\nabla ^2 f_\mu (x)\) is unbounded. However, the non-negativity of A guarantees that whenever \(\nabla ^2_{ii} f_\mu (x)\) is large for some coordinate i, the corresponding entry of the gradient \(\nabla _i f_\mu (x)\) must also be large. This still allows us to take a larger step in direction \(\mathbf {e}_i\) than traditionally allowed by coordinate descent.

The above intuition is formalized in the next lemma, whose proof is a simple manipulation of the Hessian. The first half of the lemma is the same as the traditional coordinate Lipschitz-smoothness property, but holds only conditionally; the second half is a salient characteristic of this work and requires the non-negativity of A. These smoothness properties will be crucial in applying the gradient descent arguments of Sect. 3.3, and are the main motivation for us to adopt the \(\Vert \cdot \Vert _A\) norm in our proposed algorithms.

Lemma 2.7

Let \(L {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{4}{\mu }.\) Then, for every \(x \ge 0\), every \(i \in [n]\), and every \(\lambda \in \big [- \frac{1}{L \Vert A_{:i}\Vert _{\infty }} , \frac{1}{L \Vert A_{:i}\Vert _{\infty }}\big ]\):

(a) If \(|\nabla _i f_\mu (x)| \le 1,\) then \( \big |\nabla _i f_\mu (x + \lambda \mathbf {e}_i) - \nabla _i f_\mu (x) \big | \le L \Vert A_{:i}\Vert _{\infty }\cdot | \lambda | . \)

(b) If \(\nabla _i f_\mu (x) \ge 1,\) then \( \nabla _i f_\mu (x + \lambda \mathbf {e}_i) \ge \left( 1-\frac{\Vert A_{:i}\Vert _{\infty }L}{2} |\lambda |\right) \nabla _i f_\mu (x) . \)

Proof of Lemma 2.7

Using the fact that \(\nabla _i f_\mu (x) > -1\) for all x,  we have:

$$\begin{aligned} \Big | \log \frac{\nabla _i f_\mu (x + \lambda \mathbf {e}_i) +1}{\nabla _i f_\mu (x)+1} \Big |&\overset{\textcircled {{\small 1}}}{=} \Big | \int ^{\lambda }_0 \frac{\nabla ^2_{ii} f_\mu (x+\nu \mathbf {e}_i)}{\nabla _i f_\mu (x+\nu \mathbf {e}_i) +1} d \nu \Big | \\&\overset{\textcircled {{\small 2}}}{=} \frac{1}{\mu } \Big | \int ^{\lambda }_0 \frac{(A^T \mathrm {diag}\{p(x + \nu \mathbf {e}_i)\} A)_{ii}}{(A^T p(x + \nu \mathbf {e}_i))_i}d \nu \Big |\\&\overset{\textcircled {{\small 3}}}{\le }\frac{\Vert A_{:i}\Vert _{\infty }}{\mu } | \lambda | \overset{\textcircled {{\small 4}}}{=} \frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda | . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\) holds because \(\int _{0}^\lambda g'(\nu ) d \nu = g(\lambda ) - g(0)\) where \(g(\nu ) = \log (\nabla _i f_\mu (x+\nu \mathbf {e}_i)+1)\); \(\textcircled {{\small 2}}\) holds according to Fact 2.6; \(\textcircled {{\small 3}}\) holds because the numerator is \(\sum _j A_{j,i}^2 p_j\) while the denominator is \(\sum _j A_{j,i} p_j\), and \(A_{j,i} \le \Vert A_{:i}\Vert _{\infty }\) for every j; \(\textcircled {{\small 4}}\) holds because \(L = \frac{4}{\mu }\). This immediately implies

$$\begin{aligned} e^{-\frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda |} \le \frac{\nabla _i f_\mu (x + \lambda \mathbf {e}_i) +1}{\nabla _i f_\mu (x)+1} \le e^{\frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda |}. \end{aligned}$$

Our assumption on \(\lambda \) implies \(\frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda | \le \frac{1}{4},\) so that we can use the bounds \(e^{-c} - 1 \ge -c\) and \(e^{c} - 1 \le 1.2 c\), valid for \(c \in [0, \frac{1}{4}]\). This yields the simpler bound:

$$\begin{aligned} - \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda | \le \frac{\nabla _i f_\mu (x + \lambda \mathbf {e}_i) - \nabla _i f_\mu (x)}{\nabla _i f_\mu (x)+1} \le 1.2 \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda |. \end{aligned}$$
(a) Assuming that \(\nabla _i f_\mu (x) \in (-1,1],\) we have:

    $$\begin{aligned} \Big |\nabla _i f_\mu (x + \lambda \mathbf {e}_i) - \nabla _i f_\mu (x)\Big | \le 2.4 \cdot \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda | \le \Vert A_{:i}\Vert _{\infty }L |\lambda | . \end{aligned}$$
(b) Assuming \(\nabla _i f_\mu (x) \ge 1\), we have

    $$\begin{aligned} \nabla _i f_\mu (x+\lambda \mathbf {e}_i)\ge & {} \nabla _i f_\mu (x) - \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda | \left( \nabla _i f_\mu (x) + 1 \right) \\\ge & {} \left( 1-\frac{\Vert A_{:i}\Vert _{\infty }L}{2} |\lambda |\right) \nabla _i f_\mu (x) . \end{aligned}$$

    \(\square \)

Initialization Iterative methods require a starting point, and we use the following one:

Fact 2.8

Let \(x^{\mathsf {start}}_i {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{1-\varepsilon /2}{n\Vert A_{:i}\Vert _{\infty }}\) for each \(i\in [n]\). Then, \(x^{\mathsf {start}}\in \varDelta _{\mathsf {box}}\) and \(f_\mu (x^{\mathsf {start}}) \le -\frac{1-\varepsilon }{n}\).

Proof

Using the fact that \(Ax^{\mathsf {start}}- {\mathbbm {1}}\le -\varepsilon /2 \cdot {\mathbbm {1}}\), we compute \(f_\mu (x^{\mathsf {start}})\) as follows:

$$\begin{aligned} f_\mu (x^{\mathsf {start}})&= \mu \sum _j e^{\frac{1}{\mu } ((Ax^{\mathsf {start}})_j - 1)} - {\mathbbm {1}}^{T} x^{\mathsf {start}}\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} - \frac{1-\varepsilon /2}{n} \\&\le \frac{\mu m}{(nm)^2} - \frac{1-\varepsilon /2}{n} \le -\frac{1-\varepsilon }{n} . \end{aligned}$$

Above, we have used \({\mathbbm {1}}^{T} x^{\mathsf {start}}\ge x^{\mathsf {start}}_{i} = \frac{1-\varepsilon /2}{n}\), where i is the column s.t. \(\Vert A_{:i}\Vert _{\infty }=1\). \(\square \)

3 Our packing LP solver

Recall that traditional gradient descent (accelerated or not) [28, 29] and coordinate descent [6, 17, 30] should not be applied directly to minimize \(f_\mu \), because \(f_\mu \) is not Lipschitz-smooth.

Our proposed algorithm \(PacLPSolver\) starts with some initial vector \(\mathsf {x}_0 = \mathsf {y}_0 = x^{\mathsf {start}}\) (see Fact 2.8) and \(\mathsf {z}_0 = 0\), and is divided into T iterations. In each iteration k, it computes a weighted midpoint \(\mathsf {x}_k \leftarrow \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) for some parameter \(\tau \in (0,1)\). This step is analogous to that in traditional accelerated coordinate descent [6, 17, 30]. We then compute \(\mathsf {y}_k\) and \(\mathsf {z}_k\) as follows.

We select \(i\in [n]\) uniformly at random. Let \(\xi _k^{(i)} = (0,\dots ,0,\mathbb {T}^\mathsf {p}(v),0,\dots ,0)\) be the vector that is non-zero only at coordinate i, where \(v = \nabla _i f_\mu (\mathsf {x}_k) \in [-1,\infty )\), and \(\mathbb {T}^\mathsf {p}(v)\) is the thresholding function \(\mathbb {T}^\mathsf {p}(v) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\min \{v, 1\}\). We refer to \(\xi _k^{(i)}\) as the truncated gradient. Next,

  • Perform a mirror (descent) step: \(\mathsf {z}_k \leftarrow \mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) for some parameter \(\alpha _k \ll 1/n\) to be chosen later.

  • Perform a gradient (descent) step: \(\mathsf {y}_k \leftarrow \mathsf {y}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mathsf {x}_k + \frac{1}{n \alpha _k L} (\mathsf {z}_k^{(i)} - \mathsf {z}_{k-1})\).

This finishes the description of our \(PacLPSolver\).
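The loop above can be sketched in a few lines of pure Python. This is only a schematic rendering, not the paper's algorithm verbatim: the parameters \(\tau \), \(\alpha _k\) and T are illustrative placeholders (Algorithm 1 fixes their precise values, and \(\alpha _k\) actually varies with k), and each iteration here naively recomputes Ax in O(N) time instead of the O(N/n) implementation of Lemma 3.3.

```python
import math, random

def pac_lp_solver_sketch(A, eps, T, tau, alpha):
    """Schematic PacLPSolver iteration: weighted midpoint, truncated
    gradient, mirror step (Proposition 3.2), gradient step."""
    m, n = len(A), len(A[0])
    mu = eps / (4 * math.log(n * m / eps))
    L = 4.0 / mu
    col_inf = [max(A[j][i] for j in range(m)) for i in range(n)]  # ||A_{:i}||_inf
    # Starting point of Fact 2.8.
    x_start = [(1 - eps / 2) / (n * col_inf[i]) for i in range(n)]
    y, z = list(x_start), [0.0] * n
    for _ in range(T):
        # Weighted midpoint x_k = tau * z_{k-1} + (1 - tau) * y_{k-1}.
        x = [tau * z[i] + (1 - tau) * y[i] for i in range(n)]
        # Pick i uniformly at random; compute the truncated gradient.
        i = random.randrange(n)
        Ax = [sum(A[j][k] * x[k] for k in range(n)) for j in range(m)]
        p = [math.exp((Ax[j] - 1.0) / mu) for j in range(m)]
        grad_i = sum(A[j][i] * p[j] for j in range(m)) - 1.0
        xi = min(grad_i, 1.0)                      # T^p(v) = min{v, 1}
        # Mirror step: Proposition 3.2 with delta = n * alpha * xi, C = 1.
        zi_new = z[i] - (n * alpha * xi) / col_inf[i]
        zi_new = min(max(zi_new, 0.0), 1.0 / col_inf[i])
        # Gradient step: y_k = x_k + (z_k - z_{k-1}) / (n * alpha * L).
        y = list(x)
        y[i] += (zi_new - z[i]) / (n * alpha * L)
        z[i] = zi_new
    return y
```

With a small instance and placeholder parameters one can watch \(f_\mu \) evolve across iterations; the actual convergence guarantee of course requires the parameter schedule of Algorithm 1.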

Remark 3.1

We use the superscript \(^{(i)}\) on \(\xi _k^{(i)}\), \(\mathsf {y}_k^{(i)}\) and \(\mathsf {z}_k^{(i)}\) to emphasize that the value depends on the choice of i. We use generic parameters \(\tau ,\alpha _k,T\) in the above description and their precise values are presented in Algorithm 1.

Algorithm 1 (\(PacLPSolver\)) appears here as a figure.

Our update on \(\mathsf {y}_k\) is a “gradient descent step” because we shall prove that it never increases the objective (i.e., \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\ge 0\)). Our update on \(\mathsf {z}_k\) is a “mirror descent step” because we shall apply the standard mirror descent analysis [12] to it. We explicitly describe how to implement the mirror step (the proof is straightforward by computing the gradient):

Proposition 3.2

If \(\varDelta _{\mathsf {box}}= \{ x \in \mathbb {R}^n \,:\, x_i \in \big [0, \frac{C}{\Vert A_{:i}\Vert _{\infty }}\big ] \}\) for some constant \(C > 0\), the minimizer \(z = {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle \delta \mathbf {e}_i, z \rangle \big \}\) for any \(\delta \in \mathbb {R}\) and basis vector \(\mathbf {e}_i\) can be computed as follows:

1. \(z \leftarrow \mathsf {z}_{k-1}\).

2. \(z_i \leftarrow z_i - \delta /\Vert A_{:i}\Vert _{\infty }\).

3. If \(z_i < 0\), then \(z_i \leftarrow 0\); if \(z_i > C/\Vert A_{:i}\Vert _{\infty }\), then \(z_i \leftarrow C/\Vert A_{:i}\Vert _{\infty }\).

4. Return z.
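In code, the four steps of Proposition 3.2 amount to one coordinate update followed by a clipping (a sketch; `col_inf[i]` is our name for \(\Vert A_{:i}\Vert _{\infty }\)):

```python
def mirror_step(z_prev, i, delta, col_inf, C=1.0):
    """Proposition 3.2: argmin_{z in box} (1/2)||z - z_prev||_A^2 + <delta e_i, z>.
    Only coordinate i changes, since the objective is separable across
    coordinates."""
    z = list(z_prev)                              # step 1: copy z_{k-1}
    z[i] -= delta / col_inf[i]                    # step 2: unconstrained minimizer
    z[i] = min(max(z[i], 0.0), C / col_inf[i])    # step 3: clip to the box
    return z                                      # step 4
```

For instance, with \(\Vert A_{:i}\Vert _{\infty }= 2\) the unconstrained minimizer of \(\frac{1}{2}\cdot 2\,(z_i - 0.3)^2 + \delta z_i\) is \(z_i = 0.3 - \delta /2\), which the clipping then projects onto \([0, C/2]\).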

We also point out that

Lemma 3.3

Each iteration of \(PacLPSolver\) can be implemented to run in expected O(N / n) time. The total expected running time is O(TN / n).

Lemma 3.3 is not hard to prove, but is anyway included in “Appendix E.5”. It follows from standard implementation tricks which compute \(\mathsf {x}_k\) and \(\mathsf {y}_k\) only implicitly: that is, \(\mathsf {x}_k\) and \(\mathsf {y}_k\) are expressed as linear combinations of two less-frequently-updated vectors.

3.1 Convergence statement

In this section, we focus on proving the following main theorem.

figure b

It is straightforward to use Markov’s bound to turn Theorem 3.4 into a probabilistic statement:

figure c

Proof of Corollary 3.5

Since \(f_\mu (x) \ge -(1+\varepsilon )\mathsf {OPT}\) for every \(x\in \varDelta _{\mathsf {box}}\) according to Proposition 2.4.b, the quantity \(f_\mu (y_T) + (1+\varepsilon )\mathsf {OPT}\) is a non-negative random variable whose expectation satisfies \(\mathbf {E}[f_\mu (y_T) + (1+\varepsilon )\mathsf {OPT}] \le 4\varepsilon \) according to Theorem 3.4. By Markov’s bound, with probability at least 2 / 3 we obtain some \(y_T\) satisfying \(f_\mu (y_T) \le - (1-11\varepsilon )\mathsf {OPT}\), which yields a \((1-O(\varepsilon ))\)-approximate solution to the packing LP according to Proposition 2.4.c. The running time follows from Lemma 3.3. \(\square \)

Before we prove Theorem 3.4 in the subsequent subsections, let us first point out that our iterates \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\) never leave the bounding box \(\varDelta _{\mathsf {box}}\):

Lemma 3.6

We have \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\in \varDelta _{\mathsf {box}}\) for all \(k=0,1,\dots ,T\).

(The proof of Lemma 3.6 is included in “Appendix A.1”, and the main technique already appeared in randomized coordinate descent [17].)

3.2 Step 1: mirror descent guarantee

Following an essentially classical analysis of mirror descent (cf. textbook [12]), our update \(\mathsf {z}_k^{(i)} = {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) satisfies

Lemma 3.7

(mirror descent) For every \(u\in \varDelta _{\mathsf {box}}\), we have

$$\begin{aligned} \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - u \big \rangle \le n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 . \end{aligned}$$

Proof

Denote by \(V_a(b) = \frac{1}{2}\Vert b - a\Vert _A^2\) the function of \(b \in \varDelta _{\mathsf {box}}\) parameterized by \(a \in \varDelta _{\mathsf {box}}\); we have \(\nabla _i V_a(b) = \Vert A_{:i}\Vert _{\infty }\cdot (b_i - a_i)\). In the optimization language, \(V_a(b)\) is the Bregman divergence of the \(\Vert \cdot \Vert _A^2\) regularizer [12]. We derive the following sequence of inequalities:

$$\begin{aligned} \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - u \big \rangle&= \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + \big \langle n \alpha _k \xi _{k}^{(i)}, \mathsf {z}_k^{(i)} - u \big \rangle \\&\overset{\textcircled {{\small 1}}}{\le }\big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + \big \langle - \nabla V_{\mathsf {z}_{k-1}}(\mathsf {z}_k^{(i)}), \mathsf {z}_k^{(i)} - u \big \rangle \\&\overset{\textcircled {{\small 2}}}{=} \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle - \frac{1}{2}\Vert \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)}\Vert _A^2 + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2\\&\quad - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \\&\overset{\textcircled {{\small 3}}}{=} n^2 \alpha _k^2 L \left( \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle - \frac{L}{2}\Vert \mathsf {x}_{k} - \mathsf {y}_k^{(i)}\Vert _A^2 \right) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 \\&\quad - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \\&\le n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)is due to the minimality of \(\mathsf {z}_k^{(i)} = {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{V_{\mathsf {z}_{k-1}}(z) + \big \langle n\alpha _k \xi _{k}^{(i)}, z \big \rangle \big \}\), which implies that \( \big \langle \nabla V_{\mathsf {z}_{k-1}}(\mathsf {z}_k^{(i)}) + n \alpha _k \xi _k^{(i)}, u - \mathsf {z}_k^{(i)} \big \rangle \ge 0\) for all \(u\in \varDelta _{\mathsf {box}}\). Equality \(\textcircled {{\small 2}}\)can be checked for every coordinate \(\ell \in [n]\) as follows:

$$\begin{aligned} - \nabla _\ell V_{\mathsf {z}_{k-1}}(\mathsf {z}_k^{(i)}) \cdot (\mathsf {z}_{k,\ell }^{(i)}-u_\ell )&= \Vert A_{:\ell }\Vert _{\infty }(\mathsf {z}_{k-1,\ell } - \mathsf {z}_{k,\ell }^{(i)}) \cdot (\mathsf {z}_{k,\ell }^{(i)}-u_\ell ) \\&= \Vert A_{:\ell }\Vert _{\infty }\left( - \frac{1}{2}(\mathsf {z}_{k-1,\ell }-\mathsf {z}_{k,\ell }^{(i)})^2 + \frac{1}{2}(u_\ell - \mathsf {z}_{k-1,\ell })^2 \right. \\&\quad \left. - \frac{1}{2}(\mathsf {z}_{k,\ell }^{(i)} - u_\ell )^2\right) . \end{aligned}$$

\(\textcircled {{\small 3}}\)is by our choice of \(\mathsf {y}_k\) which satisfies that \(\mathsf {z}_{k-1} -\mathsf {z}_k^{(i)} = n \alpha _k L (\mathsf {x}_k - \mathsf {y}_k^{(i)})\). \(\square \)

In addition, as a simple corollary of Proposition 3.2, we have the following fact:

Fact 3.8

\(|\mathsf {z}_{k,i}^{(i)} - \mathsf {z}_{k-1,i}| \le \frac{n \alpha _k |\xi _{k,i}^{(i)}|}{\Vert A_{:i}\Vert _{\infty }}\) and \(|\mathsf {y}_{k,i}^{(i)} - \mathsf {x}_{k,i}| = \frac{1}{n \alpha _k L} |\mathsf {z}_{k,i}^{(i)} - \mathsf {z}_{k-1,i}| \le \frac{|\xi _{k,i}^{(i)}|}{L\Vert A_{:i}\Vert _{\infty }} \le \frac{1}{L\Vert A_{:i}\Vert _{\infty }}\). If \(\xi _{k,i}^{(i)} \ge 0\), then \(\mathsf {z}_{k,i}^{(i)} \le \mathsf {z}_{k-1,i}\) and \(\mathsf {y}_{k,i}^{(i)} \le \mathsf {x}_{k,i}\); if \(\xi _{k,i}^{(i)} \le 0\), then \(\mathsf {z}_{k,i}^{(i)} \ge \mathsf {z}_{k-1,i}\) and \(\mathsf {y}_{k,i}^{(i)} \ge \mathsf {x}_{k,i}\).

3.3 Step 2: gradient descent guarantee

We call our update \(\mathsf {y}_k^{(i)} \leftarrow \mathsf {x}_k + \frac{1}{n \alpha _k L} (\mathsf {z}_k^{(i)} - \mathsf {z}_{k-1})\) a gradient descent step because the following lemma guarantees \(f_\mu (\mathsf {y}_k^{(i)}) \le f_\mu (\mathsf {x}_k)\), that is, the objective only decreases; moreover, the objective decreases by at least \(\frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle \).

Lemma 3.9

(gradient descent) We have \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \rangle \ge 0\).

Lemma 3.9, which is characteristic of the PC-LP setting, is strong in the following sense. Even though the update \(y_k^{(i)}\) only depends on the truncated gradient \(\xi ^{(i)}_{k}\), the progress we make is a function of the true gradient \(\nabla _i f_\mu (\mathsf {x}_k)\), including the large component that was discarded. This is possible because the smoothness guarantee of Lemma 2.7.b allows us to take a long coordinate step even though \(f_\mu (x)\) is not Lipschitz-smooth.

Proof of Lemma 3.9

Using Fact 3.8, we write \(\mathsf {y}_k^{(i)} = \mathsf {x}_k + s \lambda \mathbf {e}_i\) for some sign \(s\in \{-1,+1\}\) and step length \(\lambda \in \big [0, \frac{1}{L \Vert A_{:i}\Vert _{\infty }}\big ]\). We first focus on the case \(\nabla _i f_\mu (\mathsf {x}_k) \in [-1,1]\), so that \(\xi _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k)\).

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) =f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {x}_k + s \lambda \mathbf {e}_i) = - s \int _0^\lambda \left( \nabla _i f_\mu (\mathsf {x}_k + s \chi \mathbf {e}_i) \right) d \chi \\&\quad \overset{\textcircled {{\small 1}}}{\ge }s \int _0^\lambda \left( - \nabla _i f_\mu (\mathsf {x}_k) - L \Vert A_{:i}\Vert _{\infty }\cdot s \chi \right) d \chi = - \nabla _i f_\mu (\mathsf {x}_k) \cdot s \lambda - \frac{L \Vert A_{:i}\Vert _{\infty }}{2} \cdot \lambda ^2 \\&\quad \overset{\textcircled {{\small 2}}}{\ge }- \nabla _i f_\mu (\mathsf {x}_k) \cdot s \lambda - \frac{L \Vert A_{:i}\Vert _{\infty }}{2} \cdot \lambda \cdot \frac{|\xi _{k,i}^{(i)}|}{L\Vert A_{:i}\Vert _{\infty }} \overset{\textcircled {{\small 3}}}{=} - \frac{1}{2}\langle \nabla f_\mu (\mathsf {x}_k) , \mathsf {y}_k^{(i)} - \mathsf {x}_k \rangle . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)uses Lemma 2.7.a, \(\textcircled {{\small 2}}\)uses Fact 3.8, and \(\textcircled {{\small 3}}\)uses \(|\xi _{k,i}^{(i)}| = -s \nabla _i f_\mu (\mathsf {x}_k)\) (see also Fact 3.8). Next, we turn to the case of \(\nabla _i f_\mu (\mathsf {x}_k) > 1\). In this case, we have \(s=-1\) and

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) =f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {x}_k - \lambda \mathbf {e}_i) = \int _0^\lambda \nabla _i f_\mu (\mathsf {x}_k - \chi \mathbf {e}_i) d \chi \\&\quad \overset{\textcircled {{\small 1}}}{\ge }\int _0^\lambda \left( 1-\frac{\Vert A_{:i}\Vert _{\infty }L}{2} \chi \right) \nabla _i f_\mu (x) d \chi \\&\quad \overset{\textcircled {{\small 2}}}{\ge }\int _0^\lambda \frac{1}{2} \nabla _i f_\mu (x) d \chi = \frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)uses Lemma 2.7.b and \(\textcircled {{\small 2}}\)uses \(\chi \le \lambda \le \frac{1}{L \Vert A_{:i}\Vert _{\infty }}\). Finally, we have \(\langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle \ge 0\) because \(\nabla _i f_\mu (\mathsf {x}_k)\) and \(\mathsf {x}_{k,i} - \mathsf {y}_{k,i}^{(i)}\) have the same sign, and \(\mathsf {x}_{k,\ell } = \mathsf {y}_{k,\ell }^{(i)}\) for \(\ell \ne i\). \(\square \)

3.4 Step 3: putting all together

We denote by \(\eta _{k}^{(i)} \in \mathbb {R}_{\ge 0}^n \) the vector that is only non-zero at coordinate i, and satisfies \(\eta _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) - \xi _{k,i}^{(i)} \in [0,\infty )\). In other words, the full gradient

$$\begin{aligned} \nabla f_\mu (\mathsf {x}_k) = \mathbf {E}_i[(0,\dots ,n \nabla _i f_\mu (\mathsf {x}_k),\dots ,0)] = \mathbf {E}_i[ n \eta _{k}^{(i)} + n \xi _k^{(i)} ] \end{aligned}$$

can be decomposed (in expectation) into a large non-negative component \(\eta _k^{(i)} \in [0,\infty )^n\) and a truncated component \(\xi _k^{(i)} \in [-1,1]^n\). Recall that \(\eta _k^{(i)}\) does not contribute to the descent steps (see Line 9 of \(PacLPSolver\)). Now, for any \(u \in \varDelta _{\mathsf {box}}\), we can use a basic convexity argument and the mirror descent lemma to compute that

$$\begin{aligned}&\alpha _k (f_\mu (\mathsf {x}_k) - f_\mu (u)) \le \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k-u \big \rangle \nonumber \\&\quad = \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \big \rangle + \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1}-u \big \rangle \nonumber \\&\quad = \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \big \rangle + \mathbf {E}_{i} \left[ \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + \big \langle n \alpha _k\xi _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle \right] \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{=} \frac{(1-\tau ) \alpha _k}{\tau } \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {y}_{k-1} - \mathsf {x}_k \big \rangle \nonumber \\&\qquad + \mathbf {E}_{i} \left[ \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + \big \langle n \alpha _k\xi _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle \right] \end{aligned}$$
(3.1)
$$\begin{aligned}&\quad \overset{\textcircled {{\small 2}}}{\le }\frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1})-f_\mu (\mathsf {x}_k)) \nonumber \\&\qquad + \mathbf {E}_{i} \Big [\; \boxed {\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle } \nonumber \\&\qquad + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] \end{aligned}$$
(3.2)

Above, \(\textcircled {{\small 1}}\)is because \(\mathsf {x}_{k} = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\), which implies that \(\tau (\mathsf {x}_k - \mathsf {z}_{k-1}) = (1-\tau ) (\mathsf {y}_{k-1} - \mathsf {x}_k)\). \(\textcircled {{\small 2}}\)uses convexity and Lemma 3.7. The above computation is motivated by [5], and as we shall see below, it allows one to linearly couple gradient and mirror steps.

Intuitively, the first term in the box of (3.2) is the loss introduced by the large gradient component \(\eta _k^{(i)}\). This part was truncated away and therefore did not contribute to the mirror step. The second term in the box is the loss introduced by mirror descent on the small gradient \(\xi _k^{(i)}\) in Lemma 3.7.
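The decomposition itself is easy to verify numerically. The toy sketch below is our own illustration (a made-up gradient vector, not the solver): it applies the packing truncation \(\min \{v,1\}\), under the assumption that gradients lie in \([-1,\infty )\), and checks that the coordinate estimators average back to the full gradient.

```python
def decompose_gradient(grad):
    """Split grad coordinate-wise into the truncated part xi in [-1,1]^n
    and the large non-negative part eta = grad - xi >= 0
    (packing truncation T(v) = min(v, 1); entries assumed >= -1)."""
    xi = [min(g, 1.0) for g in grad]
    eta = [g - t for g, t in zip(grad, xi)]
    return xi, eta

def expected_estimator(grad):
    """Average over a uniform coordinate i of the one-hot estimator
    n*(eta_i + xi_i)*e_i; this recovers the full gradient."""
    xi, eta = decompose_gradient(grad)
    n = len(grad)
    acc = [0.0] * n
    for i in range(n):                 # each i contributes with weight 1/n
        acc[i] += n * (eta[i] + xi[i]) / n
    return acc
```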

Now comes an important observation. As shown by Lemma 3.10 below, the performance of the gradient step—that is, the objective decrease of \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\)—is at least proportional to the total loss incurred in the box. Intuitively, this means that the progress in the gradient step is so large that it outweighs not only the loss from mirror descent (as is typical in accelerated gradient analyses [5, 29]) but also the loss term introduced by \(\eta _k^{(i)}\).

Lemma 3.10

(gradient descent total guarantee) For every \(u\ge 0\),

$$\begin{aligned} \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 3n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})). \end{aligned}$$

The proof of Lemma 3.10 is a careful case analysis together with several simple applications of Lemma 3.9. We remark that to properly upper bound \(\langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \rangle \), one needs a good upper bound on the coordinates of \(\mathsf {z}_{k-1}\). This is exactly where we need our redundant constraint, which ensures \(\mathsf {z}_{k-1,i} \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) (see Remark 2.5).

Proof of Lemma 3.10

There are three possibilities:

  • If \(\eta _{k,i}^{(i)}=0\), then we must have \(\xi _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) \in [-1,1]\). Lemma 3.9 implies

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad = n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 2 n^2 \alpha _k^2 L \cdot ( f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) ) \end{aligned}$$
  • If \(\eta _{k,i}^{(i)} > 0\) and \(\mathsf {z}_{k,i}^{(i)} > 0\), then we precisely have \(\mathsf {z}_{k,i}^{(i)} = \mathsf {z}_{k-1,i} - \frac{n\alpha _k}{\Vert A_{:i}\Vert _{\infty }}\) (see Proposition 3.2), and accordingly \(\mathsf {y}_{k,i}^{(i)} = \mathsf {x}_{k,i} - \frac{1}{L \Vert A_{:i}\Vert _{\infty }} < \mathsf {x}_{k,i}\). In this case,

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{1}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{<} n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{1}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} n \alpha _k L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }\big ( 2 n \alpha _k L + 2 n^2 \alpha _k^2 L \big ) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

    Above, \(\textcircled {{\small 1}}\)follows from the fact that \(\mathsf {z}_{k-1} \in \varDelta _{\mathsf {box}}\) and therefore \(\mathsf {z}_{k-1,i} \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) by the definition of \(\varDelta _{\mathsf {box}}\), and \(u\ge 0\); \(\textcircled {{\small 2}}\)follows from the fact that \(\mathsf {x}_k\) and \(\mathsf {y}_k^{(i)}\) are only different at coordinate i, and \(\xi _{k,i}^{(i)}=1 < \nabla _i f_\mu (\mathsf {x}_k)\) (since \(\eta _{k,i}^{(i)}>0\)); \(\textcircled {{\small 3}}\)follows from the fact that \(\mathsf {y}_{k}^{(i)} = \mathsf {x}_{k} - \frac{\mathbf {e}_i}{L \Vert A_{:i}\Vert _{\infty }}\); and \(\textcircled {{\small 4}}\)uses Lemma 3.9.

  • If \(\eta _{k,i}^{(i)} > 0\) and \(\mathsf {z}_{k,i}^{(i)} = 0\), then we have

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }\big ( n \alpha _k \nabla _i f_\mu (\mathsf {x}_k) \cdot \mathsf {z}_{k-1,i} \big ) + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{=} \big \langle n \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }4 n^2 \alpha _k^2 L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

    Above, \(\textcircled {{\small 1}}\)is because \(u \ge 0\), \(\nabla _i f_\mu (\mathsf {x}_k) = \eta _{k,i}^{(i)} + 1 > \eta _{k,i}^{(i)}\) and \(\nabla _i f_\mu (\mathsf {x}_k) > \xi _{k,i}^{(i)}\); \(\textcircled {{\small 2}}\)uses the assumption that \(\mathsf {z}_{k,i}^{(i)} = 0\) and the fact that \(\mathsf {z}_{k-1,\ell }=\mathsf {z}_{k,\ell }^{(i)}\) for every \(\ell \ne i\); \(\textcircled {{\small 3}}\)is from our choice of \(\mathsf {y}_k\) which satisfies that \(\mathsf {z}_{k-1} -\mathsf {z}_k^{(i)} = n \alpha _k L (\mathsf {x}_k - \mathsf {y}_k^{(i)})\); and \(\textcircled {{\small 4}}\)uses Lemma 3.9.

Combining the three cases, and using the fact that \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), we conclude that

$$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \le (2n \alpha _k L + 4n^2 \alpha _k^2 L) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) \\&\quad \le 3 n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

Above, the last inequality uses our choice of \(\alpha _k\), which implies \(n \alpha _k \le n \alpha _T = \frac{1}{\varepsilon L} \le \frac{1}{4}\).

Plugging Lemma 3.10 back into (3.2), we have

$$\begin{aligned}&\alpha _k (f_\mu (\mathsf {x}_k) - f_\mu (u)) \le \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k-u \big \rangle \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{\le }\frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1})-f_\mu (\mathsf {x}_k)) \nonumber \\&\quad \quad + \mathbf {E}_{i} \Big [ 3 n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] \nonumber \\&\quad \overset{\textcircled {{\small 2}}}{\le }\alpha _k f_\mu (\mathsf {x}_k) + \big (3n\alpha _k L - \alpha _k \big ) f_\mu (\mathsf {y}_{k-1}) \nonumber \\&\quad \quad + \mathbf {E}_{i} \Big [ -3 n \alpha _k L \cdot f_\mu (\mathsf {y}_k^{(i)}) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] . \end{aligned}$$
(3.3)

Above, \(\textcircled {{\small 1}}\)uses Lemma 3.10; and \(\textcircled {{\small 2}}\)is because we have chosen \(\tau \) to satisfy \(\frac{1}{\tau } = 3 n L \).

Next, recall that we have picked \(\alpha _{k}\) so that \((3n L - 1) \alpha _k = 3n L \cdot \alpha _{k-1}\) in Algorithm 1. Telescoping (3.3) for \(k=1,\dots ,T\) and choosing \(u^*=(1-\varepsilon /2) x^*\), we have

$$\begin{aligned}&- \sum _{k=1}^T \alpha _k f_\mu (u^*) \le 3 f_\mu (\mathsf {y}_0) - 3 n\alpha _T L \cdot \mathbf {E}[f_\mu (\mathsf {y}_T)] + \Vert \mathsf {z}_0 - u^*\Vert _A^2 \\&\quad \le - 3 n\alpha _T L \cdot \mathbf {E}[f_\mu (\mathsf {y}_T)] + \mathsf {OPT}. \end{aligned}$$

Here, the second inequality is due to \(f_\mu (\mathsf {y}_0) = f_\mu (x^{\mathsf {start}}) \le 0\) from Fact 2.8, and the fact that

$$\begin{aligned}\Vert \mathsf {z}_0 - u^*\Vert _A^2 = \Vert u^*\Vert _A^2 = \sum _{i=1}^n (u^*_i)^2 \cdot \Vert A_{:i}\Vert _{\infty }\le \sum _{i=1}^n (x^*_i)^2 \cdot \Vert A_{:i}\Vert _{\infty }\le \sum _{i=1}^n x^*_i = \mathsf {OPT}. \end{aligned}$$

Finally, using the fact that \(\sum _{k=1}^T \alpha _k = \alpha _T \cdot \sum _{k=0}^{T-1} \big (1 - \frac{1}{3nL}\big )^k = 3 n \alpha _T L \big (1-(1-\frac{1}{3nL})^T\big )\), we rearrange and obtain that

$$\begin{aligned} \mathbf {E}[f_\mu (\mathsf {y}_T)] \le&\frac{\sum _k \alpha _k}{3 n\alpha _T L} f_\mu (u^*) + \frac{1}{3 n\alpha _T L} \mathsf {OPT}= \big (1-\big (1-\frac{1}{3nL}\big )^T\big ) f_\mu (u^*) \\&+ \frac{1}{3 n\alpha _T L} \mathsf {OPT}. \end{aligned}$$

We choose \(T = \lceil 3 n L \log (1/\varepsilon ) \rceil \) so that \(\frac{1}{n\alpha _T L} = (1-\frac{1}{3 n L})^T \le \varepsilon \). Combining this with the fact that \(f_\mu (u^*)\le -(1-\varepsilon )\mathsf {OPT}<0\) (see Proposition 2.4.a), we obtain

$$\begin{aligned} \mathbf {E}[f_\mu (\mathsf {y}_T)] \le (1-\varepsilon )f_\mu (u^*) + \varepsilon /3 \cdot \mathsf {OPT}< -(1-3\varepsilon )\mathsf {OPT}. \end{aligned}$$

Therefore, we have finished proving Theorem 3.4. \(\square \)
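The step-size bookkeeping in this proof can be sanity-checked numerically. The sketch below (with arbitrary placeholder values for n, L and T) generates \(\alpha _k\) from the recurrence \((3nL-1)\alpha _k = 3nL\,\alpha _{k-1}\) and verifies the closed form \(\sum _{k=1}^T \alpha _k = 3n\alpha _T L\big (1-(1-\frac{1}{3nL})^T\big )\) used above:

```python
import math

def alpha_schedule(n, L, T, alpha0=1.0):
    """Generate alpha_1 .. alpha_T from the recurrence
    (3nL - 1) * alpha_k = 3nL * alpha_{k-1}."""
    r = 3 * n * L
    alphas = [alpha0]
    for _ in range(T):
        alphas.append(alphas[-1] * r / (r - 1))
    return alphas[1:]

def sum_identity_holds(n, L, T):
    """Check sum_k alpha_k == 3*n*alpha_T*L * (1 - (1 - 1/(3nL))^T)."""
    alphas = alpha_schedule(n, L, T)
    lhs = sum(alphas)
    rhs = 3 * n * L * alphas[-1] * (1 - (1 - 1 / (3 * n * L)) ** T)
    return math.isclose(lhs, rhs, rel_tol=1e-9)
```

The identity holds for any \(\alpha _0 > 0\), since the recurrence fixes each \(\alpha _k\) as a geometric multiple of \(\alpha _T\).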

4 Relaxation of the covering linear program

Since \(PacLPSolver\) gives only an approximate packing LP solution, we cannot infer from it a dual covering LP solution. Therefore, we have to work on a relaxed version of covering LP directly. For input matrix \(A \in \mathbb {R}_{\ge 0}^{m \times n}\), we rewrite the covering LP problem (1.2) as follows in order to be notationally close to packing LP:

$$\begin{aligned} \min _{x \ge 0 } \{ {\mathbbm {1}}^{T} x \,:\, A x \ge {\mathbbm {1}}\} . \end{aligned}$$
(4.1)

We denote by \(\mathsf {OPT}\) the optimal value to this LP, and by \(x^*\) any of its optimal solutions. We say that x is a \((1+\varepsilon )\)-approximation for the covering LP if \(A x \ge {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} x \le (1+\varepsilon )\mathsf {OPT}\).

Again, we use indices \(i\in [n]\) for the columns of A, and indices \(j\in [m]\) for the rows of A. We denote by \(A_{:i}\) the i-th column vector of A, and \(A_{j :}\) the j-th row vector of A. We assume without loss of generality by simple scaling thatFootnote 10

$$\begin{aligned} \min _{j\in [m]}\{\Vert A_{j :}\Vert _{\infty }\}=1 . \end{aligned}$$
(4.2)

Proposition 4.1

The normalization (4.2) ensures \(\mathsf {OPT}\in [1,m]\).

Proof

Suppose that \(j^*\) is the row achieving the smallest infinity norm \(\Vert A_{j :}\Vert _{\infty }\) over all rows \(j\in [m]\). Then, for any solution \(x \in \mathbb {R}_{\ge 0}^n\) satisfying \(\langle A_{j^* :}, x \rangle \ge 1\), we must have \({\mathbbm {1}}^{T} x \ge 1 / \Vert A_{j^* :} \Vert _{\infty } = 1\) using (4.2). On the other hand, we can construct a feasible solution x as follows. Initialize \(x = 0\); then, for each row j, find the column i that maximizes the entry \(A_{j i}\) over all columns, and increase \(x_i\) by \(1/ A_{j i} = 1 / \Vert A_{j :}\Vert _{\infty }\). After exhausting all m rows, we arrive at some \(x \ge 0\) satisfying \(Ax \ge {\mathbbm {1}}\) as well as \({\mathbbm {1}}^{T} x = \sum _j 1 / \Vert A_{j :}\Vert _{\infty }\le m\). \(\square \)
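Both the normalization (4.2) and the constructive half of this proof are directly implementable. The sketch below is our own illustration for dense nonnegative matrices with no all-zero rows: it scales A so that \(\min _j \Vert A_{j:}\Vert _{\infty }=1\), then builds the feasible x from the proof.

```python
def normalize_rows(A):
    """Scale the nonnegative matrix A so that min_j ||A_j:||_inf = 1
    (normalization (4.2)); scaling A rescales OPT correspondingly."""
    m = min(max(row) for row in A)       # entries nonnegative, rows nonzero
    return [[a / m for a in row] for row in A]

def crude_cover(A):
    """Build x >= 0 with Ax >= 1 and 1^T x = sum_j 1/||A_j:||_inf <= m,
    as in the proof of Proposition 4.1."""
    n = len(A[0])
    x = [0.0] * n
    for row in A:
        i = max(range(n), key=lambda t: row[t])   # column of largest entry
        x[i] += 1.0 / row[i]                      # = 1 / ||A_j:||_inf
    return x
```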

In our covering LP solvers, we assume that an initial solution achieving a constant approximation is available to the algorithm. Such a solution can be obtained, for instance, by running the covering LP solver of Young [36] with a constant \(\varepsilon \), in time \(O(N \log N)\).

Definition 4.2

Let \(x^{\sharp }\) be a given 2-approximate solution to the covering problem, and let \(\mathsf {OPT}' {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\mathbbm {1}}^{T} x^{\sharp } \in [\mathsf {OPT}, 2\mathsf {OPT}]\). Without loss of generality, assume \(\mathsf {OPT}' \ge 2\).

We now introduce the smoothed objective \(f_\mu (x)\) we are going to minimize in order to solve covering LP. Symmetric to the case in the packing solver, this smoothed objective \(f_\mu (x)\) turns each row of the LP constraint \(Ax \ge {\mathbbm {1}}\) into an exponential penalty function.

Definition 4.3

Letting \(\mu {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{\varepsilon }{4 \log (nm/\varepsilon )}\), we define the smoothed objective \(f_\mu (x)\) as

$$\begin{aligned} \textstyle f_\mu (x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mu \sum _{j=1}^m e^{\frac{1}{\mu } (1 - (Ax)_j)} + {\mathbbm {1}}^{T} x . \end{aligned}$$

Fact 4.4

\(\nabla f_\mu (x) = {\mathbbm {1}}- A^T p(x)\) and \(\nabla ^2 f_\mu (x) = \frac{1}{\mu } A^T \mathrm {diag}\{p(x)\}A\), where \(p_j(x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}e^{\frac{1}{\mu } (1 - (Ax)_j)}\).
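Definition 4.3 and Fact 4.4 can be checked numerically; the sketch below (toy data, our own illustration) evaluates \(f_\mu \) and its closed-form gradient, and compares the latter against central finite differences.

```python
import math

def f_mu(A, x, mu):
    """Smoothed covering objective (Definition 4.3):
    mu * sum_j exp((1 - (Ax)_j)/mu) + 1^T x."""
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]
    return mu * sum(math.exp((1 - v) / mu) for v in Ax) + sum(x)

def grad_f_mu(A, x, mu):
    """Fact 4.4: grad f_mu(x) = 1 - A^T p(x), with
    p_j(x) = exp((1 - (Ax)_j)/mu)."""
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]
    p = [math.exp((1 - v) / mu) for v in Ax]
    n = len(x)
    return [1.0 - sum(A[j][i] * p[j] for j in range(len(A)))
            for i in range(n)]
```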

We present some properties of \(f_\mu (x)\). Together they imply that the minimum of \(f_\mu (x)\) is around \(\mathsf {OPT}\), and that finding the approximate minimum of \(f_\mu (x)\) up to an additive error of \(O(\varepsilon \mathsf {OPT})\) corresponds to a \((1+O(\varepsilon ))\)-approximate solution to the covering LP (4.1). The proofs are analogous to those in Sect. 2, and are included in “Appendix B.2” for completeness.

Proposition 4.5

  (a) \(f_\mu (u^*) \le (1+\varepsilon )\mathsf {OPT}\) for \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) x^*\).

  (b) \(f_\mu (x) \ge (1-\varepsilon )\mathsf {OPT}\) for every \(x \ge 0\).

  (c) For any \(x \ge 0\) satisfying \(f_\mu (x) \le 2\mathsf {OPT}\), we must have \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\).

  (d) If \(x\ge 0\) satisfies \(f_\mu (x) \le (1+\delta )\mathsf {OPT}\) for some \(\delta \in [0,1]\), then \(\frac{1}{1-\varepsilon } x\) is a \(\frac{1+\delta }{1-\varepsilon }\)-approximate solution to the covering LP.

5 Our covering LP solver in the well-conditioned case

Recall that in packing LPs, since \(0 \le x^*_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) (see Fact 2.1), we can minimize \(f_\mu \) over a bounding box \(\varDelta _{\mathsf {box}}\). Unfortunately, the bound \(x^*_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) no longer holds in covering LPs, so one cannot directly turn \(PacLPSolver\) into its symmetric version to solve covering LP.

In this section, we show that this symmetric covering LP solver still solves all well-behaved covering LP instances. Specifically, we say the covering LP is well-behaved if:Footnote 11

Assumption 5.1

There exists some optimal covering LP solution \(x^*\) satisfying \(x^*_i \le \frac{9}{\Vert A_{:i}\Vert _{\infty }}\); and the initial point \(x^{\sharp }\) satisfies \(x^{\sharp }_i \le \frac{9}{\Vert A_{:i}\Vert _{\infty }}\).

For instance, well-behaved instances naturally arise from those where the constraints \(A x \ge {\mathbbm {1}}\) are non-redundant. If the optimal covering LP solution \(x^*\) and the initial point \(x^{\sharp }\) satisfy \({\mathbbm {1}}\le A x^* \le 9 \cdot {\mathbbm {1}}\) and \({\mathbbm {1}}\le A x^{\sharp } \le 9 \cdot {\mathbbm {1}}\), then Assumption 5.1 is satisfied.

Well-behaved covering LP problems immediately satisfy the following:

Fact 5.2

Define \(\varDelta _{\mathsf {box}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ x \in \mathbb {R}^n \,:\, x_i \in \big [0, \frac{10}{\Vert A_{:i}\Vert _{\infty }}\big ] \}\). Under Assumption 5.1, we have \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2)x^* \in \varDelta _{\mathsf {box}}\) and \(x^{\mathsf {start}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) \cdot x^{\sharp } \in \varDelta _{\mathsf {box}}\). Also, it satisfies \(f_\mu (x^{\mathsf {start}}) \le 3 \mathsf {OPT}\).

Proof

The claims \(u^*, x^{\mathsf {start}}\in \varDelta _{\mathsf {box}}\) are trivial once we notice \(\varepsilon \le 1/30\). Using the fact that \(Ax^{\mathsf {start}}- {\mathbbm {1}}= (1+\varepsilon /2) A x^{\sharp } - {\mathbbm {1}}\ge \varepsilon /2 \cdot {\mathbbm {1}}\), we compute \(f_\mu (x^{\mathsf {start}})\) as follows:

$$\begin{aligned} f_\mu (x^{\mathsf {start}})&= \mu \sum _j e^{\frac{1}{\mu } (1 - (Ax^{\mathsf {start}})_j)} + {\mathbbm {1}}^{T} x^{\mathsf {start}}\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} + 2\mathsf {OPT}\\&\le \frac{\mu m}{(nm)^2} + 2\mathsf {OPT}< 3 \mathsf {OPT}. \end{aligned}$$

\(\square \)

figure d

We now describe \(CovLPSolver^{\mathsf {wb}}\) (a symmetric variant of \(PacLPSolver\)) for solving well-behaved covering LP problems; see Algorithm 2. It starts with the initial vectors \(\mathsf {x}_0 = \mathsf {y}_0 = x^{\mathsf {start}}\) and \(\mathsf {z}_0 = 0\). \(CovLPSolver^{\mathsf {wb}}\) runs for T iterations. In each iteration k, it computes a weighted midpoint \(\mathsf {x}_k \leftarrow \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) for some parameter \(\tau \in (0,1)\), and then proceeds to compute \(\mathsf {y}_k\) and \(\mathsf {z}_k\) as follows.

We select \(i\in [n]\) uniformly at random. Let \(\xi _k^{(i)} = (0,\dots ,0,-\mathbb {T}^\mathsf {p}(v),0,\dots ,0)\) be the vector that is only non-zero at coordinate i, where \(v = - \nabla _i f_\mu (\mathsf {x}_k) \in [-1,\infty )\), and recall \(\mathbb {T}^\mathsf {p}(v) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\min \{v, 1\}\). We refer to \(\xi _k^{(i)}\) as the truncated gradient. Next,

  • Perform a mirror (descent) step: \(\mathsf {z}_k \leftarrow \mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) for some parameter \(\alpha _k \ll 1/n\) to be chosen later.

  • Perform a gradient (descent) step: \(\mathsf {y}_k \leftarrow \mathsf {y}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mathsf {x}_k + \frac{1}{n \alpha _k L} (\mathsf {z}_k^{(i)} - \mathsf {z}_{k-1})\).
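A single iteration of this scheme can be sketched in a few lines. The code below is our own illustration with generic parameters \(\tau , \alpha _k, L\); the gradient oracle `grad_i` and the column norms `col_norms` are placeholders we assume given, and the box upper bound defaults to \(C = 10\) as in Fact 5.2.

```python
def clamp(v, lo, hi):
    return min(max(v, lo), hi)

def cov_wb_step(y_prev, z_prev, i, grad_i, tau, alpha, L, col_norms, C=10.0):
    """One iteration of the well-behaved covering solver (sketch):
    midpoint -> truncated gradient -> mirror step -> gradient step."""
    n = len(y_prev)
    # weighted midpoint x_k = tau*z_{k-1} + (1-tau)*y_{k-1}
    x = [tau * z + (1 - tau) * y for z, y in zip(z_prev, y_prev)]
    v = -grad_i(x, i)                      # v in [-1, inf)
    xi_i = -min(v, 1.0)                    # truncated gradient T^p(v)
    # coordinate mirror step over [0, C/||A_:i||_inf] (Proposition 3.2
    # with delta = n * alpha * xi_i)
    z = list(z_prev)
    z[i] = clamp(z[i] - n * alpha * xi_i / col_norms[i],
                 0.0, C / col_norms[i])
    # gradient step y_k = x_k + (z_k - z_{k-1}) / (n * alpha * L)
    y = list(x)
    y[i] += (z[i] - z_prev[i]) / (n * alpha * L)
    return x, y, z
```

Only coordinate i of z and y moves, matching the O(N / n) per-iteration cost.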

This finishes the description of \(CovLPSolver^{\mathsf {wb}}\). It is not surprising that the following theorem, analogous to Theorem 3.4, holds; we include its proof in “Appendix C.3” for completeness.

figure e

Again, using the same proof as Corollary 3.5, one can apply Markov’s bound to turn Theorem 5.3 into a probabilistic statement:

figure f

Removing the well-behavior assumption. In subsequent work, appearing after the conference presentation of this paper, Wang, Mahoney, and Rao [34] showed the following theorem.

Theorem 5.5

([34]) Any covering LP with constraint matrix \(A \in \mathbb {R}^{m \times n}_{\ge 0}\) of sparsity N can be converted into an equivalent but well-behaved covering LP with matrix \(\widetilde{A} \in \mathbb {R}^{m \times n \cdot O(\log (mn/\varepsilon ))}\) and sparsity \(N \cdot O(\log (mn/\varepsilon ))\). The conversion takes time \(N \cdot O(\log (mn/\varepsilon )).\)

As a result, we can apply our covering solver \(CovLPSolver^{\mathsf {wb}}\) to this modified LP and invoke Theorem 5.3 to solve any covering LP in expected time \( O(\frac{\log ^2(nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon } N).\)

6 Our covering LP solver in the general case

In this section, we remove the well-behavior assumption and propose a different algorithm, \(CovLPSolver\), that solves all covering LP instances. This algorithm loses a factor of \(1/\sqrt{\varepsilon }\) in the running time, but is a direct covering LP solver that does not rely on any reduction.

The main difference from \(PacLPSolver\) and \(CovLPSolver^{\mathsf {wb}}\) is that this time we abandon the box constraint and study the minimization of \(f_\mu (x)\) over a simplex

$$\begin{aligned} x\in \varDelta _{\mathsf {simplex}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ x \in \mathbb {R}^n \,:\, x_i \ge 0 \; \wedge \; {\mathbbm {1}}^{T} x\le 2 \mathsf {OPT}'\} . \end{aligned}$$

Again, the constraint \({\mathbbm {1}}^{T} x \le 2\mathsf {OPT}'\) is redundant, just like the old \(\varDelta _{\mathsf {box}}\) constraint for packing LP (recall Remark 2.5); however, it shall be used to ensure that our updates always stay inside \(\varDelta _{\mathsf {simplex}}\). It is a simple fact that

Fact 6.1

\(u^*{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2)x^* \in \varDelta _{\mathsf {simplex}}\).

Recall that the initial vector \(x^\sharp \) is defined in Definition 4.2, and \(\mathsf {OPT}'\) is a crude approximation to \(\mathsf {OPT}\), satisfying \(\mathsf {OPT}' {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\mathbbm {1}}^{T} x^{\sharp } \in [\mathsf {OPT}, 2\mathsf {OPT}]\). We choose a starting vector \(x^{\mathsf {start}}\) different from that of Sect. 5:

Proposition 6.2

Letting \(x^{\mathsf {start}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) \cdot x^{\sharp } + (\frac{1}{n},\dots ,\frac{1}{n})\), we have \(x^{\mathsf {start}}\in \varDelta _{\mathsf {simplex}}\) and \(f_\mu (x^{\mathsf {start}}) \le 4 \mathsf {OPT}\).

Proof

Using \(Ax^{\mathsf {start}}- {\mathbbm {1}}\ge (1+\varepsilon /2) A x^{\sharp } - {\mathbbm {1}}\ge \varepsilon /2 \cdot {\mathbbm {1}}\), we compute \(f_\mu (x^{\mathsf {start}})\) as follows:

$$\begin{aligned} f_\mu (x^{\mathsf {start}})&= \mu \sum _j e^{\frac{1}{\mu } (1 - (Ax^{\mathsf {start}})_j)} + {\mathbbm {1}}^{T} x^{\mathsf {start}}\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} + 2\mathsf {OPT}+ 1\\&\le \frac{\mu m}{(nm)^2} + 3\mathsf {OPT}< 4 \mathsf {OPT}. \end{aligned}$$

Also, we have \({\mathbbm {1}}^{T} x^{\mathsf {start}}\le (1+\varepsilon /2) \mathsf {OPT}'+1 \le 2 \mathsf {OPT}'\). (Recall \(\mathsf {OPT}'\ge 2\) in Definition 4.2.) \(\square \)

figure g

Our proposed algorithm \(CovLPSolver\) starts with the initial vector \(\mathsf {x}_0 = \mathsf {y}_0 = \mathsf {z}_0 = x^{\mathsf {start}}\) and is divided into T iterations. In each iteration k, as usual, it computes a weighted midpoint \(\mathsf {x}_k \leftarrow \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) for some parameter \(\tau \in (0,1)\), and then computes \(\mathsf {y}_k\) and \(\mathsf {z}_k\) as follows.

We select \(i\in [n]\) uniformly at random, and let \(\xi _k^{(i)} = (0,\dots ,0,\mathbb {T}^\mathsf {c}(v),0,\dots ,0)\) be the vector that is only non-zero at coordinate i, where \(v = \nabla _i f_\mu (\mathsf {x}_k) \in (-\infty ,1]\) and \(\mathbb {T}^\mathsf {c}(v) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\max \{-\beta , v\}\) is the new thresholding function for some parameter \(\beta {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\sqrt{\varepsilon }\). Then,

  • Perform a mirror (descent) step \(\mathsf {z}_k \leftarrow \mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle (1+\gamma ) n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) for some positive parameters \(\gamma \ll 1\) and \(\alpha _k \ll 1/n\), where

    $$\begin{aligned}\textstyle V_x(y) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\sum _{i=1}^n y_i \log \frac{y_i}{x_i} + x_i - y_i\end{aligned}$$

    is the so-called Bregman divergence of the generalized entropy function (cf. [12]).

  • If \(\nabla _i f_\mu (\mathsf {x}_k) < -\beta \), perform a gradient (descent) step \(\mathsf {y}_k \leftarrow \mathsf {y}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mathsf {x}_k + \delta \mathbf {e}_i\) for some \(\delta >0\). In practice, one can line-search over \(\delta \), but we choose an explicit \(\delta \) as follows.

    • Denote by \(\pi \) the permutation that satisfies \(A_{\pi (1),i} \le \cdots \le A_{\pi (m),i}\).

    • Pick \(j^* \in [m]\) s.t. \(\left\{ \begin{array}{ll} \sum _{j<j^*} A_{\pi (j),i} \cdot p_{\pi (j)}(\mathsf {x}_k) < 1+\beta \\ \sum _{j\le j^*} A_{\pi (j),i} \cdot p_{\pi (j)}(\mathsf {x}_k) \ge 1+\beta \end{array} \right. \,.\)  Such \(j^*\) exists because

      $$\begin{aligned} \textstyle \sum _{j=1}^m A_{\pi (j),i} \cdot p_{\pi (j)}(\mathsf {x}_k) = \sum _{j=1}^m A_{ji} \cdot p_j(\mathsf {x}_k) = 1 - \nabla _i f_\mu (\mathsf {x}_k) \ge 1+\beta . \end{aligned}$$
      (6.1)
    • Set \(\delta = \frac{\mu \beta }{2 A_{\pi (j^*),i}}\).

This finishes the description of our \(CovLPSolver\).
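The selection of \(j^*\) and \(\delta \) above amounts to a prefix-sum search over the sorted column. A sketch, with `A` as a list of rows, `p` the vector of penalties \(p_j(\mathsf {x}_k)\), and all names illustrative:

```python
def gradient_step_size(A, p, i, beta, mu):
    """Return delta = mu*beta / (2 * A[pi(j*)][i]) as in CovLPSolver (sketch).
    Assumes coordinate i is a large-gradient index, i.e.
    sum_j A[j][i] * p[j] = 1 - grad_i f_mu(x_k) >= 1 + beta."""
    m = len(A)
    order = sorted(range(m), key=lambda j: A[j][i])   # the permutation pi
    prefix = 0.0
    for j in order:
        prefix += A[j][i] * p[j]
        if prefix >= 1.0 + beta:                      # j plays the role of pi(j*)
            return mu * beta / (2.0 * A[j][i])
    raise ValueError("coordinate i is not a large-gradient index")
```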

Remark 6.3

We use the superscript \(^{(i)}\) on \(\xi _k^{(i)}\), \(\mathsf {y}_k^{(i)}\) and \(\mathsf {z}_k^{(i)}\) to emphasize that the value depends on the choice of i. We have used generic parameters \(\tau ,\alpha _k,T\) in the above description and their precise values are presented in Algorithm 3.

Our update on \(\mathsf {y}_k\) is a “gradient descent step” because we shall prove that it never increases the objective (i.e., \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\ge 0\)). Our update on \(\mathsf {z}_k\) is a “mirror descent step” because we shall apply the standard mirror descent analysis [12] to it. We explicitly describe how to implement this mirror step (proved in “Appendix D.4”):

Proposition 6.4

If \(\mathsf {z}_{k-1}\in \varDelta _{\mathsf {simplex}}\) and \(\mathsf {z}_{k-1}>0\), the minimizer \(z = {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle \delta \mathbf {e}_i, z \rangle \big \}\) for any scalar \(\delta \in \mathbb {R}\) and basis vector \(\mathbf {e}_i\) can be computed as follows:

  1. \(z \leftarrow \mathsf {z}_{k-1}\).

  2. \(z_i \leftarrow z_i \cdot e^{-\delta }\).

  3. If \({\mathbbm {1}}^{T} z > 2\mathsf {OPT}'\), \(z \leftarrow \frac{2\mathsf {OPT}'}{{\mathbbm {1}}^{T} z} z\).

  4. Return z.
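The steps of Proposition 6.4 translate directly into code; a sketch with plain Python lists, where `opt_prime` stands for \(\mathsf {OPT}'\) and the names are illustrative:

```python
import math

def entropy_mirror_step(z_prev, i, delta, opt_prime):
    """Minimize V_{z_prev}(z) + delta * z_i over Delta_simplex
    (Proposition 6.4, sketch): a multiplicative update on coordinate i,
    followed by a rescaling whenever the total mass exceeds 2*OPT'."""
    z = list(z_prev)                          # step 1: copy z_{k-1}
    z[i] *= math.exp(-delta)                  # step 2: z_i <- z_i * e^{-delta}
    total = sum(z)
    if total > 2.0 * opt_prime:               # step 3: scale back into simplex
        z = [zj * 2.0 * opt_prime / total for zj in z]
    return z                                  # step 4
```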

We also point out that

Lemma 6.5

Each iteration of \(CovLPSolver\) can be implemented to run in expected O(N / n) time. The total expected running time is O(TN / n).

The proof of Lemma 6.5 is analogous to its packing counterpart and is included in Sect. F.6.

6.1 Main proof ideas and ingredients

In \(CovLPSolver\), we pick a random coordinate \(i\in [n]\) at each iteration, and decompose \(\nabla _i f_\mu (\mathsf {x}_k) = \xi + \eta \), where \(\eta \in (-\infty , 0]\) is the (negative) large gradient component and \(\xi \in [-\sqrt{\varepsilon },1]\) is the truncated gradient component. In other words, we truncate the gradient \(\nabla _i f_\mu (\mathsf {x}_k)\) at the negative threshold \(-\beta = -\sqrt{\varepsilon }\), rather than at \(-1\) as in \(CovLPSolver^{\mathsf {wb}}\).

The reason for this new threshold \(-\sqrt{\varepsilon }\) can be understood as follows. In \(PacLPSolver\) (and symmetrically in \(CovLPSolver^{\mathsf {wb}}\)), we used Lemma 3.10 to show that our gradient descent step \(\mathsf {y}_k\) decreases the objective by an amount that includes both the \(\xi \) and \(\eta \) components. Unfortunately, for covering LP, this decrease is proportional only to \(\eta \) and not to \(\xi \) (compare Lemma 3.10 with Lemma 6.14 later). This forces us to treat the small gradient \(\xi \) separately using mirror descent, rather than gradient descent.

If \(\xi \) were in \([-1, 1]\), the classical theory of mirror descent [12] (or multiplicative weight updates [7]) would imply that the mirror step on \(\mathsf {z}_k\) converges at a rate \(\propto \varepsilon ^{-2}\). This is too slow. Instead, since we have truncated \(\xi \) into the smaller interval \([-\sqrt{\varepsilon },1]\), we can use a negative-width technique (see Sect. 6.5) to improve this mirror-descent convergence rate from \(\varepsilon ^{-2}\) to \(\varepsilon ^{-1.5}\).

On the other hand, due to this truncation at \(-\sqrt{\varepsilon }\) instead of \(-1\), our gradient step on \(\mathsf {y}_k\) also converges more slowly, at a rate \(1/\varepsilon ^{1.5}\) instead of \(1/\varepsilon \). This is why \(\beta = \sqrt{\varepsilon }\) is the best truncation threshold: it balances the gradient and mirror descent steps.
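One heuristic way to see the choice \(\beta = \sqrt{\varepsilon }\): treating the threshold as a free parameter \(\beta \in (0,1]\), the two paragraphs above suggest a mirror-descent rate scaling like \(\beta /\varepsilon ^2\) and a gradient-step rate like \(1/(\varepsilon \beta )\). This scaling is our back-of-the-envelope reading, not a formal claim from the proof; balancing the two rates gives

```latex
% Heuristic balance of the two convergence rates as functions of beta:
\frac{\beta}{\varepsilon^{2}} \;=\; \frac{1}{\varepsilon \beta}
\quad\Longleftrightarrow\quad
\beta^{2} = \varepsilon
\quad\Longleftrightarrow\quad
\beta = \sqrt{\varepsilon},
\qquad \text{giving the common rate } \varepsilon^{-1.5}.
```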

Another ingredient behind our proof is a new distance bound that is uncommon in first-order analysis. Recall that, given a convex function g(x), traditional analysis applies the convexity argument \(g(x) - g(x^*) \le \langle \nabla g(x), x - x^* \rangle \) to bound the objective distance to the optimum. If \(g(x) = e^{-x}\) is univariate, \(x = 1\), and \(x^* = 100\), this bound becomes \(e^{-1} \approx e^{-1} - e^{-100} \le e^{-1} \cdot 99\), which is very loose. In our analysis, we replace this convexity argument with a more benign bound, specifically designed for covering LP (see Lemma 6.10).

6.2 Convergence statement

The main convergence theorem of this section is as follows:

figure h

Again, using the same proof as Corollary 3.5, one can apply Markov’s bound to turn Theorem 6.6 into a probabilistic statement:

figure i

Before delving into the proof of Theorem 6.6, we make the following observations:

Fact 6.8

For every \(k\in \{0,1,\dots ,T\}\), we have \(\mathsf {x}_k, \mathsf {y}_k \ge 0\), \(\mathsf {z}_k > 0\), and \(\mathsf {z}_k \in \varDelta _{\mathsf {simplex}}.\)

Proof

Since \(x^{\mathsf {start}}\) satisfies \({\mathbbm {1}}^{T} x^{\mathsf {start}}\le 2\mathsf {OPT}'\) by Proposition 6.2, we have \(\mathsf {z}_0 = x^{\mathsf {start}}\in \varDelta _{\mathsf {simplex}}\). Also, the mirror descent step (see Proposition 6.4) ensures \(\mathsf {z}_{k,i} > 0\) for all rounds k and coordinates i, as well as \(\mathsf {z}_k \in \varDelta _{\mathsf {simplex}}\) for all rounds k. In contrast, \(\mathsf {x}_k\) and \(\mathsf {y}_k\) may not necessarily lie inside \(\varDelta _{\mathsf {simplex}}\), but they always stay non-negative. \(\square \)

We prove Theorem 6.6 in the subsequent subsections.

6.3 Step 1: distance adjustment

Using convexity, one can obtain

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (u) \le \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - u \rangle \quad \text {for every}\, u\in \varDelta _{\mathsf {simplex}}. \end{aligned}$$
(6.2)

Note that inequality (6.2) can be very loose for exponential functions. For instance, if \(f_\mu (x)\) were as simple as \(e^x\), then the convexity inequality \(e^b - e^a \le e^b \cdot (b-a)\) says

  • when \(b = 2\) and \(a=-10\), we have \(e^{2} - e^{-10} \le 12 e^2\);

  • when \(b=2\) and \(a=-100\), we have \(e^2 - e^{-100} \le 102 e^2\).

Although \(e^{-100} \approx e^{-10}\), the two upper bounds differ by almost a factor of 10.
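These two bounds are easy to check numerically; a quick sketch:

```python
import math

# Check the convexity bound e^b - e^a <= e^b * (b - a) for the two cases above.
def convexity_bound(b, a):
    return math.exp(b) - math.exp(a), math.exp(b) * (b - a)

lhs1, rhs1 = convexity_bound(2.0, -10.0)    # upper bound 12 * e^2
lhs2, rhs2 = convexity_bound(2.0, -100.0)   # upper bound 102 * e^2
# The two left-hand sides agree to about five decimal digits, yet the two
# upper bounds differ by a factor of 102/12 = 8.5.
```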

In this section, we strengthen (6.2) in the special case of \(u = u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2)x^*\). For analysis purposes, let \(\widetilde{A}\) be the adjusted matrix of A, described as follows.

Definition 6.9

(adjusted matrix \(\widetilde{A}\)) For each row \(j\in [m]\), if \((A u^*)_j \le 2\) then we keep it and let \(\widetilde{A}_{j :} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}A_{j :}\). Otherwise—that is, if \((A u^*)_j > 2\)—we define \(\widetilde{A}_{j :} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{2}{(A u^*)_j} \cdot A_{j :}\) to be the same j-th row \(A_{j :}\), scaled down by a factor of \(\frac{2}{(A u^{*})_{j}}\). It is clear from this definition that

$$\begin{aligned} A_{ji} \ge \widetilde{A}_{ji}\quad {\text {for all}}\quad (i, j)\in [n] \times [m] \quad {\hbox {and}} \quad (1+\varepsilon /2)\,{\mathbbm {1}}\le \widetilde{A}u^{*} \le 2 \cdot {\mathbbm {1}}. \end{aligned}$$
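Definition 6.9 can be sketched as a simple row-scaling pass; illustrative code, with `A` as a list of rows and `u_star` playing the role of \(u^*\):

```python
def adjusted_matrix(A, u_star):
    """Build the adjusted matrix of Definition 6.9 (sketch): rows j with
    (A u*)_j > 2 are scaled down by 2 / (A u*)_j; all other rows are kept."""
    A_tilde = []
    for row in A:
        r = sum(a * u for a, u in zip(row, u_star))   # (A u*)_j
        if r > 2.0:
            A_tilde.append([a * 2.0 / r for a in row])
        else:
            A_tilde.append(list(row))
    return A_tilde
```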

Lemma 6.10

(distance adjustment)

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (u^*)&\le \langle {\mathbbm {1}}- A^T p(\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle + \varepsilon \mathsf {OPT}\\&= \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle + \varepsilon \mathsf {OPT}\end{aligned}$$

At a high level, ignoring the negligible term \(\varepsilon \mathsf {OPT}\), Lemma 6.10 strengthens the classical bound via the extra term \(\langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \). This extra term is always non-positive since \(\widetilde{A}\le A\) coordinate-wise, and may be very negative in certain cases.

Proof of Lemma 6.10

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (u^*) = \mu \sum _{j=1}^m \left( e^{\frac{1}{\mu } (1-(A\mathsf {x}_k)_j)} - e^{\frac{1}{\mu } (1-(A u^*)_j)}\right) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }\mu \sum _{j=1}^m \left( e^{\frac{1}{\mu } (1-(A\mathsf {x}_k)_j)} - e^{\frac{1}{\mu } (1-(\widetilde{A}u^*)_j)}\right) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \mu \cdot m \cdot e^{-1/\mu }\\&\quad \overset{\textcircled {{\small 2}}}{\le }\sum _{j=1}^m e^{\frac{1}{\mu } (1-(A\mathsf {x}_k)_j)} \cdot \big ( (\widetilde{A}u^*)_j - (A \mathsf {x}_k)_j \big ) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \varepsilon \mathsf {OPT}\\&\quad = \sum _{j=1}^m p_j(\mathsf {x}_k) \cdot \big ( (\widetilde{A}u^*)_j - (A \mathsf {x}_k)_j \big ) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \varepsilon \mathsf {OPT}\\&\quad = \sum _{j=1}^m p_j(\mathsf {x}_k) \cdot \big ( (A u^*)_j - (A \mathsf {x}_k)_j \big ) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle \\&\qquad + \sum _{j=1}^m p_j(\mathsf {x}_k) \cdot \big ( (\widetilde{A}u^*)_j - (A u^*)_j \big ) + \varepsilon \mathsf {OPT}\\&\quad = \langle - A^T p(\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle + \varepsilon \mathsf {OPT}. \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)is because if \((A u^*)_j \ne (\widetilde{A}u^*)_j\) for some j, then it must satisfy \((\widetilde{A}u^*)_j = 2\), and therefore \(-e^{\frac{1}{\mu }(1-(A u^*)_j)} \le -e^{\frac{1}{\mu }(1-(\widetilde{A}u^*)_j)} + e^{-1/\mu }\). \(\textcircled {{\small 2}}\)uses the convexity inequality \(e^b - e^a \le e^b \cdot (b-a)\), and the fact that \(\mu m e^{-1/\mu } \ll \varepsilon \mathsf {OPT}\). \(\square \)

6.4 Step 2: gradient truncation

For analysis purposes, let us separate the indices \(i\in [n]\) into large and small ones.

Definition 6.11

We make the following definitions.

  • Let \(B_k {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ i \in [n] \,:\, \nabla _i f_\mu (\mathsf {x}_k) < -\beta \}\) and \([n] \setminus B_k\) be the sets of large and small indices, respectively.

  • Let \(\xi _{k} \in [-\beta ,1]^n\) be the truncated gradient so that \(\xi _{k,i} = \mathbb {T}^\mathsf {c}(\nabla _i f_\mu (x_k))\) for each \(i\in [n]\).

  • Let \(\eta _{k} \in (-\infty ,0]^n\) be the large gradient so that \(\nabla f_\mu (\mathsf {x}_k) = \xi _k + \eta _k\). It is clear that

    $$\begin{aligned} \eta _{k,i} = 0\quad \text {for every}\quad i\not \in B_k, \quad \text {and}\quad \eta _{k,i} = (1+\beta ) - (A^T p(\mathsf {x}_k))_i\quad \text {for every}\quad i\in B_k. \end{aligned}$$
  • Let \(\widetilde{\eta }_k \in (-\infty ,\infty )^n\) be the adjusted large gradient so that

    $$\begin{aligned} \widetilde{\eta }_{k,i} = 0\quad \text {for every}\quad i\not \in B_k, \quad \text {and}\quad \widetilde{\eta }_{k,i} = (1+\beta ) - (\widetilde{A}^T p(\mathsf {x}_k))_i\quad \text {for every}\quad i\in B_k. \end{aligned}$$

We denote by \(\eta _k^{(i)} = (0,\dots ,0,\eta _{k,i},0,\dots ,0)\), the vector that is zero at all coordinates other than i, and similarly \(\xi _k^{(i)} = (0,\dots ,\xi _{k,i},\dots ,0)\) and \(\widetilde{\eta }_k^{(i)}=(0,\dots ,\widetilde{\eta }_{k,i},\dots ,0)\). We emphasize that \(\eta _k^{(i)} \ne \eta _{k}\), \(\widetilde{\eta }_k^{(i)} \ne \widetilde{\eta }_{k}\), and \(\xi _k^{(i)} \ne \xi _{k}\).
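The decomposition of Definition 6.11 is simply a coordinate-wise clipping; a sketch, where `grad` is the full gradient vector \(\nabla f_\mu (\mathsf {x}_k)\) and all names are illustrative:

```python
def decompose_gradient(grad, beta):
    """Split grad into xi in [-beta, 1]^n (truncated part) and
    eta in (-inf, 0]^n (large part), with grad = xi + eta (sketch)."""
    xi = [max(-beta, g) for g in grad]        # T^c applied coordinate-wise
    eta = [g - x for g, x in zip(grad, xi)]   # nonzero only on B_k
    large_indices = [i for i, g in enumerate(grad) if g < -beta]
    return xi, eta, large_indices
```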

The following key lemma is closely analogous to (3.1) in the packing LP analysis.

Lemma 6.12

(distance upper bound)

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (u^*) \le \frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \mathbf {E}_i \Big [ \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \rangle \Big ] \\&\qquad +\,\mathbf {E}_{i} \Big [ \langle n \widetilde{\eta }_k^{(i)}, - u^* \rangle \Big ] + \varepsilon \mathsf {OPT}. \end{aligned}$$

Note that if one uses \(\eta _k^{(i)}\) instead of \(\widetilde{\eta }_k^{(i)}\), then Lemma 6.12 becomes trivial to prove just like (3.1). The reason we can have the stronger term \(\widetilde{\eta }_k^{(i)}\) is precisely due to the distance adjustment Lemma 6.10.

Proof of Lemma 6.12

We derive the following sequence of inequalities:

$$\begin{aligned}&\big (f_\mu (\mathsf {x}_k) - f_\mu (u^*)\big ) - \varepsilon \mathsf {OPT}\\&\quad \overset{\textcircled {{\small 1}}}{\le }\langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad = \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \rangle + \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad \overset{\textcircled {{\small 2}}}{=} \frac{(1-\tau )}{\tau } \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {y}_{k-1} - \mathsf {x}_{k} \rangle + \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad \overset{\textcircled {{\small 3}}}{\le }\frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad = \frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \xi _k + \eta _k, \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }\frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \xi _k, \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k) - \eta _k, u^* \rangle \\&\quad \overset{\textcircled {{\small 5}}}{\le }\frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \xi _k, \mathsf {z}_{k-1} - u^* \rangle + \langle - \widetilde{\eta }_k, u^* \rangle \\&\quad = \frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \mathbf {E}_i \Big [ \langle n \xi _k^{(i)}, 
\mathsf {z}_{k-1} - u^* \rangle + \langle - n \widetilde{\eta }_k^{(i)}, u^* \rangle \Big ] . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)is due to Lemma 6.10. \(\textcircled {{\small 2}}\)is because \(\mathsf {x}_{k} = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\), which implies that \(\tau (\mathsf {x}_k - \mathsf {z}_{k-1}) = (1-\tau ) (\mathsf {y}_{k-1} - \mathsf {x}_k)\). \(\textcircled {{\small 3}}\)is by the convexity of \(f_\mu (\cdot )\). \(\textcircled {{\small 4}}\)is because \(\langle \eta _{k}, \mathsf {z}_{k-1} \rangle \le 0\), since \(\eta _{k} \le 0\) while \(\mathsf {z}_{k-1}\ge 0\). \(\textcircled {{\small 5}}\)needs some careful justification: for every \(i\not \in B_k\), we have \((\widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k))_i - \eta _{k,i} \le 0 - 0 = - \widetilde{\eta }_{k,i}\); for every \(i\in B_k\), we have

$$\begin{aligned} (\widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k))_i - \eta _{k,i}= & {} (\widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k))_i - \big ( (1+\beta ) - (A^T p(\mathsf {x}_k))_i \big ) \\= & {} - \big ( (1+\beta ) - (\widetilde{A}^T p(\mathsf {x}_k))_i \big ) = - \widetilde{\eta }_{k,i} , \end{aligned}$$

where the two equalities follow from the definitions of \(\eta _{k,i}\) and \(\widetilde{\eta }_{k,i}\). \(\square \)

6.5 Step 3: mirror descent guarantee

Our update \(\mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle (1+\gamma ) n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) is, by its definition, a mirror descent step [12]. We begin by explaining an attempt that is too weak for obtaining the \(\varepsilon ^{-1.5}\) convergence rate.

Using the classical theory, it is not hard to repeat the proof of Lemma 3.7—although changing the distance function from \(\Vert \cdot \Vert _A^2\) to \(V_x(y)\)—and obtain that, as long as \(\xi _{k,i}\) is in \([-1, +1]\) for each coordinate i, for every \(u \in \varDelta _{\mathsf {simplex}}\),

$$\begin{aligned} \mathbf {E}_i \big [ \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u \big \rangle \big ] \le V_{\mathsf {z}_{k-1}}(u) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}(u) \Big ] + O(\alpha _k^2 n ) \mathsf {OPT}. \end{aligned}$$

This inequality only yields the slower \(\varepsilon ^{-2}\) convergence rate; the quantity \(\pm 1\) is also known as the width parameter in the multiplicative-weight-update language [7].

In our lemma below, we make use of the fact that \(\xi _{k,i}\) lies in \([-\beta , +1] \subseteq [-1,+1]\). In essence, this allows us to replace the \(O(\alpha _k^2 n)\) factor with a better \(O(\alpha _k^2 \beta n)\) factor. We call this the negative-width technique.Footnote 12 Formally,

Lemma 6.13

(mirror descent) Denoting by \(\gamma {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}2\alpha _T n\), we have

$$\begin{aligned} \mathbf {E}_i \big [ \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \big \rangle \big ] \le V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] + 12 \mathsf {OPT}\cdot \gamma \alpha _k \beta . \end{aligned}$$

The proof is somewhat technical and included in “Appendix D.4”.

6.6 Step 4: gradient descent guarantee

We show that our gradient step never increases the objective, for every choice of i. In addition, it decreases the objective by an amount proportional to the adjusted large gradient \(\widetilde{\eta }_k^{(i)}\).

Lemma 6.14

(gradient descent) For every \(i\in [n]\), we have

  (a) \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), and

  (b) \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{\mu \beta }{12} \cdot \langle - \widetilde{\eta }_{k}^{(i)}, u^* \rangle .\)

The proof of Lemma 6.14 is quite technical and can be found in “Appendix D.4”.

At a high level, one would hope to prove that the gradient step decreases the objective by an amount proportional to the large gradient \(\eta _k^{(i)}\), rather than the adjusted large gradient \(\widetilde{\eta }_k^{(i)}\). If that were true, the entire proof structure of our covering LP convergence would become much closer to that of packing LP, and there would be no need to introduce the distance adjustment in Sect. 6.3, or the definitions of \(\widetilde{A}\) and \(\widetilde{\eta }\).

Unfortunately, if one replaces \(\widetilde{\eta }\) with \(\eta \) in the above lemma, the inequality becomes false. The reason is very similar to what we summarized in Sect. 6.3, and is related to the fact that the exponential penalty function is not Lipschitz smooth.

6.7 Step 5: putting all together

Combining Lemma 6.12, Lemma 6.13, and Lemma 6.14, we obtain that

$$\begin{aligned}&\alpha _k \big (f_\mu (\mathsf {x}_k) - f_\mu (u^*)\big ) - \alpha _k \varepsilon \mathsf {OPT}\\&\quad \le \frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \mathbf {E}_i \Big [ \alpha _k \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \rangle \Big ]\\&\qquad +\mathbf {E}_{i} \Big [ \alpha _k \langle n \widetilde{\eta }_k^{(i)}, - u^* \rangle \Big ] \\&\quad \le \frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] \\&\qquad + 12 \mathsf {OPT}\cdot \gamma \alpha _k \beta + \mathbf {E}_i \Big [ \frac{12\alpha _k n}{\mu \beta } \big (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\big ) \Big ] \end{aligned}$$

Remark 6.15

Above, the quantity “\(12 \mathsf {OPT}\cdot \gamma \alpha _k \beta \)” is the loss term introduced by the mirror descent. Unlike the packing LP case—see (3.2)—this loss term is not dominated by the gradient step. (If it were, \(CovLPSolver\) would enjoy an \(\varepsilon ^{-1}\) convergence rate.)

The quantity “\(\alpha _k \langle n \widetilde{\eta }_k^{(i)}, - u^* \rangle \)” is the loss introduced by the (adjusted) large gradient \(\widetilde{\eta }\), and is dominated by our gradient step progress owing to Lemma 6.14. This is similar to the packing LP case—see Lemma 3.10.

From here, let us use the special choice of \(\tau = \frac{\mu \beta }{12n}\). We obtain that

$$\begin{aligned}&- \alpha _k \big (f_\mu (u^*) + \varepsilon \mathsf {OPT}\big ) \\&\quad \le 12\gamma \alpha _k \beta \mathsf {OPT}+ \frac{(1-\tau )\alpha _k}{\tau } f_\mu (\mathsf {y}_{k-1}) + V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big )\\&\qquad - \mathbf {E}_i \Big [ \frac{\alpha _k}{\tau } f_\mu (\mathsf {y}_k^{(i)}) + V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] . \end{aligned}$$

Using the choice \(\alpha _k = \frac{\alpha _{k-1}}{1-\tau }\) and telescoping the above inequality for \(k=1,\dots ,T\), we have

$$\begin{aligned} -\big (\sum _{k=1}^T \alpha _k \big ) \big (f_\mu (u^*) + \varepsilon \mathsf {OPT}\big )&\le \big (\sum _{k=1}^T \alpha _k\big ) \cdot 12 \gamma \beta \mathsf {OPT}+ \frac{\alpha _0}{\tau } f_\mu (\mathsf {y}_0)\\&\quad + V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) - \frac{\alpha _T}{\tau } \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]. \end{aligned}$$

We compute that \(\sum _{k=1}^T \alpha _k = \alpha _T \cdot \sum _{k=0}^{T-1} (1-\tau )^k = \alpha _T \cdot \frac{1-(1-\tau )^T}{\tau } < \frac{\alpha _T}{\tau }\), and recall that \(\gamma = 2\alpha _T n\). Therefore, we rearrange and get
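This geometric sum is easy to verify numerically; a sketch with illustrative parameter values (not the paper's exact choices of \(\tau \) and T):

```python
# alpha_k = alpha_{k-1} / (1 - tau)  implies  alpha_k = alpha_T * (1-tau)^(T-k),
# so sum_{k=1}^T alpha_k = alpha_T * (1 - (1-tau)^T) / tau < alpha_T / tau.
tau, T, alpha_T = 0.01, 500, 1.0
alphas = [alpha_T * (1.0 - tau) ** (T - k) for k in range(1, T + 1)]
total = sum(alphas)
closed_form = alpha_T * (1.0 - (1.0 - tau) ** T) / tau
```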

$$\begin{aligned} \frac{\alpha _T}{\tau } \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]&\le \frac{\alpha _T}{\tau } \big (f_\mu (u^*) + \varepsilon \mathsf {OPT}\big ) + \frac{\alpha _T}{\tau } \cdot 12 \gamma \beta \mathsf {OPT}+ \frac{\alpha _0}{\tau } f_\mu (\mathsf {y}_0) \nonumber \\&\quad + V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) , \nonumber \\ \implies \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]&\le f_\mu (u^*) + \varepsilon \mathsf {OPT}+ 24 \alpha _T n \beta \mathsf {OPT}+ (1-\tau )^T f_\mu (\mathsf {y}_0) \nonumber \\&\quad + \frac{\tau }{\alpha _T} V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) . \end{aligned}$$
(6.3)

From this point, we need to use our special choice of the initial point \(\mathsf {x}_0 = \mathsf {y}_0 = \mathsf {z}_0 = x^{\mathsf {start}}\) (see Proposition 6.2), which implies that \(f_\mu (\mathsf {y}_0) \le 4\mathsf {OPT}\) and \({\mathbbm {1}}^{T} x^{\mathsf {start}}\le 4\mathsf {OPT}\). We also have

$$\begin{aligned} V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) = V_{x^{\mathsf {start}}}\big (\frac{u^*}{1+\gamma }\big )&= \sum _{i=1}^n \frac{u^*_i}{1+\gamma } \log \frac{u^*_i}{(1+\gamma ) x^{\mathsf {start}}_i} + x^{\mathsf {start}}_{i} - \frac{u^*_i}{1+\gamma } \\&\overset{\textcircled {{\small 1}}}{\le }\sum _{i=1}^n u^*_i \log (u^*_i \cdot n) + 4\mathsf {OPT}\overset{\textcircled {{\small 2}}}{\le }(2\log (nm)+4) \cdot \mathsf {OPT}. \end{aligned}$$

Above, inequality \(\textcircled {{\small 1}}\)follows because \(x^{\mathsf {start}}_i \ge 1/n\) for all \(i\in [n]\) according to the definition in Proposition 6.2; inequality \(\textcircled {{\small 2}}\)follows because each \(u^*_i = (1+\varepsilon /2) x^*_i \le (1+\varepsilon /2)\mathsf {OPT}\le (1+\varepsilon /2)m\) and \({\mathbbm {1}}^{T} u^* = (1+\varepsilon /2)\mathsf {OPT}\), as well as the fact that \(\varepsilon \) is sufficiently small.

Finally, we choose \(\beta = \sqrt{\varepsilon }\), \(T = \lceil \frac{1}{\tau } \log (1/\varepsilon ) \rceil \), and \(\alpha _0\) such that \(\alpha _T = \frac{\varepsilon }{12n\beta }\). Substituting into (6.3) all of these parameters, along with the aforementioned inequalities \(f_\mu (\mathsf {y}_0) \le 4\mathsf {OPT}\) and \(V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) \le (2\log (nm)+4) \cdot \mathsf {OPT}\), as well as \(f_\mu (u^*) \le (1+\varepsilon )\mathsf {OPT}\) from Proposition 4.5.a, we obtain that

$$\begin{aligned} \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]\le & {} (1+\varepsilon )\mathsf {OPT}+ \varepsilon \mathsf {OPT}+ 2\varepsilon \mathsf {OPT}+ \varepsilon f_\mu (\mathsf {y}_0)\\&+ \frac{\mu \beta /12n}{\varepsilon /12n\beta } (2\log (nm)+4)\mathsf {OPT}\\\le & {} (1+9\varepsilon ) \mathsf {OPT}. \end{aligned}$$

This finishes the proof of Theorem 6.6. \(\square \)