1 Introduction

A packing linear program (LP) takes the form \(\max \{c^T x \,:\, Ax \le b\}\) where \(c \in \mathbb {R}_{\ge 0}^n\), \(b \in \mathbb {R}_{\ge 0}^m\), and \(A \in \mathbb {R}_{\ge 0}^{m \times n}\). A covering LP can be written as \(\min \{b^T y\,:\, A^T y \ge c\},\) with the same requirements on A, b, and c. We denote by N the number of non-zero elements in the matrix A. We assume without loss of generality that the two programs are in their standard forms:

$$\begin{aligned} \text {Packing LP:}&\quad \max _{x \in \mathbb {R}_{\ge 0}^n } \{ {\mathbbm {1}}^{T} x \,:\, Ax \le {\mathbbm {1}}\} , \end{aligned}$$
(1.1)
$$\begin{aligned} \text {Covering LP:}&\quad \min _{y \in \mathbb {R}_{\ge 0}^m } \{ {\mathbbm {1}}^{T} y \,:\, A^T y \ge {\mathbbm {1}}\} . \end{aligned}$$
(1.2)

The two programs are dual to each other, so we denote by \(\mathsf {OPT}\ge 0\) their shared optimum. We say x is a \((1-\varepsilon )\)-approximation for the packing LP if \(Ax \le {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\), and y a \((1+\varepsilon )\)-approximation for the covering LP if \(A^T y \ge {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} y \le (1+\varepsilon )\mathsf {OPT}\).
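To make these definitions concrete, here is a minimal numeric sketch on a hypothetical \(2\times 2\) instance (the matrix and its optimum below are ours, for illustration only; the optimum can be verified by checking that both constraints are tight at \(x^*\)):

```python
# Feasibility and (1 - eps)-approximation check for a tiny packing LP
# max 1^T x  s.t.  Ax <= 1, x >= 0, on a hypothetical 2x2 instance.
A = [[1.0, 2.0],
     [3.0, 1.0]]

def is_feasible(A, x, tol=1e-9):
    """Return True iff x >= 0 and Ax <= 1 (entrywise, up to tol)."""
    if any(xi < -tol for xi in x):
        return False
    return all(sum(a * xi for a, xi in zip(row, x)) <= 1 + tol for row in A)

# The optimum of this instance is x* = (1/5, 2/5), where both constraints
# are tight, so OPT = 3/5.
x_star, OPT = [0.2, 0.4], 0.6
assert is_feasible(A, x_star)

# Scaling x* down by (1 - eps) preserves feasibility and loses exactly an
# eps fraction of the objective, so it is a (1 - eps)-approximation.
eps = 0.1
x = [(1 - eps) * xi for xi in x_star]
assert is_feasible(A, x)
assert sum(x) >= (1 - eps) * OPT - 1e-12
```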

In this paper, we study first-order iterative methods for solving packing and covering linear programs (PC-LPs) efficiently. Of course, it is possible to adapt the Interior Point or Ellipsoid methods to obtain approximate solvers whose number of iterations depends only logarithmically on \(1/\varepsilon \). However, the computational cost of such algorithms is typically high, as each iteration requires solving a linear system, making them unsuitable for large-scale applications.

To address this issue, researchers have developed iterative approximate PC-LP solvers that achieve a better dependence on the problem size (e.g., nearly linear in N) at the cost of having a \(\textsf {poly}(1/\varepsilon )\) dependence on the approximation parameter \(\varepsilon \). Such iterative solvers have been widely applied in approximation algorithms (e.g., MinSetCover [24], MaxSet, MaxDiCut, Max-k-CSP [32], bipartite matching), probabilistically checkable proofs [32], zero-sum matrix games [29], scheduling [31], graph embedding [31], flow controls [10, 11], auction mechanisms [37], wireless sensor networks [14], and many other areas. In addition, techniques developed in this line of research have inspired important results on other fundamental algorithmic problems, such as the design of fast algorithms for multi-commodity flow problems [9, 18, 19, 25, 31] and the equivalence between QIP and PSPACE [21].

Table 1 Comparisons among iterative approximate solvers for packing and covering LPs

Previous iterative approximate solvers can be divided into two classes: width-dependent and width-independent solvers (see also Table 1).

Width-dependent solvers Based on multiplicative weight update ideas (a.k.a. exponentiated gradient updates), researchers have obtained solvers for PC-LPs whose running times are at least N multiplied by \(\rho \mathsf {OPT}\in [1,\infty )\), where \(\rho \) is the width of the program, i.e., the largest entry of the matrix A. For instance, PC-LPs can be solved in \(O(\frac{N \rho ^2 \mathsf {OPT}^2 \log m }{\varepsilon ^2})\) time [31], or in \(O(\frac{N \rho \mathsf {OPT}\log m }{\varepsilon ^2})\) time using a more refined analysis [7]. These algorithms only require “oracle access” to the matrix A. When A is given explicitly, as in this paper, the running time can be reduced to \(O(\frac{N \rho \mathsf {OPT}\log m}{\varepsilon })\) by deploying Nesterov’s accelerated gradient method [29] or Nemirovski’s mirror prox method [26].

Since the factor \(\rho \mathsf {OPT}\) can be arbitrarily large, width-dependent algorithms run in pseudo-polynomial, rather than polynomial, time.

Width-independent, but super linear-time solvers Researchers have also tried to appropriately scale the matrix so as to avoid the width penalty in the above methods. For instance, Bienstock and Iyengar [13] built on Nesterov’s method [29] and obtained a running time of \(O(\varepsilon ^{-1} N \sqrt{Kn\log m})\), where K is the maximum number of non-zeros per row of A. This is \(O(\varepsilon ^{-1} N n \sqrt{\log m})\) in the worst case. The results of [15, 27] improved this complexity (for packing LP only) to \(\widetilde{O}(\varepsilon ^{-1} N \sqrt{n})\), at the cost of an \(\widetilde{O}(N n)\)-time preprocessing stage.

Width-independent, nearly linear-time solvers Perhaps the most desirable complexity is a running time that is both independent of the width parameter \(\rho \) and nearly linear in N. This line of research was initiated by a seminal paper of Luby and Nisan [24], who gave an algorithm running in \(O\big (\frac{ N \log ^2 N }{\varepsilon ^4}\big )\) time with no dependence on the width \(\rho \). This was the first nearly linear-time approximate solver for PC-LPs, and also the first to run in parallel with nearly linear work and polylogarithmic depth.

The parallel algorithm of Luby and Nisan was extended by [4, 8, 10, 33, 35]. Most notably, the algorithm of Wang et al. [33] runs in \(O(\frac{\log ^2 N \log (1/\varepsilon )}{\varepsilon ^2})\) iterations, each costing a matrix-vector multiplication that can be implemented in O(N) total work.

The ideas of Luby and Nisan also led to sequential width-independent, nearly linear-time PC-LP solvers [10, 11, 23, 35, 36]. Most notably, the algorithm of Koufogiannakis and Young [23] runs in time \(O\big (N + \frac{ \log N }{\varepsilon ^2} \times (n + m)\big )\).

Despite the amount of work in this area, the \(O(1/\varepsilon ^2)\) convergence rate was established in 1997 [10, 11] and has not been improved since. On a separate note, Klein and Young [22] showed that all Dantzig–Wolfe type algorithms must suffer an \(\varOmega (1/\varepsilon ^2)\) convergence rate. This lack of progress constitutes a significant limitation, as the \(\varepsilon ^{-2}\) dependence (also known as \(1/\sqrt{T}\) convergence) on the approximation parameter \(\varepsilon \) is particularly poor.

1.1 Our results

Packing LP We present an algorithm \(PacLPSolver\) that runs in \(O(\frac{\log (nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon } N) \) total time. This gives the first width-independent, nearly linear-time solver for packing LP with \(\varepsilon ^{-1}\) convergence (i.e., a 1/T convergence). In contrast, before our work no nearly linear-time algorithm had achieved any convergence rate faster than \(\varepsilon ^{-2}\).

Interestingly, maximum (weighted) bipartite matching is a special case of a packing LP. As a consequence, our \(PacLPSolver\) algorithm finds an approximate maximum bipartite matching in time \(\widetilde{O}(m \varepsilon ^{-1})\). This new matching algorithm, which arises purely from convex-optimization arguments, matches the running time of the best known combinatorial algorithm for maximum weighted bipartite matching [16].

Covering LP A symmetric design of \(PacLPSolver\) gives rise to an algorithm \(CovLPSolver^{\mathsf {wb}}\) with the same running time \(O(\frac{\log (nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon } N)\), but solving only well-behaved covering LP instances. At a high level, we say an instance is well-behaved if the constraint \(A^T y \ge {\mathbbm {1}}\) is “never redundant”: for instance, if the optimal solution \(y^*\) satisfies \(C \cdot {\mathbbm {1}}\ge A^T y^* \ge {\mathbbm {1}}\) for some constant \(C>1\), then the covering LP is well-behaved. For general covering LPs without the well-behavedness assumption, we propose a different algorithm \(CovLPSolver\) that runs in time \(O(\frac{\log (nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon ^{1.5}} N)\). Again, we emphasize that before our work no nearly linear-time covering LP solver had achieved a convergence rate faster than \(\varepsilon ^{-2}\) (or equivalently, \(O(1/\sqrt{T})\)).

Remark After the first version of this paper appeared on arXiv in 2014, Wang, Rao and Mahoney [34] showed that all covering LPs can be converted into well-behaved ones by blowing up the problem size logarithmically. In other words, they obtained a nearly linear-time covering LP solver with \(\varepsilon ^{-1}\) convergence by a reduction to \(CovLPSolver^{\mathsf {wb}}\). Nevertheless, our \(CovLPSolver\), being a direct method, may still be of practical and theoretical interest.

1.2 Main challenge and our approach

Width-independence versus acceleration Previous solvers for PC-LPs are based on standard techniques in non-smooth optimization. They first implicitly or explicitly smooth the objective, often via the entropy regularizer. Then, they minimize the resulting convex objective either via variants of full-gradient methods, yielding parallel algorithms, or via variants of coordinate-gradient methods, yielding sequential algorithms. The main challenge in previous work was to show that the width dependence can sometimes be completely removed for PC-LPs, if the underlying minimization method is designed cleverly.

Of course, the slower the convergence rate, the easier it is to design nearly linear-time solvers. The \(\varepsilon ^{-4}\)-convergence solver of Awerbuch and Khandekar [8] and the \(\varepsilon ^{-3}\)-convergence solver of [4] are arguably the simplest nearly linear-time solvers known to date.

In this paper, we achieve the \(\varepsilon ^{-1}\) convergence that is typical for accelerated gradient descent over smoothed objectives [29], but without paying the width or any additional super-logarithmic factors. The challenge in this approach is to preserve width-independence and the accelerated rate at the same time. We stress here that our algorithm is not an instance of any known variant of accelerated gradient descent. Moreover, combining width-independence with Nesterov’s acceleration requires significant effort, as witnessed by the lack of progress on this problem over the last 15 years.

Our high-level approach Our approach is based on an improved convex formulation f(x) of the PC-LP objective, together with our linear-coupling framework for designing efficient first-order methods [5] for minimizing f(x).

The improved formulation shows that our smoothed objective f(x) satisfies either the classical condition of Lipschitz smoothness or a different condition based on multiplicative change. This formulation also clarifies why width-independent algorithms exist in the first place. See Lemma 2.7 and the related discussion for more details.

The linear-coupling framework in our previous work [5] provides a different interpretation of Nesterov’s acceleration for smooth optimization [28]. In a nutshell, this framework allows us to construct accelerated algorithms by coupling the executions of a gradient descent algorithm, yielding iterates \(\{\mathsf {y}_k\}\), and a mirror descent algorithm, yielding iterates \(\{\mathsf {z}_k\}\). The name “linear coupling” stems from the fact that, at iteration \(k+1,\) the gradient of the objective is queried at a point \(\mathsf {x}_{k+1}\) which is a linear combination of the gradient and mirror iterates, i.e., \(\mathsf {x}_{k+1} = \tau \cdot \mathsf {z}_k + (1-\tau ) \cdot \mathsf {y}_k\).

In this paper, we apply linear coupling in a very non-trivial manner. We design a gradient step and a mirror descent step, each very specific to the underlying PC-LP problem. We also perform a coupling step \(\mathsf {x}_{k+1} = \tau \cdot \mathsf {z}_k + (1-\tau ) \cdot \mathsf {y}_k\), but need a different analysis to preserve width-independence. None of these components appeared in [5].

Arithmetic precision Throughout this paper, we assume exact arithmetic operations for presenting the cleanest proofs. If the updates are calculated within precision \(\frac{1}{\textsf {poly}(\varepsilon ^{-1},n,m)}\), or equivalently when word size \(O(\log (\varepsilon ^{-1}+n+m))\) is used, our results still hold.

Roadmap We relax the packing LP in Sect. 2, and provide our packing LP solver in Sect. 3. We relax the covering LP in Sect. 4, and provide our covering LP solver in the well-behaved case in Sect. 5. In Sect. 6, we provide our full covering LP solver.

2 Relaxation of the packing linear program

To solve the packing LP, we minimize a relaxed version of the original program, in which the hard constraint \(Ax \le {\mathbbm {1}}\) is replaced by an exponential penalty function arising from entropy regularization.

Notations Recall that the packing LP in its standard form is \( \max _{x \ge 0} \{ {\mathbbm {1}}^{T} x \,:\, Ax \le {\mathbbm {1}}\}\). Let us denote by \(\mathsf {OPT}\) the optimal value of this linear program, and \(x^*\) any optimal solution. We say that x is a \((1-\varepsilon )\)-approximation for the packing LP if \(Ax \le {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\).

Throughout this paper, we use the indices \(i\in [n]\) to denote the columns of A, and the indices \(j\in [m]\) to denote the rows of A. We let \(A_{:i}\) be the i-th column vector of A, and \(A_{j :}\) the j-th row vector of A. Given any vector x, we denote by \(\Vert x\Vert _A = \sqrt{\sum _{i\in [n]} x_i^2 \cdot \Vert A_{:i}\Vert _{\infty }}\) the A-norm of x. By simple scaling, we can assume without loss of generality that

$$\begin{aligned} \min _{i\in [n]}\{\Vert A_{:i}\Vert _{\infty }\}=1 . \end{aligned}$$
(2.1)

We restrict the domain of x and the range of \(\mathsf {OPT}\) as follows.

Fact 2.1

Define the bounding box \(\varDelta _{\mathsf {box}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ x \in \mathbb {R}^n \,:\, x_i \in \big [0, \frac{1}{\Vert A_{:i}\Vert _{\infty }}\big ] \}\). Under assumption (2.1), we have \(\mathsf {OPT}\in [1,n]\) and \(\{x : x\ge 0 \wedge Ax \le {\mathbbm {1}}\} \subseteq \varDelta _{\mathsf {box}}.\)

Proof

Suppose that \(i^*\) is the column achieving the smallest infinity norm \(\Vert A_{:i}\Vert _{\infty }\) over all columns. Letting x be the vector with \(x_i=1\) for \(i=i^*\) and \(x_i=0\) for \(i\ne i^*\), we claim that x is a feasible solution for the packing LP (1.1), simply because \(\Vert A_{:i^*}\Vert _{\infty } = 1\) according to (2.1). This feasible solution x yields an objective value \({\mathbbm {1}}^{T} x = 1\), proving that \(\mathsf {OPT}\ge 1\). On the other hand, for any solution \(x \ge 0\) satisfying \(Ax \le {\mathbbm {1}}\), we must have \(x_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) for each i. Therefore, using \(\Vert A_{:i}\Vert _{\infty }\ge 1\) from (2.1), we have \({\mathbbm {1}}^{T} x \le \sum _i \frac{1}{\Vert A_{:i}\Vert _{\infty }} \le n\), proving that \(\mathsf {OPT}\le n\).

The inclusion \(\{x : x\ge 0 \wedge Ax \le {\mathbbm {1}}\} \subseteq \varDelta _{\mathsf {box}}\) is obvious, since the constraints \(x\ge 0\) and \(Ax \le {\mathbbm {1}}\) together imply \(x_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) for every \(i\in [n]\). \(\square \)

This bounding-box constraint allows us to focus only on searching x in \(\varDelta _{\mathsf {box}}\).

Our regularized objective We now introduce the smoothed objective \(f_\mu (x)\) that we minimize over \(\varDelta _{\mathsf {box}}\) in order to approximately solve packing LP. At a high level, this objective \(f_\mu (x)\) turns each row of the hard, non-smooth LP constraint \(Ax \le {\mathbbm {1}}\) into an exponential penalty function so that we only need to require \(x \in \varDelta _{\mathsf {box}}\) throughout the algorithm.

Formally, the packing LP can be written as the following minimization problem by introducing the Lagrangian variable \(y \in \mathbb {R}^m\):

$$\begin{aligned}&\min _{x \in \varDelta _{\mathsf {box}}} \big \{ - {\mathbbm {1}}^{T} x + \max _{y \ge 0} \{y^T A x - {\mathbbm {1}}^{T}y \} \big \} . \end{aligned}$$
(2.2)

The problem can now be smoothed by introducing a concave regularizer over \(y \ge 0\). We take this regularizer to be the generalized entropy \(H(y) = \sum _{j=1}^m (- y_j \log y_j + y_j)\) over the nonnegative orthant \(y\ge 0\), and minimize the following smoothed objective \(f_\mu (x)\) over \(x \in \varDelta _{\mathsf {box}}\):

$$\begin{aligned} f_\mu (x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}- {\mathbbm {1}}^{T} x + \max _{y \ge 0} \{y^T A x - {\mathbbm {1}}^{T}y + \boxed {\mu \cdot H(y)} \} . \end{aligned}$$
(2.3)

Above, \(\mu > 0\) is some smoothing parameter to be chosen later. By explicitly computing the maximization over \(y\ge 0\), \(f_\mu (x)\) can be rewritten as

Fact 2.2

\( f_\mu (x) = \mu \sum _{j=1}^m e^{\frac{1}{\mu } ((Ax)_j - 1)} - {\mathbbm {1}}^{T} x .\)

We study the minimization of \(f_\mu (x)\) over \(x \in \varDelta _{\mathsf {box}}\). Intuitively, \(f_\mu (x)\) captures the original packing LP (1.1) as follows. Firstly, since we want to maximize \({\mathbbm {1}}^{T} x\), the negative term \(-{\mathbbm {1}}^{T} x\) shows up in \(f_\mu (x)\). Secondly, if a packing constraint \(j\in [m]\) is violated by \(\varepsilon \), that is, \((Ax)_j \ge 1+\varepsilon \), the exponential in \(f_\mu (x)\) introduces a penalty of at least \(\mu e^{\varepsilon /\mu }\); this is a large penalty whenever \(\mu \le O(\varepsilon / \log (n/\varepsilon ))\).
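The closed form in Fact 2.2 is easy to evaluate directly. The sketch below (with a hypothetical matrix of ours, normalized so that (2.1) holds, and \(\mu \) set as in Proposition 2.4) illustrates the two regimes just described: a strictly feasible point pays an exponentially small penalty, while violating a constraint by \(\varepsilon \) makes the penalty term dominate.

```python
import math

def f_mu(A, x, mu):
    """Fact 2.2: f_mu(x) = mu * sum_j exp(((Ax)_j - 1)/mu) - 1^T x."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return mu * sum(math.exp((axj - 1.0) / mu) for axj in Ax) - sum(x)

# Hypothetical 2x2 instance; every column has infinity norm exactly 1,
# so assumption (2.1) holds.
A = [[1.0, 0.5],
     [0.25, 1.0]]
n = m = 2
eps = 0.1
mu = eps / (4 * math.log(n * m / eps))  # the choice in Proposition 2.4

x_feas = [0.5, 0.5]        # Ax = (0.75, 0.625): strictly feasible
x_viol = [1.0 + eps, 0.0]  # (Ax)_1 = 1 + eps: violated by eps

assert f_mu(A, x_feas, mu) < -0.99   # essentially -1^T x: tiny penalty
assert f_mu(A, x_viol, mu) > 1.0     # penalty mu * e^(eps/mu) dominates
```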

Remark 2.3

The use of the exponential penalty function traces back at least to [31] in 1991 (implicitly) and to [20] in 1994 (explicitly). The way most previous results minimize \(f_\mu (x)\) is by taking a logarithm \(g(x) = \log \big ( \sum _{j=1}^m e^{((Ax)_j - 1)/\mu } \big )\), explicitly or implicitly arguing that g(x) is Lipschitz smooth (i.e., \(\Vert \nabla ^2 g(x)\Vert \) is bounded), and then applying gradient descent. Unfortunately, the Lipschitz smoothness parameter of g(x) depends on the width of the LP, and thus first-order iterative approaches based on directly minimizing g(x) are mostly width-dependent [7, 26, 29, 31]. One can also reduce the width parameter of g(x), which yields super linear-time solvers [13, 15, 27].

In this paper, we directly perform gradient descent and mirror descent on \(f_\mu (x)\), without taking the logarithm. Note that traditional accelerated gradient methods [28, 29] cannot be applied directly to minimize \(f_\mu \), because \(f_\mu \) is not Lipschitz smooth.

Our \(f_\mu (x)\) incurs a regularization error. The next proposition bounds this error, following a treatment similar to that of [4].

Proposition 2.4

Let \(\mu = \frac{\varepsilon }{4\log (nm/\varepsilon )}\) and recall \(x^*\) is an optimal solution for packing LP.

(a) \(f_\mu (u^*) \le - (1-\varepsilon )\mathsf {OPT}\) for \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1-\varepsilon /2) x^* \in \varDelta _{\mathsf {box}}\).

(b) \(f_\mu (x) \ge -(1+\varepsilon )\mathsf {OPT}\) for every \(x \in \varDelta _{\mathsf {box}}\).

(c) If \(x\in \varDelta _{\mathsf {box}}\) satisfies \(f_\mu (x) \le -(1-\theta )\mathsf {OPT}\) for some \(\theta \in [0,1]\), then \(\frac{1}{1+\varepsilon } x\) is a \(\frac{1-\theta }{1+\varepsilon }\)-approximate solution to the packing LP.
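These bounds can also be verified numerically. Below is a minimal sketch on a hypothetical instance of ours satisfying (2.1), whose optimum \(x^* = (4/7, 6/7)\) with \(\mathsf {OPT}= 10/7\) can be checked by hand (both rows are tight at \(x^*\)):

```python
import math

A = [[1.0, 0.5],
     [0.25, 1.0]]              # hypothetical; both columns have inf-norm 1
x_star, OPT = [4/7, 6/7], 10/7  # optimum of this instance (both rows tight)
n = m = 2
eps = 0.1
mu = eps / (4 * math.log(n * m / eps))

def f_mu(x):
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return mu * sum(math.exp((axj - 1.0) / mu) for axj in Ax) - sum(x)

# (a): u* = (1 - eps/2) x* satisfies f_mu(u*) <= -(1 - eps) OPT.
u = [(1 - eps / 2) * xi for xi in x_star]
assert f_mu(u) <= -(1 - eps) * OPT

# (b): f_mu stays above -(1 + eps) OPT at sample points of the box
# (here the box is [0,1]^2 since all column norms equal 1).
for x in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]):
    assert f_mu(x) >= -(1 + eps) * OPT

# (c): scaling u by 1/(1 + eps) gives a feasible, near-optimal point.
x_prime = [xi / (1 + eps) for xi in u]
Ax = [sum(a * xi for a, xi in zip(row, x_prime)) for row in A]
assert all(axj <= 1 + 1e-9 for axj in Ax)
```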

Remark 2.5

Our box constraint \(x \in \varDelta _{\mathsf {box}}\) is almost redundant for minimizing \(f_\mu (x)\): whenever \(x\ge 0\) and \(f_\mu (x) \le 0\), one should automatically have \(x_i \le \frac{1+\varepsilon }{\Vert A_{:i}\Vert _{\infty }}\). However, this constraint shall be used to make sure that our updates are always inside \(\varDelta _{\mathsf {box}}\).

Proof of Proposition 2.4

(a) We have \({\mathbbm {1}}^{T} u^* = (1-\varepsilon /2)\mathsf {OPT}\) by the definitions of \(u^*\) and \(\mathsf {OPT}\). From the feasibility \(A x^* \le {\mathbbm {1}}\) in the packing LP, we have \(A u^* - {\mathbbm {1}}\le -\varepsilon /2 \cdot {\mathbbm {1}}\), and can compute \(f_\mu (u^*)\) as follows:

$$\begin{aligned} f_\mu (u^*)&= \mu \sum _j e^{\frac{1}{\mu } ((A u^*)_j - 1)} - {\mathbbm {1}}^{T} u^*\\&\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} - (1-\varepsilon /2)\mathsf {OPT}\\&\le \frac{\mu m}{(nm)^2} - (1-\varepsilon /2)\mathsf {OPT}\le -(1-\varepsilon )\mathsf {OPT}. \end{aligned}$$
(b) Suppose towards contradiction that \(f_\mu (x) < -(1+\varepsilon )\mathsf {OPT}\). Since \(f_\mu (x) > -{\mathbbm {1}}^{T} x\), it must satisfy \({\mathbbm {1}}^{T} x > (1+\varepsilon )\mathsf {OPT}\). Suppose that \({\mathbbm {1}}^{T} x = (1+v)\mathsf {OPT}\) for some \(v>\varepsilon \). By the definition of \(\mathsf {OPT}\), the inequality \(A x < (1+v) {\mathbbm {1}}\) must fail (otherwise, scaling x down by \(\max _j (Ax)_j < 1+v\) would yield a feasible solution of value greater than \(\mathsf {OPT}\)), and therefore there exists some \(j \in [m]\) satisfying \((Ax)_j \ge 1+v\). In such a case, the objective

$$\begin{aligned} f_\mu (x)\ge & {} \mu e^{v/\mu } - (1+v)\mathsf {OPT}= \frac{\varepsilon }{4\log (nm/\varepsilon )} \left( \left( \frac{nm}{\varepsilon }\right) ^4\right) ^{v/\varepsilon } - (1+v)\mathsf {OPT}\\\ge & {} \left( \left( \left( \frac{nm}{\varepsilon }\right) ^2\right) ^{v/\varepsilon } - (1+v) \right) \mathsf {OPT}> 0 \end{aligned}$$

giving a contradiction to the assumption that \(f_\mu (x)<0\).

(c) Note that x satisfies \(f_\mu (x)\le -(1-\theta )\mathsf {OPT}\le 0\), and we first show \(Ax \le (1+\varepsilon ){\mathbbm {1}}\). Let us assume that \(v = \max _j ((Ax)_j - 1) \ge 0\), because otherwise we would already have \(Ax \le {\mathbbm {1}}\). Under this definition, we have \(Ax \le (1+v) {\mathbbm {1}}\) and therefore \({\mathbbm {1}}^{T} x \le (1+v)\mathsf {OPT}\) by the definition of \(\mathsf {OPT}\). We compute \(f_\mu (x)\) as follows.

$$\begin{aligned}&f_\mu (x) \ge \mu e^{\frac{v}{\mu }} - (1+v)\mathsf {OPT}\ge \mu \left( \left( \frac{nm}{\varepsilon }\right) ^4\right) ^{v/\varepsilon } - (1+v)n \\&\quad = \frac{\varepsilon }{4\log (nm/\varepsilon )} \left( \left( \frac{nm}{\varepsilon }\right) ^4\right) ^{v/\varepsilon } - (1+v)n . \end{aligned}$$

The above quantity is positive whenever \(v \ge \varepsilon \), and therefore, to satisfy \(f_\mu (x)\le 0\) we must have \(v\le \varepsilon \), which is equivalent to \(Ax \le (1+\varepsilon ){\mathbbm {1}}\). Next, because \(-{\mathbbm {1}}^{T} x \le f_\mu (x) \le -(1-\theta )\mathsf {OPT}\), we know \({\mathbbm {1}}^{T} x \ge (1-\theta )\mathsf {OPT}\). Letting \(x' = \frac{1}{1+\varepsilon } x\), we have both that \(x'\) is feasible (i.e., \(Ax' \le {\mathbbm {1}}\)) and that \(x'\) has an objective \({\mathbbm {1}}^{T} x'\) at least \( \frac{1-\theta }{1+\varepsilon }\mathsf {OPT}\).

\(\square \)

Some non-standard smoothness properties The gradient and Hessian of \(f_\mu (x)\) can be written in the following closed forms:

Fact 2.6

\(\nabla f_\mu (x) = A^T p(x) - {\mathbbm {1}}\) and \(\nabla ^2 f_\mu (x) = \frac{1}{\mu } A^T \mathrm {diag}\{p(x)\}A\), where \(p_j(x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}e^{\frac{1}{\mu } ((Ax)_j - 1)}\).
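The closed-form gradient of Fact 2.6 can be checked numerically against central finite differences. A self-contained sketch on a hypothetical small instance (matrix, point, and \(\mu \) chosen by us for illustration):

```python
import math

def f_mu(A, x, mu):
    """Fact 2.2: f_mu(x) = mu * sum_j exp(((Ax)_j - 1)/mu) - 1^T x."""
    Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return mu * sum(math.exp((axj - 1.0) / mu) for axj in Ax) - sum(x)

def grad_f_mu(A, x, mu):
    """Fact 2.6: nabla f_mu(x) = A^T p(x) - 1,
    with p_j(x) = exp(((Ax)_j - 1)/mu)."""
    m, n = len(A), len(A[0])
    Ax = [sum(A[j][i] * x[i] for i in range(n)) for j in range(m)]
    p = [math.exp((Ax[j] - 1.0) / mu) for j in range(m)]
    return [sum(A[j][i] * p[j] for j in range(m)) - 1.0 for i in range(n)]

# Sanity check against a central finite difference.
A = [[1.0, 0.5], [0.25, 1.0]]
x, mu, h = [0.6, 0.7], 0.05, 1e-6
g = grad_f_mu(A, x, mu)
for i in range(2):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    fd = (f_mu(A, xp, mu) - f_mu(A, xm, mu)) / (2 * h)
    assert abs(g[i] - fd) < 1e-4
```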

By staring at these closed forms, we note that \(f_\mu (x)\) is not Lipschitz-smooth: for instance, each \(\nabla ^2_{ii} f_\mu (x)\) can go to infinity so the spectral norm of \(\nabla ^2 f_\mu (x)\) is unbounded. However, the non-negativity of A guarantees that whenever \(\nabla ^2_{ii} f_\mu (x)\) is large for some coordinate i, the corresponding entry of the gradient \(\nabla _i f_\mu (x)\) must also be large. This still allows us to take a larger step in direction \(\mathbf {e}_i\) than traditionally allowed by coordinate descent.

The above intuition is formalized in the next lemma, whose proof is a simple manipulation of the Hessian. The first half of the lemma is the same as the traditional coordinate Lipschitz-smoothness property, but holds only conditionally; the second half is a salient characteristic of this work and requires the non-negativity of A. These smoothness properties will be crucial in applying the gradient descent arguments of Sect. 3.3, and are the main motivation for us to adopt the \(\Vert \cdot \Vert _A\) norm in our proposed algorithms.

Lemma 2.7

Let \(L {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{4}{\mu }.\) Then, for every \(x \ge 0\), every \(i \in [n]\), and every \(\lambda \in \big [- \frac{1}{L \Vert A_{:i}\Vert _{\infty }} , \frac{1}{L \Vert A_{:i}\Vert _{\infty }}\big ]\):

(a) If \(|\nabla _i f_\mu (x)| \le 1,\) then \( \big |\nabla _i f_\mu (x + \lambda \mathbf {e}_i) - \nabla _i f_\mu (x) \big | \le L \Vert A_{:i}\Vert _{\infty }\cdot | \lambda | . \)

(b) If \(\nabla _i f_\mu (x) \ge 1,\) then \( \nabla _i f_\mu (x + \lambda \mathbf {e}_i) \ge \left( 1-\frac{\Vert A_{:i}\Vert _{\infty }L}{2} |\lambda |\right) \nabla _i f_\mu (x) . \)

Proof of Lemma 2.7

Using the fact that \(\nabla _i f_\mu (x) > -1\) for all x,  we have:

$$\begin{aligned} \Big | \log \frac{\nabla _i f_\mu (x + \lambda \mathbf {e}_i) +1}{\nabla _i f_\mu (x)+1} \Big |&\overset{\textcircled {{\small 1}}}{=} \Big | \int ^{\lambda }_0 \frac{\nabla ^2_{ii} f_\mu (x+\nu \mathbf {e}_i)}{\nabla _i f_\mu (x+\nu \mathbf {e}_i) +1} d \nu \Big | \\&\overset{\textcircled {{\small 2}}}{=} \frac{1}{\mu } \Big | \int ^{\lambda }_0 \frac{(A^T \mathrm {diag}\{p(x + \nu \mathbf {e}_i)\} A)_{ii}}{(A^T p(x + \nu \mathbf {e}_i))_i}d \nu \Big |\\&\overset{\textcircled {{\small 3}}}{\le }\frac{\Vert A_{:i}\Vert _{\infty }}{\mu } | \lambda | \overset{\textcircled {{\small 4}}}{=} \frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda | . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\) holds because \(\int _{0}^\lambda g'(\nu ) d \nu = g(\lambda ) - g(0)\) where \(g(\nu ) = \log (\nabla _i f_\mu (x+\nu \mathbf {e}_i)+1)\); \(\textcircled {{\small 2}}\) holds according to Fact 2.6; \(\textcircled {{\small 3}}\) holds because the numerator is \(\sum _j A_{j,i}^2 p_j\) while the denominator is \(\sum _j A_{j,i} p_j\), and \(A_{j,i} \le \Vert A_{:i}\Vert _{\infty }\) for every j; \(\textcircled {{\small 4}}\) holds because \(L = \frac{4}{\mu }\). This immediately implies

$$\begin{aligned} e^{-\frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda |} \le \frac{\nabla _i f_\mu (x + \lambda \mathbf {e}_i) +1}{\nabla _i f_\mu (x)+1} \le e^{\frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda |}. \end{aligned}$$

Our assumption on \(\lambda \) implies \(\frac{\Vert A_{:i}\Vert _{\infty }L}{4} | \lambda | \le \frac{1}{4},\) so that we can use the bounds \(e^{-c} - 1 \ge -c\) and \(e^{c} - 1 \le 1.2 c\), valid for \(c \in [0, \frac{1}{4}]\). This yields the simpler bound:

$$\begin{aligned} - \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda | \le \frac{\nabla _i f_\mu (x + \lambda \mathbf {e}_i) - \nabla _i f_\mu (x)}{\nabla _i f_\mu (x)+1} \le 1.2 \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda |. \end{aligned}$$
(a) Assuming that \(\nabla _i f_\mu (x) \in (-1,1],\) we have:

    $$\begin{aligned} \Big |\nabla _i f_\mu (x + \lambda \mathbf {e}_i) - \nabla _i f_\mu (x)\Big | \le 2.4 \cdot \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda | \le \Vert A_{:i}\Vert _{\infty }L |\lambda | . \end{aligned}$$
(b) Assuming \(\nabla _i f_\mu (x) \ge 1\), we have

    $$\begin{aligned} \nabla _i f_\mu (x+\lambda \mathbf {e}_i)\ge & {} \nabla _i f_\mu (x) - \frac{\Vert A_{:i}\Vert _{\infty }L}{4} |\lambda | \left( \nabla _i f_\mu (x) + 1 \right) \\\ge & {} \left( 1-\frac{\Vert A_{:i}\Vert _{\infty }L}{2} |\lambda |\right) \nabla _i f_\mu (x) . \end{aligned}$$

    \(\square \)

Initialization Iterative methods require a starting point, and we use the following one:

Fact 2.8

Let \(x^{\mathsf {start}}_i {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{1-\varepsilon /2}{n\Vert A_{:i}\Vert _{\infty }}\) for each \(i\in [n]\). Then, \(x^{\mathsf {start}}\in \varDelta _{\mathsf {box}}\) and \(f_\mu (x^{\mathsf {start}}) \le -\frac{1-\varepsilon }{n}\).

Proof

Using the fact that \(Ax^{\mathsf {start}}- {\mathbbm {1}}\le -\varepsilon /2 \cdot {\mathbbm {1}}\), we compute \(f_\mu (x^{\mathsf {start}})\) as follows:

$$\begin{aligned} f_\mu (x^{\mathsf {start}})&= \mu \sum _j e^{\frac{1}{\mu } ((Ax^{\mathsf {start}})_j - 1)} - {\mathbbm {1}}^{T} x^{\mathsf {start}}\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} - \frac{1-\varepsilon /2}{n} \\&\le \frac{\mu m}{(nm)^2} - \frac{1-\varepsilon /2}{n} \le -\frac{1-\varepsilon }{n} . \end{aligned}$$

Above, we have used \({\mathbbm {1}}^{T} x^{\mathsf {start}}\ge x^{\mathsf {start}}_{i} = \frac{1-\varepsilon /2}{n}\), where i is the column s.t. \(\Vert A_{:i}\Vert _{\infty }=1\). \(\square \)

3 Our packing LP solver

Recall that traditional gradient descent (accelerated or not) [28, 29] and coordinate descent [6, 17, 30] should not be applied directly to minimize \(f_\mu \), because \(f_\mu \) is not Lipschitz-smooth.

Our proposed algorithm \(PacLPSolver\) starts with some initial vector \(\mathsf {x}_0 = \mathsf {y}_0 = x^{\mathsf {start}}\) (see Fact 2.8) and \(\mathsf {z}_0 = 0\), and is divided into T iterations. In each iteration k, it computes a weighted midpoint \(\mathsf {x}_k \leftarrow \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) for some parameter \(\tau \in (0,1)\). This step is analogous to that in traditional accelerated coordinate descent [6, 17, 30]. We then compute \(\mathsf {y}_k\) and \(\mathsf {z}_k\) as follows.

We select \(i\in [n]\) uniformly at random. Let \(\xi _k^{(i)} = (0,\dots ,0,\mathbb {T}^\mathsf {p}(v),0,\dots ,0)\) be the vector that is non-zero only at coordinate i, where \(v = \nabla _i f_\mu (\mathsf {x}_k) \in [-1,\infty )\), and \(\mathbb {T}^\mathsf {p}(v)\) is the thresholding function \(\mathbb {T}^\mathsf {p}(v) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\min \{v, 1\}\). We refer to \(\xi _k^{(i)}\) as the truncated gradient. Next,

  • Perform a mirror (descent) step: \(\mathsf {z}_k \leftarrow \mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) for some parameter \(\alpha _k \ll 1/n\) to be chosen later.

  • Perform a gradient (descent) step: \(\mathsf {y}_k \leftarrow \mathsf {y}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mathsf {x}_k + \frac{1}{n \alpha _k L} (\mathsf {z}_k^{(i)} - \mathsf {z}_{k-1})\).

This finishes the description of our \(PacLPSolver\).
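The loop above can be sketched in a few lines of pure Python. This is only a schematic rendering, not the paper's algorithm verbatim: the parameters \(\tau \), \(\alpha _k\) and T are illustrative placeholders (Algorithm 1 fixes their precise values, and \(\alpha _k\) actually varies with k), and each iteration here naively recomputes Ax in O(N) time instead of the O(N/n) implementation of Lemma 3.3.

```python
import math, random

def pac_lp_solver_sketch(A, eps, T, tau, alpha):
    """Schematic PacLPSolver iteration: weighted midpoint, truncated
    gradient, mirror step (Proposition 3.2), gradient step."""
    m, n = len(A), len(A[0])
    mu = eps / (4 * math.log(n * m / eps))
    L = 4.0 / mu
    col_inf = [max(A[j][i] for j in range(m)) for i in range(n)]  # ||A_{:i}||_inf
    # Starting point of Fact 2.8.
    x_start = [(1 - eps / 2) / (n * col_inf[i]) for i in range(n)]
    y, z = list(x_start), [0.0] * n
    for _ in range(T):
        # Weighted midpoint x_k = tau * z_{k-1} + (1 - tau) * y_{k-1}.
        x = [tau * z[i] + (1 - tau) * y[i] for i in range(n)]
        # Pick i uniformly at random; compute the truncated gradient.
        i = random.randrange(n)
        Ax = [sum(A[j][k] * x[k] for k in range(n)) for j in range(m)]
        p = [math.exp((Ax[j] - 1.0) / mu) for j in range(m)]
        grad_i = sum(A[j][i] * p[j] for j in range(m)) - 1.0
        xi = min(grad_i, 1.0)                      # T^p(v) = min{v, 1}
        # Mirror step: Proposition 3.2 with delta = n * alpha * xi, C = 1.
        zi_new = z[i] - (n * alpha * xi) / col_inf[i]
        zi_new = min(max(zi_new, 0.0), 1.0 / col_inf[i])
        # Gradient step: y_k = x_k + (z_k - z_{k-1}) / (n * alpha * L).
        y = list(x)
        y[i] += (zi_new - z[i]) / (n * alpha * L)
        z[i] = zi_new
    return y
```

With a small instance and placeholder parameters one can watch \(f_\mu \) evolve across iterations; the actual convergence guarantee of course requires the parameter schedule of Algorithm 1.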

Remark 3.1

We use the superscript \(^{(i)}\) on \(\xi _k^{(i)}\), \(\mathsf {y}_k^{(i)}\) and \(\mathsf {z}_k^{(i)}\) to emphasize that the value depends on the choice of i. We use generic parameters \(\tau ,\alpha _k,T\) in the above description and their precise values are presented in Algorithm 1.

Algorithm 1 (\(PacLPSolver\)) appears here as a figure.

Our update on \(\mathsf {y}_k\) is a “gradient descent step” because we shall prove that it never increases the objective (i.e., \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\ge 0\)). Our update on \(\mathsf {z}_k\) is a “mirror descent step” because we shall apply the standard mirror descent analysis [12] to it. We explicitly describe how to implement the mirror step (the proof is straightforward by computing the gradient):

Proposition 3.2

If \(\varDelta _{\mathsf {box}}= \{ x \in \mathbb {R}^n \,:\, x_i \in \big [0, \frac{C}{\Vert A_{:i}\Vert _{\infty }}\big ] \}\) for some constant \(C > 0\), the minimizer \(z = {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle \delta \mathbf {e}_i, z \rangle \big \}\) for any \(\delta \in \mathbb {R}\) and basis vector \(\mathbf {e}_i\) can be computed as follows:

1. \(z \leftarrow \mathsf {z}_{k-1}\).

2. \(z_i \leftarrow z_i - \delta /\Vert A_{:i}\Vert _{\infty }\).

3. If \(z_i < 0\), then \(z_i \leftarrow 0\); if \(z_i > C/\Vert A_{:i}\Vert _{\infty }\), then \(z_i \leftarrow C/\Vert A_{:i}\Vert _{\infty }\).

4. Return z.
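In code, the four steps of Proposition 3.2 amount to one coordinate update followed by a clipping (a sketch; `col_inf[i]` is our name for \(\Vert A_{:i}\Vert _{\infty }\)):

```python
def mirror_step(z_prev, i, delta, col_inf, C=1.0):
    """Proposition 3.2: argmin_{z in box} (1/2)||z - z_prev||_A^2 + <delta e_i, z>.
    Only coordinate i changes, since the objective is separable across
    coordinates."""
    z = list(z_prev)                              # step 1: copy z_{k-1}
    z[i] -= delta / col_inf[i]                    # step 2: unconstrained minimizer
    z[i] = min(max(z[i], 0.0), C / col_inf[i])    # step 3: clip to the box
    return z                                      # step 4
```

For instance, with \(\Vert A_{:i}\Vert _{\infty }= 2\) the unconstrained minimizer of \(\frac{1}{2}\cdot 2\,(z_i - 0.3)^2 + \delta z_i\) is \(z_i = 0.3 - \delta /2\), which the clipping then projects onto \([0, C/2]\).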

We also point out that

Lemma 3.3

Each iteration of \(PacLPSolver\) can be implemented to run in expected O(N / n) time. The total expected running time is O(TN / n).

Lemma 3.3 is not hard to prove, but is anyway included in “Appendix E.5”. It follows from standard implementation tricks which compute \(\mathsf {x}_k\) and \(\mathsf {y}_k\) only implicitly: that is, \(\mathsf {x}_k\) and \(\mathsf {y}_k\) are expressed as linear combinations of two less-frequently-updated vectors.

3.1 Convergence statement

In this section, we focus on proving the following main theorem.

figure b

It is straightforward to use Markov’s bound to turn Theorem 3.4 into a probabilistic statement:

figure c

Proof of Corollary 3.5

Since \(f_\mu (x) \ge -(1+\varepsilon )\mathsf {OPT}\) for every \(x\in \varDelta _{\mathsf {box}}\) according to Proposition 2.4.b, the quantity \(f_\mu (y_T) + (1+\varepsilon )\mathsf {OPT}\) is a non-negative random variable whose expectation satisfies \(\mathbf {E}[f_\mu (y_T) + (1+\varepsilon )\mathsf {OPT}] \le 4\varepsilon \) according to Theorem 3.4. By Markov’s bound, with probability at least 2 / 3 we obtain some \(y_T\) satisfying \(f_\mu (y_T) \le - (1-11\varepsilon )\mathsf {OPT}\), which yields a \((1-O(\varepsilon ))\)-approximate solution to the packing LP according to Proposition 2.4.c. The running time follows from Lemma 3.3. \(\square \)

Before we prove Theorem 3.4 in the subsequent subsections, let us first point out that our iterates \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\) never leave the bounding box \(\varDelta _{\mathsf {box}}\):

Lemma 3.6

We have \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\in \varDelta _{\mathsf {box}}\) for all \(k=0,1,\dots ,T\).

(The proof of Lemma 3.6 is included in “Appendix A.1”, and the main technique already appeared in randomized coordinate descent [17].)

3.2 Step 1: mirror descent guarantee

Following an essentially classical analysis of mirror descent (cf. textbook [12]), our update \(\mathsf {z}_k^{(i)} = {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) satisfies

Lemma 3.7

(mirror descent) For every \(u\in \varDelta _{\mathsf {box}}\), we have

$$\begin{aligned} \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - u \big \rangle \le n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 . \end{aligned}$$

Proof

Denote by \(V_a(b) = \frac{1}{2}\Vert b - a\Vert _A^2\) the function of \(b \in \varDelta _{\mathsf {box}}\) parameterized by \(a \in \varDelta _{\mathsf {box}}\); we have \(\nabla _i V_a(b) = \Vert A_{:i}\Vert _{\infty }\cdot (b_i - a_i)\). In the optimization language, \(V_a(b)\) is the Bregman divergence of the \(\Vert \cdot \Vert _A^2\) regularizer [12]. We derive the following sequence of inequalities:

$$\begin{aligned} \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - u \big \rangle&= \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + \big \langle n \alpha _k \xi _{k}^{(i)}, \mathsf {z}_k^{(i)} - u \big \rangle \\&\overset{\textcircled {{\small 1}}}{\le }\big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + \big \langle - \nabla V_{\mathsf {z}_{k-1}}(\mathsf {z}_k^{(i)}), \mathsf {z}_k^{(i)} - u \big \rangle \\&\overset{\textcircled {{\small 2}}}{=} \big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle - \frac{1}{2}\Vert \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)}\Vert _A^2 + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2\\&\quad - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \\&\overset{\textcircled {{\small 3}}}{=} n^2 \alpha _k^2 L \left( \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle - \frac{L}{2}\Vert \mathsf {x}_{k} - \mathsf {y}_k^{(i)}\Vert _A^2 \right) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 \\&\quad - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \\&\le n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)is due to the minimality of \(\mathsf {z}_k^{(i)} = {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{V_{\mathsf {z}_{k-1}}(z) + \big \langle n\alpha _k \xi _{k}^{(i)}, z \big \rangle \big \}\), which implies that \( \big \langle \nabla V_{\mathsf {z}_{k-1}}(\mathsf {z}_k^{(i)}) + n \alpha _k \xi _k^{(i)}, u - \mathsf {z}_k^{(i)} \big \rangle \ge 0\) for all \(u\in \varDelta _{\mathsf {box}}\). Equality \(\textcircled {{\small 2}}\)can be checked for every coordinate \(\ell \in [n]\) as follows:

$$\begin{aligned} - \nabla _\ell V_{\mathsf {z}_{k-1}}(\mathsf {z}_k^{(i)}) \cdot (\mathsf {z}_{k,\ell }^{(i)}-u_\ell )&= \Vert A_{:\ell }\Vert _{\infty }(\mathsf {z}_{k-1,\ell } - \mathsf {z}_{k,\ell }^{(i)}) \cdot (\mathsf {z}_{k,\ell }^{(i)}-u_\ell ) \\&= \Vert A_{:\ell }\Vert _{\infty }\left( - \frac{1}{2}(\mathsf {z}_{k-1,\ell }-\mathsf {z}_{k,\ell }^{(i)})^2 + \frac{1}{2}(u_\ell - \mathsf {z}_{k-1,\ell })^2 \right. \\&\quad \left. - \frac{1}{2}(\mathsf {z}_{k,\ell }^{(i)} - u_\ell )^2\right) . \end{aligned}$$

\(\textcircled {{\small 3}}\)is by our choice of \(\mathsf {y}_k\) which satisfies that \(\mathsf {z}_{k-1} -\mathsf {z}_k^{(i)} = n \alpha _k L (\mathsf {x}_k - \mathsf {y}_k^{(i)})\). \(\square \)

In addition, as a simple corollary of Proposition 3.2, we have the following fact:

Fact 3.8

\(|\mathsf {z}_{k,i}^{(i)} - \mathsf {z}_{k-1,i}| \le \frac{n \alpha _k |\xi _{k,i}^{(i)}|}{\Vert A_{:i}\Vert _{\infty }}\) and \(|\mathsf {y}_{k,i}^{(i)} - \mathsf {x}_{k,i}| = \frac{1}{n \alpha _k L} |\mathsf {z}_{k,i}^{(i)} - \mathsf {z}_{k-1,i}| \le \frac{|\xi _{k,i}^{(i)}|}{L\Vert A_{:i}\Vert _{\infty }} \le \frac{1}{L\Vert A_{:i}\Vert _{\infty }}\). If \(\xi _{k,i}^{(i)} \ge 0\), then \(\mathsf {z}_{k,i}^{(i)} \le \mathsf {z}_{k-1,i}\) and \(\mathsf {y}_{k,i}^{(i)} \le \mathsf {x}_{k,i}\); if \(\xi _{k,i}^{(i)} \le 0\), then \(\mathsf {z}_{k,i}^{(i)} \ge \mathsf {z}_{k-1,i}\) and \(\mathsf {y}_{k,i}^{(i)} \ge \mathsf {x}_{k,i}\).

3.3 Step 2: gradient descent guarantee

We call our update \(\mathsf {y}_k^{(i)} \leftarrow \mathsf {x}_k + \frac{1}{n \alpha _k L} (\mathsf {z}_k^{(i)} - \mathsf {z}_{k-1})\) a gradient descent step because the following lemma guarantees \(f_\mu (\mathsf {y}_k^{(i)}) \le f_\mu (\mathsf {x}_k)\), that is, the objective only decreases; moreover, the objective decreases by at least \(\frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle \).

Lemma 3.9

(gradient descent) We have \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \rangle \ge 0\).

Lemma 3.9, which is characteristic of the PC-LP setting, is strong in the following sense. Even though the update \(y_k^{(i)}\) only depends on the truncated gradient \(\xi ^{(i)}_{k}\), the progress we make is a function of the true gradient \(\nabla _i f_\mu (\mathsf {x}_k)\), including the large component that was discarded. This is possible because the smoothness guarantee of Lemma 2.7.b allows us to take a long coordinate step even though \(f_\mu (x)\) is not Lipschitz-smooth.

Proof of Lemma 3.9

Using Fact 3.8, we write \(\mathsf {y}_k^{(i)} = \mathsf {x}_k + s \lambda \mathbf {e}_i\) for some sign \(s\in \{-1,+1\}\) and step length \(\lambda \in \big [0, \frac{1}{L \Vert A_{:i}\Vert _{\infty }}\big ]\). We first focus on the case \(\nabla _i f_\mu (\mathsf {x}_k) \in [-1,1]\), so that \(\xi _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k)\).

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) =f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {x}_k + s \lambda \mathbf {e}_i) = - s \int _0^\lambda \left( \nabla _i f_\mu (\mathsf {x}_k + s \chi \mathbf {e}_i) \right) d \chi \\&\quad \overset{\textcircled {{\small 1}}}{\ge }s \int _0^\lambda \left( - \nabla _i f_\mu (\mathsf {x}_k) - L \Vert A_{:i}\Vert _{\infty }\cdot s \chi \right) d \chi = - \nabla _i f_\mu (\mathsf {x}_k) \cdot s \lambda - \frac{L \Vert A_{:i}\Vert _{\infty }}{2} \cdot \lambda ^2 \\&\quad \overset{\textcircled {{\small 2}}}{\ge }- \nabla _i f_\mu (\mathsf {x}_k) \cdot s \lambda - \frac{L \Vert A_{:i}\Vert _{\infty }}{2} \cdot \lambda \cdot \frac{|\xi _{k,i}^{(i)}|}{L\Vert A_{:i}\Vert _{\infty }} \overset{\textcircled {{\small 3}}}{=} - \frac{1}{2}\langle \nabla f_\mu (\mathsf {x}_k) , \mathsf {y}_k^{(i)} - \mathsf {x}_k \rangle . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)uses Lemma 2.7.a, \(\textcircled {{\small 2}}\)uses Fact 3.8, and \(\textcircled {{\small 3}}\)uses \(|\xi _{k,i}^{(i)}| = -s \nabla _i f_\mu (\mathsf {x}_k)\) (see also Fact 3.8). Next, we turn to the case of \(\nabla _i f_\mu (\mathsf {x}_k) > 1\). In this case, we have \(s=-1\) and

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) =f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {x}_k - \lambda \mathbf {e}_i) = \int _0^\lambda \nabla _i f_\mu (\mathsf {x}_k - \chi \mathbf {e}_i) d \chi \\&\quad \overset{\textcircled {{\small 1}}}{\ge }\int _0^\lambda \left( 1-\frac{\Vert A_{:i}\Vert _{\infty }L}{2} \chi \right) \nabla _i f_\mu (x) d \chi \\&\quad \overset{\textcircled {{\small 2}}}{\ge }\int _0^\lambda \frac{1}{2} \nabla _i f_\mu (x) d \chi = \frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)uses Lemma 2.7.b and \(\textcircled {{\small 2}}\)uses \(\chi \le \lambda \le \frac{1}{L \Vert A_{:i}\Vert _{\infty }}\). Finally, we have \(\langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle \ge 0\) because \(\nabla _i f_\mu (\mathsf {x}_k)\) and \(\mathsf {x}_{k,i} - \mathsf {y}_{k,i}^{(i)}\) have the same sign, and \(\mathsf {x}_{k,\ell } = \mathsf {y}_{k,\ell }^{(i)}\) for \(\ell \ne i\). \(\square \)

3.4 Step 3: putting all together

We denote by \(\eta _{k}^{(i)} \in \mathbb {R}_{\ge 0}^n \) the vector that is only non-zero at coordinate i, and satisfies \(\eta _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) - \xi _{k,i}^{(i)} \in [0,\infty )\). In other words, the full gradient

$$\begin{aligned} \nabla f_\mu (\mathsf {x}_k) = \mathbf {E}_i[(0,\dots ,n \nabla _i f_\mu (\mathsf {x}_k),\dots ,0)] = \mathbf {E}_i[ n \eta _{k}^{(i)} + n \xi _k^{(i)} ] \end{aligned}$$

can be decomposed (in expectation) into a large non-negative component \(\eta _k^{(i)} \in [0,\infty )^n\) and a truncated component \(\xi _k^{(i)} \in [-1,1]^n\). Recall that \(\eta _k^{(i)}\) does not contribute to the descent steps (see Line 9 of \(PacLPSolver\)). Now, for any \(u \in \varDelta _{\mathsf {box}}\), we can use a basic convexity argument and the mirror descent lemma to compute that

$$\begin{aligned}&\alpha _k (f_\mu (\mathsf {x}_k) - f_\mu (u)) \le \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k-u \big \rangle \nonumber \\&\quad = \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \big \rangle + \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1}-u \big \rangle \nonumber \\&\quad = \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \big \rangle + \mathbf {E}_{i} \left[ \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + \big \langle n \alpha _k\xi _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle \right] \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{=} \frac{(1-\tau ) \alpha _k}{\tau } \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {y}_{k-1} - \mathsf {x}_k \big \rangle \nonumber \\&\qquad + \mathbf {E}_{i} \left[ \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + \big \langle n \alpha _k\xi _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle \right] \end{aligned}$$
(3.1)
$$\begin{aligned}&\quad \overset{\textcircled {{\small 2}}}{\le }\frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1})-f_\mu (\mathsf {x}_k)) \nonumber \\&\qquad + \mathbf {E}_{i} \Big [\; \boxed {\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle } \nonumber \\&\qquad + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] \end{aligned}$$
(3.2)

Above, \(\textcircled {{\small 1}}\)is because \(\mathsf {x}_{k} = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\), which implies that \(\tau (\mathsf {x}_k - \mathsf {z}_{k-1}) = (1-\tau ) (\mathsf {y}_{k-1} - \mathsf {x}_k)\). \(\textcircled {{\small 2}}\)uses convexity and Lemma 3.7. The above computation is motivated by [5], and as we shall see below, it allows one to linearly couple gradient and mirror steps.

Intuitively, the first term in the box of (3.2) is the loss introduced by the large gradient component \(\eta _k^{(i)}\). This part was truncated away and therefore did not contribute to the mirror step. The second term in the box is the loss introduced by mirror descent on the small gradient \(\xi _k^{(i)}\) in Lemma 3.7.
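The decomposition itself is easy to verify numerically. The toy sketch below is our own illustration (a made-up gradient vector, not the solver): it applies the packing truncation \(\min \{v,1\}\), under the assumption that gradients lie in \([-1,\infty )\), and checks that the coordinate estimators average back to the full gradient.

```python
def decompose_gradient(grad):
    """Split grad coordinate-wise into the truncated part xi in [-1,1]^n
    and the large non-negative part eta = grad - xi >= 0
    (packing truncation T(v) = min(v, 1); entries assumed >= -1)."""
    xi = [min(g, 1.0) for g in grad]
    eta = [g - t for g, t in zip(grad, xi)]
    return xi, eta

def expected_estimator(grad):
    """Average over a uniform coordinate i of the one-hot estimator
    n*(eta_i + xi_i)*e_i; this recovers the full gradient."""
    xi, eta = decompose_gradient(grad)
    n = len(grad)
    acc = [0.0] * n
    for i in range(n):                 # each i contributes with weight 1/n
        acc[i] += n * (eta[i] + xi[i]) / n
    return acc
```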

Now comes an important observation. As shown by Lemma 3.10 below, the performance of the gradient step—that is, the objective decrease of \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\)—is at least proportional to the total loss incurred in the box. Intuitively, this means that the progress in the gradient step is so large that it outweighs not only the loss from mirror descent (as is typical in accelerated gradient analyses [5, 29]) but also the loss term introduced by \(\eta _k^{(i)}\).

Lemma 3.10

(gradient descent total guarantee) For every \(u\ge 0\),

$$\begin{aligned} \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 3n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})). \end{aligned}$$

The proof of Lemma 3.10 is a careful case analysis together with several simple applications of Lemma 3.9. We remark that to properly upper bound \(\langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \rangle \), one needs a good upper bound on the coordinates of \(\mathsf {z}_{k-1}\). This is exactly where we need our redundant constraint, which ensures \(\mathsf {z}_{k-1,i} \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) (see Remark 2.5).

Proof of Lemma 3.10

There are three possibilities:

  • If \(\eta _{k,i}^{(i)}=0\), then we must have \(\xi _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) \in [-1,1]\). Lemma 3.9 implies

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad = n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 2 n^2 \alpha _k^2 L \cdot ( f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) ) \end{aligned}$$
  • If \(\eta _{k,i}^{(i)} > 0\) and \(\mathsf {z}_{k,i}^{(i)} > 0\), then we precisely have \(\mathsf {z}_{k,i}^{(i)} = \mathsf {z}_{k-1,i} - \frac{n\alpha _k}{\Vert A_{:i}\Vert _{\infty }}\) (see Proposition 3.2), and accordingly \(\mathsf {y}_{k,i}^{(i)} = \mathsf {x}_{k,i} - \frac{1}{L \Vert A_{:i}\Vert _{\infty }} < \mathsf {x}_{k,i}\). In this case,

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{1}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{<} n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{1}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} n \alpha _k L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }\big ( 2 n \alpha _k L + 2 n^2 \alpha _k^2 L \big ) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

    Above, \(\textcircled {{\small 1}}\)follows from the fact that \(\mathsf {z}_{k-1} \in \varDelta _{\mathsf {box}}\) and therefore \(\mathsf {z}_{k-1,i} \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) by the definition of \(\varDelta _{\mathsf {box}}\), and \(u\ge 0\); \(\textcircled {{\small 2}}\)follows from the fact that \(\mathsf {x}_k\) and \(\mathsf {y}_k^{(i)}\) are only different at coordinate i, and \(\xi _{k,i}^{(i)}=1 < \nabla _i f_\mu (\mathsf {x}_k)\) (since \(\eta _{k,i}^{(i)}>0\)); \(\textcircled {{\small 3}}\)follows from the fact that \(\mathsf {y}_{k}^{(i)} = \mathsf {x}_{k} - \frac{\mathbf {e}_i}{L \Vert A_{:i}\Vert _{\infty }}\); and \(\textcircled {{\small 4}}\)uses Lemma 3.9.

  • If \(\eta _{k,i}^{(i)} > 0\) and \(\mathsf {z}_{k,i}^{(i)} = 0\), then we have

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }\big ( n \alpha _k \nabla _i f_\mu (\mathsf {x}_k) \cdot \mathsf {z}_{k-1,i} \big ) + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{=} \big \langle n \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }4 n^2 \alpha _k^2 L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

    Above, \(\textcircled {{\small 1}}\)is because \(u \ge 0\), \(\nabla _i f_\mu (\mathsf {x}_k) = \eta _{k,i}^{(i)} + 1 > \eta _{k,i}^{(i)}\) and \(\nabla _i f_\mu (\mathsf {x}_k) > \xi _{k,i}^{(i)}\); \(\textcircled {{\small 2}}\)uses the assumption that \(\mathsf {z}_{k,i}^{(i)} = 0\) and the fact that \(\mathsf {z}_{k-1,\ell }=\mathsf {z}_{k,\ell }^{(i)}\) for every \(\ell \ne i\); \(\textcircled {{\small 3}}\)is from our choice of \(\mathsf {y}_k\) which satisfies that \(\mathsf {z}_{k-1} -\mathsf {z}_k^{(i)} = n \alpha _k L (\mathsf {x}_k - \mathsf {y}_k^{(i)})\); and \(\textcircled {{\small 4}}\)uses Lemma 3.9.

Combining the three cases, and using the fact that \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), we conclude that

$$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \le (2n \alpha _k L + 4n^2 \alpha _k^2 L) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) \\&\quad \le 3 n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

Above, the last inequality uses our choice of \(\alpha _k\), which implies \(n \alpha _k \le n \alpha _T = \frac{1}{\varepsilon L} \le \frac{1}{4}\).

Plugging Lemma 3.10 back into (3.2), we have

$$\begin{aligned}&\alpha _k (f_\mu (\mathsf {x}_k) - f_\mu (u)) \le \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k-u \big \rangle \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{\le }\frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1})-f_\mu (\mathsf {x}_k)) \nonumber \\&\quad \quad + \mathbf {E}_{i} \Big [ 3 n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] \nonumber \\&\quad \overset{\textcircled {{\small 2}}}{\le }\alpha _k f_\mu (\mathsf {x}_k) + \big (3n\alpha _k L - \alpha _k \big ) f_\mu (\mathsf {y}_{k-1}) \nonumber \\&\quad \quad + \mathbf {E}_{i} \Big [ -3 n \alpha _k L \cdot f_\mu (\mathsf {y}_k^{(i)}) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] . \end{aligned}$$
(3.3)

Above, \(\textcircled {{\small 1}}\)uses Lemma 3.10; and \(\textcircled {{\small 2}}\)is because we have chosen \(\tau \) to satisfy \(\frac{1}{\tau } = 3 n L \).

Next, recall that we have picked \(\alpha _{k}\) so that \((3n L - 1) \alpha _k = 3n L \cdot \alpha _{k-1}\) in Algorithm 1. Telescoping (3.3) for \(k=1,\dots ,T\) and choosing \(u^*=(1-\varepsilon /2) x^*\), we have

$$\begin{aligned}&- \sum _{k=1}^T \alpha _k f_\mu (u^*) \le 3 f_\mu (\mathsf {y}_0) - 3 n\alpha _T L \cdot \mathbf {E}[f_\mu (\mathsf {y}_T)] + \Vert \mathsf {z}_0 - u^*\Vert _A^2 \\&\quad \le - 3 n\alpha _T L \cdot \mathbf {E}[f_\mu (\mathsf {y}_T)] + \mathsf {OPT}. \end{aligned}$$

Here, the second inequality is due to \(f_\mu (\mathsf {y}_0) = f_\mu (x^{\mathsf {start}}) \le 0\) from Fact 2.8, and the fact that

$$\begin{aligned}\Vert \mathsf {z}_0 - u^*\Vert _A^2 = \Vert u^*\Vert _A^2 = \sum _{i=1}^n (u^*_i)^2 \cdot \Vert A_{:i}\Vert _{\infty }\le \sum _{i=1}^n (x^*_i)^2 \cdot \Vert A_{:i}\Vert _{\infty }\le \sum _{i=1}^n x^*_i = \mathsf {OPT}. \end{aligned}$$

Finally, using the fact that \(\sum _{k=1}^T \alpha _k = \alpha _T \cdot \sum _{k=0}^{T-1} \big (1 - \frac{1}{3nL}\big )^k = 3 n \alpha _T L \big (1-(1-\frac{1}{3nL})^T\big )\), we rearrange and obtain that

$$\begin{aligned} \mathbf {E}[f_\mu (\mathsf {y}_T)] \le&\frac{\sum _k \alpha _k}{3 n\alpha _T L} f_\mu (u^*) + \frac{1}{3 n\alpha _T L} \mathsf {OPT}= \big (1-\big (1-\frac{1}{3nL}\big )^T\big ) f_\mu (u^*) \\&+ \frac{1}{3 n\alpha _T L} \mathsf {OPT}. \end{aligned}$$

We choose \(T = \lceil 3 n L \log (1/\varepsilon ) \rceil \) so that \(\frac{1}{n\alpha _T L} = (1-\frac{1}{3 n L})^T \le \varepsilon \). Combining this with the fact that \(f_\mu (u^*)\le -(1-\varepsilon )\mathsf {OPT}<0\) (see Proposition 2.4.a), we obtain

$$\begin{aligned} \mathbf {E}[f_\mu (\mathsf {y}_T)] \le (1-\varepsilon )f_\mu (u^*) + \varepsilon /3 \cdot \mathsf {OPT}< -(1-3\varepsilon )\mathsf {OPT}. \end{aligned}$$

Therefore, we have finished proving Theorem 3.4. \(\square \)
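The step-size bookkeeping in this proof can be sanity-checked numerically. The sketch below (with arbitrary placeholder values for n, L and T) generates \(\alpha _k\) from the recurrence \((3nL-1)\alpha _k = 3nL\,\alpha _{k-1}\) and verifies the closed form \(\sum _{k=1}^T \alpha _k = 3n\alpha _T L\big (1-(1-\frac{1}{3nL})^T\big )\) used above:

```python
import math

def alpha_schedule(n, L, T, alpha0=1.0):
    """Generate alpha_1 .. alpha_T from the recurrence
    (3nL - 1) * alpha_k = 3nL * alpha_{k-1}."""
    r = 3 * n * L
    alphas = [alpha0]
    for _ in range(T):
        alphas.append(alphas[-1] * r / (r - 1))
    return alphas[1:]

def sum_identity_holds(n, L, T):
    """Check sum_k alpha_k == 3*n*alpha_T*L * (1 - (1 - 1/(3nL))^T)."""
    alphas = alpha_schedule(n, L, T)
    lhs = sum(alphas)
    rhs = 3 * n * L * alphas[-1] * (1 - (1 - 1 / (3 * n * L)) ** T)
    return math.isclose(lhs, rhs, rel_tol=1e-9)
```

The identity holds for any \(\alpha _0 > 0\), since the recurrence fixes each \(\alpha _k\) as a geometric multiple of \(\alpha _T\).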

4 Relaxation of the covering linear program

Since \(PacLPSolver\) gives only an approximate packing LP solution, we cannot infer from it a dual covering LP solution. Therefore, we have to work on a relaxed version of covering LP directly. For input matrix \(A \in \mathbb {R}_{\ge 0}^{m \times n}\), we rewrite the covering LP problem (1.2) as follows in order to be notationally close to packing LP:

$$\begin{aligned} \min _{x \ge 0 } \{ {\mathbbm {1}}^{T} x \,:\, A x \ge {\mathbbm {1}}\} . \end{aligned}$$
(4.1)

We denote by \(\mathsf {OPT}\) the optimal value to this LP, and by \(x^*\) any of its optimal solutions. We say that x is a \((1+\varepsilon )\)-approximation for the covering LP if \(A x \ge {\mathbbm {1}}\) and \({\mathbbm {1}}^{T} x \le (1+\varepsilon )\mathsf {OPT}\).

Again, we use indices \(i\in [n]\) for the columns of A, and indices \(j\in [m]\) for the rows of A. We denote by \(A_{:i}\) the i-th column vector of A, and \(A_{j :}\) the j-th row vector of A. We assume without loss of generality by simple scaling thatFootnote 10

$$\begin{aligned} \min _{j\in [m]}\{\Vert A_{j :}\Vert _{\infty }\}=1 . \end{aligned}$$
(4.2)

Proposition 4.1

The normalization (4.2) ensures \(\mathsf {OPT}\in [1,m]\).

Proof

Suppose that \(j^*\) is the row achieving the smallest infinity norm \(\Vert A_{j :}\Vert _{\infty }\) over all rows \(j\in [m]\). Then, for any solution \(x \in \mathbb {R}_{\ge 0}^n\) satisfying \(\langle A_{j^* :}, x \rangle \ge 1\), we must have \({\mathbbm {1}}^{T} x \ge 1 / \Vert A_{j^* :} \Vert _{\infty } = 1\) using (4.2). On the other hand, we can construct a feasible solution x as follows. Initialize \(x = 0\); then, for each row j, find the column i that maximizes the entry \(A_{j i}\) over all columns, and increase \(x_i\) by \(1/ A_{j i} = 1 / \Vert A_{j :}\Vert _{\infty }\). After exhausting all m rows, we arrive at some \(x \ge 0\) satisfying \(Ax \ge {\mathbbm {1}}\) as well as \({\mathbbm {1}}^{T} x = \sum _j 1 / \Vert A_{j :}\Vert _{\infty }\le m\). \(\square \)
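Both the normalization (4.2) and the constructive half of this proof are directly implementable. The sketch below is our own illustration for dense nonnegative matrices with no all-zero rows: it scales A so that \(\min _j \Vert A_{j:}\Vert _{\infty }=1\), then builds the feasible x from the proof.

```python
def normalize_rows(A):
    """Scale the nonnegative matrix A so that min_j ||A_j:||_inf = 1
    (normalization (4.2)); scaling A rescales OPT correspondingly."""
    m = min(max(row) for row in A)       # entries nonnegative, rows nonzero
    return [[a / m for a in row] for row in A]

def crude_cover(A):
    """Build x >= 0 with Ax >= 1 and 1^T x = sum_j 1/||A_j:||_inf <= m,
    as in the proof of Proposition 4.1."""
    n = len(A[0])
    x = [0.0] * n
    for row in A:
        i = max(range(n), key=lambda t: row[t])   # column of largest entry
        x[i] += 1.0 / row[i]                      # = 1 / ||A_j:||_inf
    return x
```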

In our covering LP solvers, we assume that an initial solution achieving a constant approximation is available to the algorithm. Such a solution can be obtained, for instance, by running the covering LP solver of Young [36] with a constant \(\varepsilon \), in time \(O(N \log N)\).

Definition 4.2

Let \(x^{\sharp }\) be a given 2-approximate solution to the covering problem, and let \(\mathsf {OPT}' {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\mathbbm {1}}^{T} x^{\sharp } \in [\mathsf {OPT}, 2\mathsf {OPT}]\). Without loss of generality, assume \(\mathsf {OPT}' \ge 2\).

We now introduce the smoothed objective \(f_\mu (x)\) we are going to minimize in order to solve covering LP. Symmetric to the case in the packing solver, this smoothed objective \(f_\mu (x)\) turns each row of the LP constraint \(Ax \ge {\mathbbm {1}}\) into an exponential penalty function.

Definition 4.3

Letting \(\mu {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{\varepsilon }{4 \log (nm/\varepsilon )}\), we define the smoothed objective \(f_\mu (x)\) as

$$\begin{aligned} \textstyle f_\mu (x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mu \sum _{j=1}^m e^{\frac{1}{\mu } (1 - (Ax)_j)} + {\mathbbm {1}}^{T} x . \end{aligned}$$

Fact 4.4

\(\nabla f_\mu (x) = {\mathbbm {1}}- A^T p(x)\) and \(\nabla ^2 f_\mu (x) = \frac{1}{\mu } A^T \mathrm {diag}\{p(x)\}A\), where \(p_j(x) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}e^{\frac{1}{\mu } (1 - (Ax)_j)}\).
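Definition 4.3 and Fact 4.4 can be checked numerically; the sketch below (toy data, our own illustration) evaluates \(f_\mu \) and its closed-form gradient, and compares the latter against central finite differences.

```python
import math

def f_mu(A, x, mu):
    """Smoothed covering objective (Definition 4.3):
    mu * sum_j exp((1 - (Ax)_j)/mu) + 1^T x."""
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]
    return mu * sum(math.exp((1 - v) / mu) for v in Ax) + sum(x)

def grad_f_mu(A, x, mu):
    """Fact 4.4: grad f_mu(x) = 1 - A^T p(x), with
    p_j(x) = exp((1 - (Ax)_j)/mu)."""
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]
    p = [math.exp((1 - v) / mu) for v in Ax]
    n = len(x)
    return [1.0 - sum(A[j][i] * p[j] for j in range(len(A)))
            for i in range(n)]
```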

We present some properties of \(f_\mu (x)\). Together they imply that the minimum of \(f_\mu (x)\) is around \(\mathsf {OPT}\), and that finding the approximate minimum of \(f_\mu (x)\) up to an additive error of \(O(\varepsilon \mathsf {OPT})\) corresponds to a \((1+O(\varepsilon ))\)-approximate solution to the covering LP (4.1). The proofs are analogous to those in Sect. 2, and are included in “Appendix B.2” for completeness.

Proposition 4.5

  (a) \(f_\mu (u^*) \le (1+\varepsilon )\mathsf {OPT}\) for \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) x^*\).

  (b) \(f_\mu (x) \ge (1-\varepsilon )\mathsf {OPT}\) for every \(x \ge 0\).

  (c) For any \(x \ge 0\) satisfying \(f_\mu (x) \le 2\mathsf {OPT}\), we must have \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\).

  (d) If \(x\ge 0\) satisfies \(f_\mu (x) \le (1+\delta )\mathsf {OPT}\) for some \(\delta \in [0,1]\), then \(\frac{1}{1-\varepsilon } x\) is a \(\frac{1+\delta }{1-\varepsilon }\)-approximate solution to the covering LP.

5 Our covering LP solver in the well-conditioned case

Recall that in packing LPs, since \(0 \le x^*_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) (see Fact 2.1), we can minimize \(f_\mu \) over a bounding box \(\varDelta _{\mathsf {box}}\). Unfortunately, the bound \(x^*_i \le \frac{1}{\Vert A_{:i}\Vert _{\infty }}\) no longer holds in covering LPs, so one cannot directly turn \(PacLPSolver\) into its symmetric version to solve covering LP.

In this section, we show that this symmetric covering LP solver still solves all well-behaved covering LP instances. Specifically, we say the covering LP is well-behaved if:Footnote 11

Assumption 5.1

There exists some optimal covering LP solution \(x^*\) satisfying \(x^*_i \le \frac{9}{\Vert A_{:i}\Vert _{\infty }}\); and the initial point \(x^{\sharp }\) satisfies \(x^{\sharp }_i \le \frac{9}{\Vert A_{:i}\Vert _{\infty }}\).

For instance, well-behaved instances naturally arise from those where the constraints \(A x \ge {\mathbbm {1}}\) are non-redundant. If the optimal covering LP solution \(x^*\) and the initial point \(x^{\sharp }\) satisfy \({\mathbbm {1}}\le A x^* \le 9 \cdot {\mathbbm {1}}\) and \({\mathbbm {1}}\le A x^{\sharp } \le 9 \cdot {\mathbbm {1}}\), then Assumption 5.1 is satisfied.

Well-behaved covering LP problems immediately satisfy the following:

Fact 5.2

Define \(\varDelta _{\mathsf {box}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ x \in \mathbb {R}^n \,:\, x_i \in \big [0, \frac{10}{\Vert A_{:i}\Vert _{\infty }}\big ] \}\). Under Assumption 5.1, we have \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2)x^* \in \varDelta _{\mathsf {box}}\) and \(x^{\mathsf {start}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) \cdot x^{\sharp } \in \varDelta _{\mathsf {box}}\). Also, it satisfies \(f_\mu (x^{\mathsf {start}}) \le 3 \mathsf {OPT}\).

Proof

The claims \(u^*, x^{\mathsf {start}}\in \varDelta _{\mathsf {box}}\) are trivial once we notice \(\varepsilon \le 1/30\). Using the fact that \(Ax^{\mathsf {start}}- {\mathbbm {1}}= (1+\varepsilon /2) A x^{\sharp } - {\mathbbm {1}}\ge \varepsilon /2 \cdot {\mathbbm {1}}\), we compute \(f_\mu (x^{\mathsf {start}})\) as follows:

$$\begin{aligned} f_\mu (x^{\mathsf {start}})&= \mu \sum _j e^{\frac{1}{\mu } (1 - (Ax^{\mathsf {start}})_j)} + {\mathbbm {1}}^{T} x^{\mathsf {start}}\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} + 2\mathsf {OPT}\\&\le \frac{\mu m}{(nm)^2} + 2\mathsf {OPT}< 3 \mathsf {OPT}. \end{aligned}$$

\(\square \)

figure d

We now describe \(CovLPSolver^{\mathsf {wb}}\) (a symmetric variant of \(PacLPSolver\)) for solving well-behaved covering LP problems; see Algorithm 2. It starts with the initial vectors \(\mathsf {x}_0 = \mathsf {y}_0 = x^{\mathsf {start}}\) and \(\mathsf {z}_0 = 0\). \(CovLPSolver^{\mathsf {wb}}\) runs for T iterations. In each iteration k, it computes a weighted midpoint \(\mathsf {x}_k \leftarrow \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) for some parameter \(\tau \in (0,1)\), and then proceeds to compute \(\mathsf {y}_k\) and \(\mathsf {z}_k\) as follows.

We select \(i\in [n]\) uniformly at random. Let \(\xi _k^{(i)} = (0,\dots ,0,-\mathbb {T}^\mathsf {p}(v),0,\dots ,0)\) be the vector that is only non-zero at coordinate i, where \(v = - \nabla _i f_\mu (\mathsf {x}_k) \in [-1,\infty )\), and recall \(\mathbb {T}^\mathsf {p}(v) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\min \{v, 1\}\). We refer to \(\xi _k^{(i)}\) as the truncated gradient. Next,

  • Perform a mirror (descent) step: \(\mathsf {z}_k \leftarrow \mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) for some parameter \(\alpha _k \ll 1/n\) to be chosen later.

  • Perform a gradient (descent) step: \(\mathsf {y}_k \leftarrow \mathsf {y}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mathsf {x}_k + \frac{1}{n \alpha _k L} (\mathsf {z}_k^{(i)} - \mathsf {z}_{k-1})\).
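A single iteration of this scheme can be sketched in a few lines. The code below is our own illustration with generic parameters \(\tau , \alpha _k, L\); the gradient oracle `grad_i` and the column norms `col_norms` are placeholders we assume given, and the box upper bound defaults to \(C = 10\) as in Fact 5.2.

```python
def clamp(v, lo, hi):
    return min(max(v, lo), hi)

def cov_wb_step(y_prev, z_prev, i, grad_i, tau, alpha, L, col_norms, C=10.0):
    """One iteration of the well-behaved covering solver (sketch):
    midpoint -> truncated gradient -> mirror step -> gradient step."""
    n = len(y_prev)
    # weighted midpoint x_k = tau*z_{k-1} + (1-tau)*y_{k-1}
    x = [tau * z + (1 - tau) * y for z, y in zip(z_prev, y_prev)]
    v = -grad_i(x, i)                      # v in [-1, inf)
    xi_i = -min(v, 1.0)                    # truncated gradient T^p(v)
    # coordinate mirror step over [0, C/||A_:i||_inf] (Proposition 3.2
    # with delta = n * alpha * xi_i)
    z = list(z_prev)
    z[i] = clamp(z[i] - n * alpha * xi_i / col_norms[i],
                 0.0, C / col_norms[i])
    # gradient step y_k = x_k + (z_k - z_{k-1}) / (n * alpha * L)
    y = list(x)
    y[i] += (z[i] - z_prev[i]) / (n * alpha * L)
    return x, y, z
```

Only coordinate i of z and y moves, matching the O(N / n) per-iteration cost.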

This finishes the description of \(CovLPSolver^{\mathsf {wb}}\). It is not surprising that the following theorem, analogous to Theorem 3.4, holds; we include its proof in “Appendix C.3” for completeness.

figure e

Again, using the same proof as Corollary 3.5, one can apply Markov’s bound to turn Theorem 5.3 into a probabilistic statement:

figure f

Removing the well-behavior assumption. In subsequent work, appearing after the conference presentation of this paper, Wang, Mahoney, and Rao [34] showed the following theorem.

Theorem 5.5

([34]) Any covering LP with constraint matrix \(A \in \mathbb {R}^{m \times n}_{\ge 0}\) of sparsity N can be converted into an equivalent but well-behaved covering LP with matrix \(\widetilde{A} \in \mathbb {R}^{m \times n \cdot O(\log (mn/\varepsilon ))}\) and sparsity \(N \cdot O(\log (mn/\varepsilon ))\). The conversion takes time \(N \cdot O(\log (mn/\varepsilon )).\)

As a result, we can apply our covering solver \(CovLPSolver^{\mathsf {wb}}\) to this modified LP and invoke Theorem 5.3 to solve any covering LP in expected time \( O(\frac{\log ^2(nm/\varepsilon ) \log (1/\varepsilon )}{\varepsilon } N).\)

6 Our covering LP solver in the general case

In this section, we remove the well-behavior assumption and propose a different algorithm, \(CovLPSolver\), that solves all covering LP instances. This algorithm loses a factor of \(1/\sqrt{\varepsilon }\) in the running time, but is a direct covering LP solver that does not rely on any reduction.

The main difference from \(PacLPSolver\) and \(CovLPSolver^{\mathsf {wb}}\) is that this time we abandon the box constraint and study the minimization of \(f_\mu (x)\) over a simplex

$$\begin{aligned} x\in \varDelta _{\mathsf {simplex}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ x \in \mathbb {R}^n \,:\, x_i \ge 0 \; \wedge \; {\mathbbm {1}}^{T} x\le 2 \mathsf {OPT}'\} . \end{aligned}$$

Again, the constraint \({\mathbbm {1}}^{T} x \le 2\mathsf {OPT}'\) is redundant, just like the old \(\varDelta _{\mathsf {box}}\) constraint for packing LP (recall Remark 2.5); however, it shall be used to ensure that our updates always stay inside \(\varDelta _{\mathsf {simplex}}\). It is a simple fact that

Fact 6.1

\(u^*{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2)x^* \in \varDelta _{\mathsf {simplex}}\).

Recall that the initial vector \(x^\sharp \) is defined in Definition 4.2, and \(\mathsf {OPT}'\) is a crude approximation to \(\mathsf {OPT}\), satisfying \(\mathsf {OPT}' {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\mathbbm {1}}^{T} x^{\sharp } \in [\mathsf {OPT}, 2\mathsf {OPT}]\). We choose a starting vector \(x^{\mathsf {start}}\) different from that of Sect. 5:

Proposition 6.2

Letting \(x^{\mathsf {start}}{\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) \cdot x^{\sharp } + (\frac{1}{n},\dots ,\frac{1}{n})\), we have \(x^{\mathsf {start}}\in \varDelta _{\mathsf {simplex}}\) and \(f_\mu (x^{\mathsf {start}}) \le 4 \mathsf {OPT}\).

Proof

Using \(Ax^{\mathsf {start}}- {\mathbbm {1}}\ge (1+\varepsilon /2) A x^{\sharp } - {\mathbbm {1}}\ge \varepsilon /2 \cdot {\mathbbm {1}}\), we compute \(f_\mu (x^{\mathsf {start}})\) as follows:

$$\begin{aligned} f_\mu (x^{\mathsf {start}})&= \mu \sum _j e^{\frac{1}{\mu } (1 - (Ax^{\mathsf {start}})_j)} + {\mathbbm {1}}^{T} x^{\mathsf {start}}\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} + 2\mathsf {OPT}+ 1\\&\le \frac{\mu m}{(nm)^2} + 3\mathsf {OPT}< 4 \mathsf {OPT}. \end{aligned}$$

Also, we have \({\mathbbm {1}}^{T} x^{\mathsf {start}}\le (1+\varepsilon /2) \mathsf {OPT}'+1 \le 2 \mathsf {OPT}'\). (Recall \(\mathsf {OPT}'\ge 2\) in Definition 4.2.) \(\square \)

figure g

Our proposed algorithm \(CovLPSolver\) starts with the initial vector \(\mathsf {x}_0 = \mathsf {y}_0 = \mathsf {z}_0 = x^{\mathsf {start}}\) and is divided into T iterations. In each iteration k, as usual, it computes a weighted midpoint \(\mathsf {x}_k \leftarrow \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) for some parameter \(\tau \in (0,1)\), and then computes \(\mathsf {y}_k\) and \(\mathsf {z}_k\) as follows.

We select \(i\in [n]\) uniformly at random, and let \(\xi _k^{(i)} = (0,\dots ,0,\mathbb {T}^\mathsf {c}(v),0,\dots ,0)\) be the vector that is only non-zero at coordinate i, where \(v = \nabla _i f_\mu (\mathsf {x}_k) \in (-\infty ,1]\) and \(\mathbb {T}^\mathsf {c}(v) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\max \{-\beta , v\}\) is the new thresholding function for some parameter \(\beta {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\sqrt{\varepsilon }\). Then,

  • Perform a mirror (descent) step \(\mathsf {z}_k \leftarrow \mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle (1+\gamma ) n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) for some positive parameters \(\gamma \ll 1\) and \(\alpha _k \ll 1/n\), where

    $$\begin{aligned}\textstyle V_x(y) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\sum _{i=1}^n y_i \log \frac{y_i}{x_i} + x_i - y_i\end{aligned}$$

    is the so-called Bregman divergence of the generalized entropy function (cf. [12]).

  • If \(\nabla _i f_\mu (\mathsf {x}_k) < -\beta \), perform a gradient (descent) step \(\mathsf {y}_k \leftarrow \mathsf {y}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\mathsf {x}_k + \delta \mathbf {e}_i\) for some \(\delta >0\). In practice, one can line-search over \(\delta \), but we choose an explicit \(\delta \) as follows.

    • Denote by \(\pi \) the permutation that satisfies \(A_{\pi (1),i} \le \cdots \le A_{\pi (m),i}\).

    • Pick \(j^* \in [m]\) s.t. \(\left\{ \begin{array}{ll} \sum _{j<j^*} A_{\pi (j),i} \cdot p_{\pi (j)}(\mathsf {x}_k) < 1+\beta \\ \sum _{j\le j^*} A_{\pi (j),i} \cdot p_{\pi (j)}(\mathsf {x}_k) \ge 1+\beta \end{array} \right. \,.\)  Such \(j^*\) exists because

      $$\begin{aligned} \textstyle \sum _{j=1}^m A_{\pi (j),i} \cdot p_{\pi (j)}(\mathsf {x}_k) = \sum _{j=1}^m A_{ji} \cdot p_j(\mathsf {x}_k) = 1 - \nabla _i f_\mu (\mathsf {x}_k) \ge 1+\beta . \end{aligned}$$
      (6.1)
    • Set \(\delta = \frac{\mu \beta }{2 A_{\pi (j^*),i}}\).

This finishes the description of our \(CovLPSolver\).
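The selection of \(j^*\) and \(\delta \) above amounts to a prefix-sum search over the sorted column. A sketch, with `A` as a list of rows, `p` the vector of penalties \(p_j(\mathsf {x}_k)\), and all names illustrative:

```python
def gradient_step_size(A, p, i, beta, mu):
    """Return delta = mu*beta / (2 * A[pi(j*)][i]) as in CovLPSolver (sketch).
    Assumes coordinate i is a large-gradient index, i.e.
    sum_j A[j][i] * p[j] = 1 - grad_i f_mu(x_k) >= 1 + beta."""
    m = len(A)
    order = sorted(range(m), key=lambda j: A[j][i])   # the permutation pi
    prefix = 0.0
    for j in order:
        prefix += A[j][i] * p[j]
        if prefix >= 1.0 + beta:                      # j plays the role of pi(j*)
            return mu * beta / (2.0 * A[j][i])
    raise ValueError("coordinate i is not a large-gradient index")
```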

Remark 6.3

We use the superscript \(^{(i)}\) on \(\xi _k^{(i)}\), \(\mathsf {y}_k^{(i)}\) and \(\mathsf {z}_k^{(i)}\) to emphasize that the value depends on the choice of i. We have used generic parameters \(\tau ,\alpha _k,T\) in the above description and their precise values are presented in Algorithm 3.

Our update on \(\mathsf {y}_k\) is a “gradient descent step” because we shall prove that it never increases the objective (i.e., \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\ge 0\)). Our update on \(\mathsf {z}_k\) is a “mirror descent step” because we shall apply the standard mirror descent analysis [12] to it. We explicitly describe how to implement this mirror step (proved in “Appendix D.4”):

Proposition 6.4

If \(\mathsf {z}_{k-1}\in \varDelta _{\mathsf {simplex}}\) and \(\mathsf {z}_{k-1}>0\), the minimizer \(z = {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle \delta \mathbf {e}_i, z \rangle \big \}\) for any scalar \(\delta \in \mathbb {R}\) and basis vector \(\mathbf {e}_i\) can be computed as follows:

  1. \(z \leftarrow \mathsf {z}_{k-1}\).

  2. \(z_i \leftarrow z_i \cdot e^{-\delta }\).

  3. If \({\mathbbm {1}}^{T} z > 2\mathsf {OPT}'\), \(z \leftarrow \frac{2\mathsf {OPT}'}{{\mathbbm {1}}^{T} z} z\).

  4. Return z.
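The steps of Proposition 6.4 translate directly into code; a sketch with plain Python lists, where `opt_prime` stands for \(\mathsf {OPT}'\) and the names are illustrative:

```python
import math

def entropy_mirror_step(z_prev, i, delta, opt_prime):
    """Minimize V_{z_prev}(z) + delta * z_i over Delta_simplex
    (Proposition 6.4, sketch): a multiplicative update on coordinate i,
    followed by a rescaling whenever the total mass exceeds 2*OPT'."""
    z = list(z_prev)                          # step 1: copy z_{k-1}
    z[i] *= math.exp(-delta)                  # step 2: z_i <- z_i * e^{-delta}
    total = sum(z)
    if total > 2.0 * opt_prime:               # step 3: scale back into simplex
        z = [zj * 2.0 * opt_prime / total for zj in z]
    return z                                  # step 4
```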

We also point out that

Lemma 6.5

Each iteration of \(CovLPSolver\) can be implemented to run in expected O(N / n) time. The total expected running time is O(TN / n).

The proof of Lemma 6.5 is analogous to its packing counterpart and is included in Sect. F.6.

6.1 Main proof ideas and ingredients

In \(CovLPSolver\), we pick a random coordinate \(i\in [n]\) at each iteration, and decompose \(\nabla _i f_\mu (\mathsf {x}_k) = \xi + \eta \), where \(\eta \in (-\infty , 0]\) is the (negative) large gradient component and \(\xi \in [-\sqrt{\varepsilon },1]\) is the truncated gradient component. In other words, we truncate the gradient \(\nabla _i f_\mu (\mathsf {x}_k)\) at the negative threshold \(-\beta = -\sqrt{\varepsilon }\), rather than at \(-1\) as in \(CovLPSolver^{\mathsf {wb}}\).

The reason for this new threshold \(-\sqrt{\varepsilon }\) can be understood as follows. In \(PacLPSolver\) (and symmetrically in \(CovLPSolver^{\mathsf {wb}}\)), we used Lemma 3.10 to show that our gradient descent step \(\mathsf {y}_k\) decreases the objective by an amount that includes both the \(\xi \) and \(\eta \) components. Unfortunately, for covering LP, this decrease is proportional only to \(\eta \) and not to \(\xi \) (compare Lemma 3.10 with Lemma 6.14 later). This forces us to treat the small gradient \(\xi \) separately using mirror descent, rather than gradient descent.

If \(\xi \) were in \([-1, 1]\), the classical theory of mirror descent [12] (or multiplicative weight updates [7]) would imply that the mirror step on \(\mathsf {z}_k\) converges at a rate \(\propto \varepsilon ^{-2}\). This is too slow. Instead, since we have truncated \(\xi \) into the smaller interval \([-\sqrt{\varepsilon },1]\), we can use a negative-width technique (see Sect. 6.5) to improve this mirror-descent convergence rate from \(\varepsilon ^{-2}\) to \(\varepsilon ^{-1.5}\).

On the other hand, due to this truncation at \(-\sqrt{\varepsilon }\) instead of \(-1\), our gradient step on \(\mathsf {y}_k\) also converges more slowly, at a rate \(1/\varepsilon ^{1.5}\) instead of \(1/\varepsilon \). This is why \(\beta = \sqrt{\varepsilon }\) is the best truncation threshold: it balances the gradient and mirror descent steps.
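One heuristic way to see the choice \(\beta = \sqrt{\varepsilon }\): treating the threshold as a free parameter \(\beta \in (0,1]\), the two paragraphs above suggest a mirror-descent rate scaling like \(\beta /\varepsilon ^2\) and a gradient-step rate like \(1/(\varepsilon \beta )\). This scaling is our back-of-the-envelope reading, not a formal claim from the proof; balancing the two rates gives

```latex
% Heuristic balance of the two convergence rates as functions of beta:
\frac{\beta}{\varepsilon^{2}} \;=\; \frac{1}{\varepsilon \beta}
\quad\Longleftrightarrow\quad
\beta^{2} = \varepsilon
\quad\Longleftrightarrow\quad
\beta = \sqrt{\varepsilon},
\qquad \text{giving the common rate } \varepsilon^{-1.5}.
```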

Another ingredient behind our proof is a new distance bound that is uncommon in first-order analysis. Recall that, given a convex function g(x), traditional analysis applies the convexity argument \(g(x) - g(x^*) \le \langle \nabla g(x), x - x^* \rangle \) to bound the objective distance to the optimum. If \(g(x) = e^{-x}\) is univariate, \(x = 1\), and \(x^* = 100\), this bound becomes \(e^{-1} \approx e^{-1} - e^{-100} \le e^{-1} \cdot 99\), which is very loose. In our analysis, we replace this convexity argument with a more benign bound, specifically designed for covering LP (see Lemma 6.10).

6.2 Convergence statement

The main convergence theorem of this section is as follows:

figure h

Again, using the same proof as Corollary 3.5, one can apply Markov’s bound to turn Theorem 6.6 into a probabilistic statement:

figure i

Before delving into the proof of Theorem 6.6, we make the following observations:

Fact 6.8

For every \(k\in \{0,1,\dots ,T\}\), we have \(\mathsf {x}_k, \mathsf {y}_k \ge 0\), \(\mathsf {z}_k > 0\), and \(\mathsf {z}_k \in \varDelta _{\mathsf {simplex}}.\)

Proof

Since \(x^{\mathsf {start}}\) satisfies \({\mathbbm {1}}^{T} x^{\mathsf {start}}\le 2\mathsf {OPT}'\) by Proposition 6.2, we have \(\mathsf {z}_0 = x^{\mathsf {start}}\in \varDelta _{\mathsf {simplex}}\). Also, the mirror descent step (see Proposition 6.4) ensures \(\mathsf {z}_{k,i} > 0\) for all rounds k and coordinates i, as well as \(\mathsf {z}_k \in \varDelta _{\mathsf {simplex}}\) for all rounds k. In contrast, \(\mathsf {x}_k\) and \(\mathsf {y}_k\) may not necessarily lie inside \(\varDelta _{\mathsf {simplex}}\), but they always stay non-negative. \(\square \)

We prove Theorem 6.6 in the subsequent subsections.

6.3 Step 1: distance adjustment

Using convexity, one can obtain

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (u) \le \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - u \rangle \quad \text {for every}\, u\in \varDelta _{\mathsf {simplex}}. \end{aligned}$$
(6.2)

Note that inequality (6.2) can be very loose for exponential functions. For instance, if \(f_\mu (x)\) were as simple as \(e^x\), then the convexity inequality \(e^b - e^a \le e^b \cdot (b-a)\) says

  • when \(b = 2\) and \(a=-10\), we have \(e^{2} - e^{-10} \le 12 e^2\);

  • when \(b=2\) and \(a=-100\), we have \(e^2 - e^{-100} \le 102 e^2\).

Although \(e^{-100} \approx e^{-10}\), the two upper bounds differ by almost a factor of 10.
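These two bounds are easy to check numerically; a quick sketch:

```python
import math

# Check the convexity bound e^b - e^a <= e^b * (b - a) for the two cases above.
def convexity_bound(b, a):
    return math.exp(b) - math.exp(a), math.exp(b) * (b - a)

lhs1, rhs1 = convexity_bound(2.0, -10.0)    # upper bound 12 * e^2
lhs2, rhs2 = convexity_bound(2.0, -100.0)   # upper bound 102 * e^2
# The two left-hand sides agree to about five decimal digits, yet the two
# upper bounds differ by a factor of 102/12 = 8.5.
```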

In this section, we strengthen (6.2) in the special case of \(u = u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2)x^*\). For analysis purposes, let \(\widetilde{A}\) be the adjusted matrix of A, described as follows.

Definition 6.9

(adjusted matrix \(\widetilde{A}\)) For each row \(j\in [m]\), if \((A u^*)_j \le 2\) then we keep it and let \(\widetilde{A}_{j :} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}A_{j :}\). Otherwise—that is, if \((A u^*)_j > 2\)—we define \(\widetilde{A}_{j :} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\frac{2}{(A u^*)_j} \cdot A_{j :}\) to be the same j-th row \(A_{j :}\), scaled down by a factor of \(\frac{2}{(A u^{*})_{j}}\). It is clear from this definition that

$$\begin{aligned} A_{ji} \ge \widetilde{A}_{ji}\quad {\text {for all}}\quad (i, j)\in [n] \times [m] \quad {\hbox {and}} \quad (1+\varepsilon /2)\,{\mathbbm {1}}\le \widetilde{A}u^{*} \le 2 \cdot {\mathbbm {1}}. \end{aligned}$$
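Definition 6.9 can be sketched as a simple row-scaling pass; illustrative code, with `A` as a list of rows and `u_star` playing the role of \(u^*\):

```python
def adjusted_matrix(A, u_star):
    """Build the adjusted matrix of Definition 6.9 (sketch): rows j with
    (A u*)_j > 2 are scaled down by 2 / (A u*)_j; all other rows are kept."""
    A_tilde = []
    for row in A:
        r = sum(a * u for a, u in zip(row, u_star))   # (A u*)_j
        if r > 2.0:
            A_tilde.append([a * 2.0 / r for a in row])
        else:
            A_tilde.append(list(row))
    return A_tilde
```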

Lemma 6.10

(distance adjustment)

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (u^*)&\le \langle {\mathbbm {1}}- A^T p(\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle + \varepsilon \mathsf {OPT}\\&= \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle + \varepsilon \mathsf {OPT}\end{aligned}$$

At a high level, ignoring the negligible term \(\varepsilon \mathsf {OPT}\), Lemma 6.10 strengthens the classical bound via the extra term \(\langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \). This extra term is always non-positive since \(\widetilde{A}\le A\) coordinate-wise, and may be very negative in certain cases.

Proof of Lemma 6.10

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (u^*) = \mu \sum _{j=1}^m \left( e^{\frac{1}{\mu } (1-(A\mathsf {x}_k)_j)} - e^{\frac{1}{\mu } (1-(A u^*)_j)}\right) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }\mu \sum _{j=1}^m \left( e^{\frac{1}{\mu } (1-(A\mathsf {x}_k)_j)} - e^{\frac{1}{\mu } (1-(\widetilde{A}u^*)_j)}\right) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \mu \cdot m \cdot e^{-1/\mu }\\&\quad \overset{\textcircled {{\small 2}}}{\le }\sum _{j=1}^m e^{\frac{1}{\mu } (1-(A\mathsf {x}_k)_j)} \cdot \big ( (\widetilde{A}u^*)_j - (A \mathsf {x}_k)_j \big ) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \varepsilon \mathsf {OPT}\\&\quad = \sum _{j=1}^m p_j(\mathsf {x}_k) \cdot \big ( (\widetilde{A}u^*)_j - (A \mathsf {x}_k)_j \big ) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \varepsilon \mathsf {OPT}\\&\quad = \sum _{j=1}^m p_j(\mathsf {x}_k) \cdot \big ( (A u^*)_j - (A \mathsf {x}_k)_j \big ) + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle \\&\qquad + \sum _{j=1}^m p_j(\mathsf {x}_k) \cdot \big ( (\widetilde{A}u^*)_j - (A u^*)_j \big ) + \varepsilon \mathsf {OPT}\\&\quad = \langle - A^T p(\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle {\mathbbm {1}}, \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle + \varepsilon \mathsf {OPT}. \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)is because if \((A u^*)_j \ne (\widetilde{A}u^*)_j\) for some j, then it must satisfy \((\widetilde{A}u^*)_j = 2\), and therefore \(-e^{\frac{1}{\mu }(1-(A u^*)_j)} \le -e^{\frac{1}{\mu }(1-(\widetilde{A}u^*)_j)} + e^{-1/\mu }\). \(\textcircled {{\small 2}}\)uses the convexity inequality \(e^b - e^a \le e^b \cdot (b-a)\), and the fact that \(\mu m e^{-1/\mu } \ll \varepsilon \mathsf {OPT}\). \(\square \)

6.4 Step 2: gradient truncation

For analysis purposes, let us separate the indices \(i\in [n]\) into large and small ones.

Definition 6.11

We make the following definitions.

  • Let \(B_k {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\{ i \in [n] \,:\, \nabla _i f_\mu (\mathsf {x}_k) < -\beta \}\) and \([n] \setminus B_k\) be the sets of large and small indices, respectively.

  • Let \(\xi _{k} \in [-\beta ,1]^n\) be the truncated gradient so that \(\xi _{k,i} = \mathbb {T}^\mathsf {c}(\nabla _i f_\mu (x_k))\) for each \(i\in [n]\).

  • Let \(\eta _{k} \in (-\infty ,0]^n\) be the large gradient so that \(\nabla f_\mu (\mathsf {x}_k) = \xi _k + \eta _k\). It is clear that

    $$\begin{aligned} \eta _{k,i} = 0\quad \text {for every}\quad i\not \in B_k, \quad \text {and}\quad \eta _{k,i} = (1+\beta ) - (A^T p(\mathsf {x}_k))_i\quad \text {for every}\quad i\in B_k. \end{aligned}$$
  • Let \(\widetilde{\eta }_k \in (-\infty ,\infty )^n\) be the adjusted large gradient so that

    $$\begin{aligned} \widetilde{\eta }_{k,i} = 0\quad \text {for every}\quad i\not \in B_k, \quad \text {and}\quad \widetilde{\eta }_{k,i} = (1+\beta ) - (\widetilde{A}^T p(\mathsf {x}_k))_i\quad \text {for every}\quad i\in B_k. \end{aligned}$$

We denote by \(\eta _k^{(i)} = (0,\dots ,0,\eta _{k,i},0,\dots ,0)\), the vector that is zero at all coordinates other than i, and similarly \(\xi _k^{(i)} = (0,\dots ,\xi _{k,i},\dots ,0)\) and \(\widetilde{\eta }_k^{(i)}=(0,\dots ,\widetilde{\eta }_{k,i},\dots ,0)\). We emphasize that \(\eta _k^{(i)} \ne \eta _{k}\), \(\widetilde{\eta }_k^{(i)} \ne \widetilde{\eta }_{k}\), and \(\xi _k^{(i)} \ne \xi _{k}\).
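The decomposition of Definition 6.11 is simply a coordinate-wise clipping; a sketch, where `grad` is the full gradient vector \(\nabla f_\mu (\mathsf {x}_k)\) and all names are illustrative:

```python
def decompose_gradient(grad, beta):
    """Split grad into xi in [-beta, 1]^n (truncated part) and
    eta in (-inf, 0]^n (large part), with grad = xi + eta (sketch)."""
    xi = [max(-beta, g) for g in grad]        # T^c applied coordinate-wise
    eta = [g - x for g, x in zip(grad, xi)]   # nonzero only on B_k
    large_indices = [i for i, g in enumerate(grad) if g < -beta]
    return xi, eta, large_indices
```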

The following key lemma is closely analogous to (3.1) in the packing LP analysis.

Lemma 6.12

(distance upper bound)

$$\begin{aligned}&f_\mu (\mathsf {x}_k) - f_\mu (u^*) \le \frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \mathbf {E}_i \Big [ \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \rangle \Big ] \\&\qquad +\,\mathbf {E}_{i} \Big [ \langle n \widetilde{\eta }_k^{(i)}, - u^* \rangle \Big ] + \varepsilon \mathsf {OPT}. \end{aligned}$$

Note that if one uses \(\eta _k^{(i)}\) instead of \(\widetilde{\eta }_k^{(i)}\), then Lemma 6.12 becomes trivial to prove just like (3.1). The reason we can have the stronger term \(\widetilde{\eta }_k^{(i)}\) is precisely due to the distance adjustment Lemma 6.10.

Proof of Lemma 6.12

We derive the following sequence of inequalities:

$$\begin{aligned}&\big (f_\mu (\mathsf {x}_k) - f_\mu (u^*)\big ) - \varepsilon \mathsf {OPT}\\&\quad \overset{\textcircled {{\small 1}}}{\le }\langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad = \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \rangle + \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - u^* \rangle + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad \overset{\textcircled {{\small 2}}}{=} \frac{(1-\tau )}{\tau } \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {y}_{k-1} - \mathsf {x}_{k} \rangle + \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad \overset{\textcircled {{\small 3}}}{\le }\frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad = \frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \xi _k + \eta _k, \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k), u^* \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }\frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \xi _k, \mathsf {z}_{k-1} - u^* \rangle \\&\qquad + \langle \widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k) - \eta _k, u^* \rangle \\&\quad \overset{\textcircled {{\small 5}}}{\le }\frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \langle \xi _k, \mathsf {z}_{k-1} - u^* \rangle + \langle - \widetilde{\eta }_k, u^* \rangle \\&\quad = \frac{(1-\tau )}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \mathbf {E}_i \Big [ \langle n \xi _k^{(i)}, 
\mathsf {z}_{k-1} - u^* \rangle + \langle - n \widetilde{\eta }_k^{(i)}, u^* \rangle \Big ] . \end{aligned}$$

Above, \(\textcircled {{\small 1}}\)is due to Lemma 6.10. \(\textcircled {{\small 2}}\)is because \(\mathsf {x}_{k} = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\), which implies that \(\tau (\mathsf {x}_k - \mathsf {z}_{k-1}) = (1-\tau ) (\mathsf {y}_{k-1} - \mathsf {x}_k)\). \(\textcircled {{\small 3}}\)is by the convexity of \(f_\mu (\cdot )\). \(\textcircled {{\small 4}}\)is because \(\langle \eta _{k}, \mathsf {z}_{k-1} \rangle \le 0\), since \(\eta _{k} \le 0\) while \(\mathsf {z}_{k-1}\ge 0\). \(\textcircled {{\small 5}}\)needs some careful justification: for every \(i\not \in B_k\), we have \((\widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k))_i - \eta _{k,i} \le 0 - 0 = - \widetilde{\eta }_{k,i}\); for every \(i\in B_k\), we have

$$\begin{aligned} (\widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k))_i - \eta _{k,i}= & {} (\widetilde{A}^T p(\mathsf {x}_k) - A^T p(\mathsf {x}_k))_i - \big ( (1+\beta ) - (A^T p(\mathsf {x}_k))_i \big ) \\= & {} - \big ( (1+\beta ) - (\widetilde{A}^T p(\mathsf {x}_k))_i \big ) = - \widetilde{\eta }_{k,i} , \end{aligned}$$

where the two equalities follow from the definitions of \(\eta _{k,i}\) and \(\widetilde{\eta }_{k,i}\). \(\square \)

6.5 Step 3: mirror descent guarantee

Our update \(\mathsf {z}_k^{(i)} {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}{\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle (1+\gamma ) n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) is, by its definition, a mirror descent step [12]. We begin by explaining an attempt that is too weak for obtaining the \(\varepsilon ^{-1.5}\) convergence rate.

Using the classical theory, it is not hard to repeat the proof of Lemma 3.7—although changing the distance function from \(\Vert \cdot \Vert _A^2\) to \(V_x(y)\)—and obtain that, as long as \(\xi _{k,i}\) is in \([-1, +1]\) for each coordinate i, for every \(u \in \varDelta _{\mathsf {simplex}}\),

$$\begin{aligned} \mathbf {E}_i \big [ \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u \big \rangle \big ] \le V_{\mathsf {z}_{k-1}}(u) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}(u) \Big ] + O(\alpha _k^2 n ) \mathsf {OPT}. \end{aligned}$$

This inequality only yields the slower \(\varepsilon ^{-2}\) convergence rate; the quantity \(\pm 1\) is also known as the width parameter in the multiplicative-weight-update language [7].

In our lemma below, we make use of the fact that \(\xi _{k,i}\) lies in \([-\beta , +1] \subseteq [-1,+1]\). In essence, this allows us to replace the \(O(\alpha _k^2 n)\) factor with a better \(O(\alpha _k^2 \beta n)\) factor. We call this the negative-width technique.Footnote 12 Formally,

Lemma 6.13

(mirror descent) Denoting by \(\gamma {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}2\alpha _T n\), we have

$$\begin{aligned} \mathbf {E}_i \big [ \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \big \rangle \big ] \le V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] + 12 \mathsf {OPT}\cdot \gamma \alpha _k \beta . \end{aligned}$$

The proof is somewhat technical and included in “Appendix D.4”.

6.6 Step 4: gradient descent guarantee

We show that our gradient step never increases the objective, for every choice of i. In addition, it decreases the objective by an amount proportional to the adjusted large gradient \(\widetilde{\eta }_k^{(i)}\).

Lemma 6.14

(gradient descent) For every \(i\in [n]\), we have

  (a) \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), and

  (b) \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{\mu \beta }{12} \cdot \langle - \widetilde{\eta }_{k}^{(i)}, u^* \rangle .\)

The proof of Lemma 6.14 is quite technical and can be found in “Appendix D.4”.

At a high level, one would hope to prove that the gradient step decreases the objective by an amount proportional to the large gradient \(\eta _k^{(i)}\), rather than the adjusted large gradient \(\widetilde{\eta }_k^{(i)}\). If that were true, the entire proof structure of our covering LP convergence would become much closer to that of packing LP, and there would be no need to introduce the distance adjustment in Sect. 6.3, or the definitions of \(\widetilde{A}\) and \(\widetilde{\eta }\).

Unfortunately, if one replaces \(\widetilde{\eta }\) with \(\eta \) in the above lemma, the inequality becomes false. The reason is very similar to what we summarized in Sect. 6.3, and is related to the fact that the exponential penalty function is not Lipschitz smooth.

6.7 Step 5: putting all together

Combining Lemma 6.12, Lemma 6.13, and Lemma 6.14, we obtain that

$$\begin{aligned}&\alpha _k \big (f_\mu (\mathsf {x}_k) - f_\mu (u^*)\big ) - \alpha _k \varepsilon \mathsf {OPT}\\&\quad \le \frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + \mathbf {E}_i \Big [ \alpha _k \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \rangle \Big ]\\&\qquad +\mathbf {E}_{i} \Big [ \alpha _k \langle n \widetilde{\eta }_k^{(i)}, - u^* \rangle \Big ] \\&\quad \le \frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1}) - f_\mu (\mathsf {x}_k)) + V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] \\&\qquad + 12 \mathsf {OPT}\cdot \gamma \alpha _k \beta + \mathbf {E}_i \Big [ \frac{12\alpha _k n}{\mu \beta } \big (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\big ) \Big ] \end{aligned}$$

Remark 6.15

Above, the quantity “\(12 \mathsf {OPT}\cdot \gamma \alpha _k \beta \)” is the loss term introduced by the mirror descent. Unlike the packing LP case—see (3.2)—this loss term is not dominated by the gradient step. (If it were, \(CovLPSolver\) would enjoy an \(\varepsilon ^{-1}\) convergence rate.)

The quantity “\(\alpha _k \langle n \widetilde{\eta }_k^{(i)}, - u^* \rangle \)” is the loss introduced by the (adjusted) large gradient \(\widetilde{\eta }\), and is dominated by our gradient step progress owing to Lemma 6.14. This is similar to the packing LP case—see Lemma 3.10.

From here, let us use the special choice of \(\tau = \frac{\mu \beta }{12n}\). We obtain that

$$\begin{aligned}&- \alpha _k \big (f_\mu (u^*) + \varepsilon \mathsf {OPT}\big ) \\&\quad \le 12\gamma \alpha _k \beta \mathsf {OPT}+ \frac{(1-\tau )\alpha _k}{\tau } f_\mu (\mathsf {y}_{k-1}) + V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big )\\&\qquad - \mathbf {E}_i \Big [ \frac{\alpha _k}{\tau } f_\mu (\mathsf {y}_k^{(i)}) + V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] . \end{aligned}$$

Using the choice \(\alpha _k = \frac{\alpha _{k-1}}{1-\tau }\) and telescoping the above inequality for \(k=1,\dots ,T\), we have

$$\begin{aligned} -\big (\sum _{k=1}^T \alpha _k \big ) \big (f_\mu (u^*) + \varepsilon \mathsf {OPT}\big )&\le \big (\sum _{k=1}^T \alpha _k\big ) \cdot 12 \gamma \beta \mathsf {OPT}+ \frac{\alpha _0}{\tau } f_\mu (\mathsf {y}_0)\\&\quad + V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) - \frac{\alpha _T}{\tau } \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]. \end{aligned}$$

We compute that \(\sum _{k=1}^T \alpha _k = \alpha _T \cdot \sum _{k=0}^{T-1} (1-\tau )^k = \alpha _T \cdot \frac{1-(1-\tau )^T}{\tau } < \frac{\alpha _T}{\tau }\), and recall that \(\gamma = 2\alpha _T n\). Therefore, we rearrange and get
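This geometric sum is easy to verify numerically; a sketch with illustrative parameter values (not the paper's exact choices of \(\tau \) and T):

```python
# alpha_k = alpha_{k-1} / (1 - tau)  implies  alpha_k = alpha_T * (1-tau)^(T-k),
# so sum_{k=1}^T alpha_k = alpha_T * (1 - (1-tau)^T) / tau < alpha_T / tau.
tau, T, alpha_T = 0.01, 500, 1.0
alphas = [alpha_T * (1.0 - tau) ** (T - k) for k in range(1, T + 1)]
total = sum(alphas)
closed_form = alpha_T * (1.0 - (1.0 - tau) ** T) / tau
```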

$$\begin{aligned} \frac{\alpha _T}{\tau } \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]&\le \frac{\alpha _T}{\tau } \big (f_\mu (u^*) + \varepsilon \mathsf {OPT}\big ) + \frac{\alpha _T}{\tau } \cdot 12 \gamma \beta \mathsf {OPT}+ \frac{\alpha _0}{\tau } f_\mu (\mathsf {y}_0) \nonumber \\&\quad + V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) , \nonumber \\ \implies \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]&\le f_\mu (u^*) + \varepsilon \mathsf {OPT}+ 24 \alpha _T n \beta \mathsf {OPT}+ (1-\tau )^T f_\mu (\mathsf {y}_0) \nonumber \\&\quad + \frac{\tau }{\alpha _T} V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) . \end{aligned}$$
(6.3)

From this point, we need to use our special choice of the initial point \(\mathsf {x}_0 = \mathsf {y}_0 = \mathsf {z}_0 = x^{\mathsf {start}}\) (see Proposition 6.2), which implies that \(f_\mu (\mathsf {y}_0) \le 4\mathsf {OPT}\) and \({\mathbbm {1}}^{T} x^{\mathsf {start}}\le 4\mathsf {OPT}\). We also have

$$\begin{aligned} V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) = V_{x^{\mathsf {start}}}\big (\frac{u^*}{1+\gamma }\big )&= \sum _{i=1}^n \frac{u^*_i}{1+\gamma } \log \frac{u^*_i}{(1+\gamma ) x^{\mathsf {start}}_i} + x^{\mathsf {start}}_{i} - \frac{u^*_i}{1+\gamma } \\&\overset{\textcircled {{\small 1}}}{\le }\sum _{i=1}^n u^*_i \log (u^*_i \cdot n) + 4\mathsf {OPT}\overset{\textcircled {{\small 2}}}{\le }(2\log (nm)+4) \cdot \mathsf {OPT}. \end{aligned}$$

Above, inequality \(\textcircled {{\small 1}}\)follows because \(x^{\mathsf {start}}_i \ge 1/n\) for all \(i\in [n]\) according to the definition in Proposition 6.2; inequality \(\textcircled {{\small 2}}\)follows because each \(u^*_i = (1+\varepsilon /2) x^*_i \le (1+\varepsilon /2)\mathsf {OPT}\le (1+\varepsilon /2)m\) and \({\mathbbm {1}}^{T} u^* = (1+\varepsilon /2)\mathsf {OPT}\), as well as the fact that \(\varepsilon \) is sufficiently small.

Finally, we choose \(\beta = \sqrt{\varepsilon }\), \(T = \lceil \frac{1}{\tau } \log (1/\varepsilon ) \rceil \), and \(\alpha _0\) such that \(\alpha _T = \frac{\varepsilon }{12n\beta }\). Substituting into (6.3) all of these parameters, along with the aforementioned inequalities \(f_\mu (\mathsf {y}_0) \le 4\mathsf {OPT}\) and \(V_{\mathsf {z}_{0}}\big (\frac{u^*}{1+\gamma }\big ) \le (2\log (nm)+4) \cdot \mathsf {OPT}\), as well as \(f_\mu (u^*) \le (1+\varepsilon )\mathsf {OPT}\) from Proposition 4.5.a, we obtain that

$$\begin{aligned} \mathbf {E}\big [ f_\mu (\mathsf {y}_T) \big ]\le & {} (1+\varepsilon )\mathsf {OPT}+ \varepsilon \mathsf {OPT}+ 2\varepsilon \mathsf {OPT}+ \varepsilon f_\mu (\mathsf {y}_0)\\&+ \frac{\mu \beta /12n}{\varepsilon /12n\beta } (2\log (nm)+4)\mathsf {OPT}\\\le & {} (1+9\varepsilon ) \mathsf {OPT}. \end{aligned}$$

This finishes the proof of Theorem 6.6. \(\square \)