
Introduction

Importance sampling in simulation

The usual setup for importance sampling is in Monte Carlo simulation: one wants to compute an integral of the form \(\int _{\mathcal{D}}f(x)p(x)dx,\) where p(x) is a probability density: \(\int _{\mathcal{D}}p(x)dx = 1\). An easy and computationally efficient way to approximate such an integral is to consider the integral as an expectation, \(\mu = \mathbb{E}(f(x)) =\int _{\mathcal{D}}f(x)p(x)dx,\) and approximate the expectation as a sample average,

$$\displaystyle{\int _{\mathcal{D}}f(x)p(x)dx \approx \frac{1} {m}\sum _{i=1}^{m}f(x_{ i}),\quad x_{i} \sim p,}$$

where the random variables x_i are independent and identically distributed. Validity of this approximation is ensured by the law of large numbers, but the number of samples m needed for a given approximation accuracy grows with the variance of the random variable f(x). In particular, if f(x) is nearly zero on its domain \(\mathcal{D}\) except in a region \(A \subset \mathcal{D}\) for which \(\mathbb{P}(x \in A)\) is small, then standard Monte Carlo sampling may fail to produce even one point inside the region A. It is clear intuitively that in this situation, we would benefit from getting some samples from the interesting or important region A. Importance sampling means sampling from a different density q(x) which overweights this region, and rescaling the resulting quantity so that the estimate remains unbiased.

More precisely, if x has probability density p(x), then

$$\displaystyle\begin{array}{rcl} \mu & =& \mathbb{E}[f(x)] =\int _{\mathcal{D}}f(x)p(x)dx \\ & =& \int _{\mathcal{D}}f(x)\frac{p(x)} {q(x)}q(x)dx = \mathbb{E}_{q}[f(x)w(x)],{}\end{array}$$
(1)

where \(w(\cdot ) \equiv \frac{p(\cdot )} {q(\cdot )}\) is the  weighting function. By (1), the estimator

$$\displaystyle{ \widehat{\mu }= \frac{1} {m}\sum _{i=1}^{m}f(x_{ i})w(x_{i}),\quad \quad x_{i} \sim q, }$$
(2)

is also an unbiased estimator for μ. The importance sampling problem then focuses on finding a biasing density q(x) which overweights the important region in a close to "optimal" way, or at least such that the variance of the importance sampling estimator is smaller than the variance of the standard Monte Carlo estimate, so that fewer samples m are required to achieve a prescribed estimation error. In general, the density q with minimal variance \(\sigma _{q{\ast}}^{2}\) is proportional to | f(x) | p(x), which is unknown a priori; still, there are many techniques for estimating or approximating this optimal distribution, see [31, Chapter 9].
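
As a small numerical illustration of (1)–(2), the following Python sketch estimates the tail probability \(\mathbb{P}(x > 4)\) for a standard normal x, so that f is an indicator function and the important region A is the tail; the biasing density q (a normal density shifted into the tail) and all constants are illustrative choices, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000

# Target: mu = P(X > 4) for X ~ N(0,1); here f(x) = 1{x > 4} and p = N(0,1).
f = lambda x: (x > 4.0).astype(float)

# Plain Monte Carlo (2) with q = p: almost every sample misses the "important" region.
x_p = rng.standard_normal(m)
mc_estimate = f(x_p).mean()

# Importance sampling: overweight the tail with the biasing density q = N(4,1),
# and reweight by w(x) = p(x)/q(x) to keep the estimator unbiased, as in (1)-(2).
x_q = rng.normal(loc=4.0, scale=1.0, size=m)
log_w = -0.5 * x_q**2 + 0.5 * (x_q - 4.0) ** 2   # log p(x) - log q(x); constants cancel
is_estimate = np.mean(f(x_q) * np.exp(log_w))

print(mc_estimate, is_estimate)   # the exact value is about 3.17e-5
```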

Importance sampling beyond simulation

In recent times, probabilistic and stochastic algorithms have seen an explosion of growth as we move towards  bigger data problems in  higher dimensions. Indeed, we are often in the situation where at least one of the following is true:

  1. 1.

    Taking measurements is expensive, and we would like to reduce the number of measurements needed to reach a prescribed approximation accuracy.

  2. 2.

    Optimizing over the given data is expensive, and we would like to reduce the number of computations needed to get within a prescribed tolerance of the optimal solution.

Importance sampling has proved to be helpful in both regimes. Whereas in simulation, importance sampling has traditionally been used for approximating  linear estimates such as expectations/integrals, recent applications in signal processing and machine learning have considered importance sampling in approximating or even exactly recovering nonlinear estimates as well.

We consider here three case studies where the principle of importance sampling has been applied; this is by no means a complete list of all applications of importance sampling to machine learning and signal processing problems.

  1. 1.

     Stochastic optimization: Towards minimizing \(F: \mathbb{R}^{n} \rightarrow \mathbb{R}\) of the form \(F(x) =\sum _{ i=1}^{m}f_{i}(x)\) via stochastic gradient descent, one iterates \(x_{k+1} = x_{k} - \frac{\gamma } {w(i_{k})}\nabla f_{i_{k}}(x_{k})\) with i_k randomly chosen from \(\{1,2,\ldots,m\}\) with probability w(i_k), so that

    $$\displaystyle{\mathbb{E}_{i_{k}}[x_{k+1}] = x_{k} -\gamma \sum _{i=1}^{m}\nabla f_{ i}(x_{k});}$$

    that is, one implements a full gradient descent update at each iteration, in expectation. The standard procedure is to sample indices from \(\{1,2,\ldots,m\}\) uniformly, and the resulting convergence rate is limited by the worst-case Lipschitz constant associated with the component gradient functions. If, however, one has prior knowledge about the component Lipschitz constants and has the liberty to draw indices proportionally to the associated Lipschitz constants, then the convergence rate of stochastic gradient descent can be improved so as to depend on the average Lipschitz constant among the components. This is in line with the principle of importance sampling: if ∇f_i has a larger Lipschitz constant, this component contributes more, and should be sampled with higher probability. We review results of this kind in Section “Importance sampling in Stochastic Optimization”.

  2. 2.

     Compressive sensing: Consider an orthonormal matrix \(\varPhi \in \mathbb{R}^{n\times n}\) (or \(\varPhi \in \mathbb{C}^{n\times n}\)), along with a vector \(x \in \mathbb{R}^{n}\). Then clearly

    $$\displaystyle{\varPhi ^{{\ast}}\varPhi x = x;}$$

    moreover, if \(\varphi _{i_{k}} \in \mathbb{R}^{1\times n}\) is a randomly selected row from Φ, drawn such that row i is sampled with probability p(i), then also

    $$\displaystyle{\mathbb{E}_{p}\left [ \frac{1} {p(i_{k})}\left (\varphi _{i_{k}}^{{\ast}}\varphi _{ i_{k}}\right )\right ]x = x.}$$

    Compressive sensing shows that if x is s-sparse, with s ≪ n, then for certain orthonormal Φ, as few as \(m \propto s\log ^{4}(n)\) i.i.d. samples of the form \(\langle \varphi _{i_{k}},x\rangle\) can suffice to exactly recover x as the solution to a convex optimization program. For instance, such results hold if all of the rows of Φ are “equally important” (i.e., Φ has uniformly bounded entries), and if rows are drawn i.i.d. uniformly from Φ. One may also incorporate importance sampling: if rows are drawn i.i.d. proportionately to their squared Euclidean norm, and if the average Euclidean row norm is small, then \(m \propto s\log ^{4}(n)\) i.i.d. samples still suffice for exact reconstruction. For more details, see Section “Importance sampling in compressive sensing”.

  3. 3.

      Low-rank matrix approximations: Consider a matrix \(M \in \mathbb{R}^{n_{1}\times n_{2}}\) of rank \(r\ll \min \{ n_{1},n_{2}\}\), and a subset \(\varOmega \subset [n_{1}] \times [n_{2}]\) of | Ω | = m revealed entries M_{i,j}. If the entries are revealed as i.i.d. draws where Prob[(i,j)] = p_{i,j}, then \(\mathbb{E}\left [ \frac{1} {p_{i,j}}M_{i,j}e_{i}e_{j}^{\top }\right ] = M\). Importance sampling here corresponds to putting more weight p_{i,j} on “important” entries in order to exactly recover M using fewer samples. We will see that if entries are drawn from a weighted distribution based on matrix leverage scores, then \(m \propto \max \{n_{1},n_{2}\}\,r\log ^{2}(n_{1} + n_{2})\) revealed entries suffice for M to be exactly recoverable as the solution to a convex optimization problem.

Importance sampling in Stochastic Optimization

 Gradient descent is a standard method for solving unconstrained optimization problems of the form

$$\displaystyle{ \min _{x\in \mathbb{R}^{n}}F(x); }$$
(3)

gradient descent proceeds as follows: initialize \(x_{0} \in \mathbb{R}^{n}\), and iterate along the direction of the negative gradient of  F (the direction of “steepest descent”) until convergence

$$\displaystyle{ x_{k+1} = x_{k} -\gamma _{k}\nabla F(x_{k}). }$$
(4)

Here γ_k is the step-size, which may change at every iteration. For very large optimization problems, however, even a single full gradient computation of the form ∇F(x_k) can require substantial computational effort, and full gradient descent might not be feasible. This has motivated recent interest in random coordinate descent and stochastic gradient methods (see [3, 28, 29, 35, 36, 40], to name just a few), where one descends along gradient directions which are cheaper to compute. For example, suppose that the function F to be minimized is differentiable and admits a decomposition of the form

$$\displaystyle{ F(x) =\sum _{ i=1}^{m}f_{ i}(x). }$$
(5)

Since \(\nabla F(x) =\sum _{ i=1}^{m}\nabla f_{i}(x)\), a full gradient computation involves computing all m gradients ∇f_i(x); still, one could hope to get close to the minimum, at a much smaller expense, by instead selecting a single index i_k at random from \(\{1,2,\ldots,m\}\) at each iteration. This is the principle behind stochastic gradient descent.

Stochastic Gradient (SG)

Consider the minimization of \(F: \mathbb{R}^{n} \rightarrow \mathbb{R}\) of the form \(F(x) =\sum _{ i=1}^{m}f_{i}(x)\). Choose \(x_{0} \in \mathbb{R}^{n}\). For  k ≥ 1 iterate until convergence criterion is met:

  1. 1.

    Choose i_k ∈ [m] according to the rule \(\mathbf{Prob}[i_{k} = i] = w(i)\), for i ∈ [m]

  2. 2.

    Update \(x_{k+1} = x_{k} -\gamma \frac{1} {w(i_{k})}\nabla f_{i_{k}}(x_{k})\).

We have set the step-size  γ to be constant for simplicity. Note that with the normalization in the update rule,

$$\displaystyle\begin{array}{rcl} \mathbb{E}^{(w)}[x_{ k+1}]& =& x_{k} -\gamma \sum _{i=1}^{m}\nabla f_{ i}(x_{k}) \\ & =& x_{k} -\gamma \nabla F(x_{k}). {}\end{array}$$
(6)

Thus, we might hope for convergence  in expectation of such stochastic iterations to the minimizer of (5) under similar conditions guaranteeing convergence of full gradient descent, namely, when  F is convex (so that every minimizer is a global minimizer) and ∇ F is Lipschitz continuous [30]. That is, we will assume

  1. 1.

     F is convex with convexity parameter μ = μ(F) ≥ 0: for any x and y in \(\mathbb{R}^{n}\) we have

    $$\displaystyle{ F(y) \geq F(x) +\langle \nabla F(x),y - x\rangle + \frac{1} {2}\mu \|y - x\|^{2}. }$$
    (7)

    When  μ > 0 strictly, we say that  F is  μ -strongly convex.

  2. 2.

    The component functions  f  i are continuously differentiable and satisfy

    $$\displaystyle{ \|\nabla f_{i}(x) -\nabla f_{i}(y)\| \leq L_{i}\|y - x\|,\quad \quad i = 1,2,\ldots,m,\quad x,y \in \mathbb{R}^{n}. }$$
    (8)

    We refer to  L  i as the  Lipschitz constant of ∇ f  i .

The default sampling strategy in stochastic gradient methods is to sample uniformly, taking \(w(i) = \frac{1} {m}\) in (5). In cases where the component functions f_i are only observed sequentially or in a streaming fashion, one does not have the freedom to choose a different sampling strategy. But if one does have such freedom, and has prior knowledge about the distribution of the Lipschitz constants L_i associated with the component function gradients, choosing probabilities \(w(i) \propto L_{i}\) can significantly speed up the convergence rate of stochastic gradient. This is in line with the principle of importance sampling: if ∇f_i has a larger Lipschitz constant, it contributes more, and should be sampled with higher probability. We review some results of this kind in more detail below.
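
The following Python sketch compares uniform sampling with sampling proportional to the Lipschitz constants for the SG update above, on a synthetic least-squares problem with very heterogeneous row norms; the problem instance, step sizes, and iteration count are illustrative choices and not prescriptions from the references.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares instance F(x) = sum_i f_i(x) with f_i(x) = 0.5*(<a_i, x> - b_i)^2,
# so grad f_i(x) = (<a_i, x> - b_i) a_i and L_i = ||a_i||_2^2.  Row norms are made very
# heterogeneous so that sup_i L_i >> mean(L_i).
m, n = 500, 20
scales = 10.0 ** rng.uniform(-2, 1, size=(m, 1))
A = scales * rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                                  # consistent system: the "realizable" regime

L = np.sum(A ** 2, axis=1)                      # component Lipschitz constants L_i

def sgd(probs, gamma, iters=5_000):
    # SG update from the algorithm box: draw i_k with Prob[i_k = i] = probs[i],
    # then step along grad f_{i_k}(x) rescaled by 1/probs[i_k] to keep it unbiased.
    x = np.zeros(n)
    for i in rng.choice(m, size=iters, p=probs):
        x -= gamma * (A[i] @ x - b[i]) * A[i] / probs[i]
    return x

# Uniform sampling: the admissible step size is governed by sup_i L_i.
x_unif = sgd(np.full(m, 1.0 / m), gamma=1.0 / (2 * m * L.max()))
# Importance sampling with probabilities proportional to L_i: the admissible step size
# is governed instead by the average Lipschitz constant.
x_imp = sgd(L / L.sum(), gamma=1.0 / (2 * L.sum()))

print(np.linalg.norm(x_unif - x_true), np.linalg.norm(x_imp - x_true))
```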

Stochastic Gradient (SG) with Importance Sampling

For strongly convex functions, a central quantity in the analysis of stochastic descent is the conditioning of the problem, which is, roughly speaking, the ratio of the Lipschitz constant to the parameter of strong convexity. Recall that for a convex quadratic \(F(x) = \frac{1} {2}x^{\top }Hx\), the Lipschitz constant of the gradient is given by the maximal eigenvalue of the Hessian H while the parameter of strong convexity is given by its minimal eigenvalue, and so in this case the conditioning reduces to the condition number of the Hessian matrix. In the general setting where \(F(x) =\sum _{ i=1}^{m}f_{i}(x)\) is strongly convex, the Hessian can vary with x, and the results will depend on the Lipschitz constants L_i of the ∇f_i and not only on that of the aggregate ∇F.

In short: with importance sampling, the convergence rate of stochastic descent is proportional to the  average conditioning \(\overline{L}/\mu = \frac{1} {m}\sum _{i=1}^{m}L_{ i}/\mu\) of the problem; without importance sampling, the convergence rate must depend on the  uniform conditioning \(\sup _{i}L_{i}/\mu\). Thus, importance sampling has the highest potential impact if the Lipschitz constants are highly variable. This is made precise in the following theorem from [26], which in the case of uniform sampling, improves on a previous result of [2].

Theorem 1.

 Let each f_i be convex where ∇f_i has Lipschitz constant L_i , with \(L_{i} \leq \sup L\) , and let \(F(x) = \mathbb{E}f_{i}(x)\) be μ-strongly convex. Set \(\sigma ^{2} = \mathbb{E}\|\nabla f_{i}(x_{{\ast}})\|^{2}\) , where \(x_{{\ast}} =\arg \min _{x}F(x)\) . Suppose that \(\gamma \leq \frac{1} {\mu }\) . Then the SG iterates in (5) satisfy:

$$\displaystyle{ \mathbb{E}\|x_{k} - x_{{\ast}}\|^{2} \leq \left [1 - 2\gamma \mu \big(1 -\gamma \sup L\big)\right ]^{k}\|x_{ 0} - x_{{\ast}}\|^{2} + \frac{\gamma \sigma ^{2}} {\mu \big(1 -\gamma \sup L\big)}. }$$
(9)

 where the expectation is with respect to the sampling of {i k  } in (5)  .

The parameter \(\sigma ^{2}\) should be thought of as a ‘residual’ parameter measuring the extent to which the component functions  f  i share a common minimizer. As a corollary of Theorem 1, if one pre-specifies a target accuracy \(\varepsilon > 0\), then the optimal step-size \(\gamma ^{{\ast}} =\gamma ^{{\ast}}(\varepsilon,\mu,\sigma ^{2},\sup L)\) is such that

$$\displaystyle{ k = 2\log (\varepsilon _{0}/\varepsilon )\left (\frac{\sup L} {\mu } + \frac{\sigma ^{2}} {\mu ^{2}\varepsilon }\right ) }$$
(10)

SG iterations suffice so that \(\mathbb{E}\|x_{k} - x_{{\ast}}\|_{2}^{2} \leq \varepsilon\). See [26] for more details.

To see what this result implies for importance sampling, consider the stochastic gradient algorithm (5) with weights w(i). Then, when the expectation is taken with respect to the sampling of { i_k }, we have \(F(x) = \mathbb{E}f_{i}^{(w)}(x)\) where \(f_{i}^{(w)} = \frac{1} {w(i)} f_{i}\) has Lipschitz constant \(L_{i}^{(w)} = \frac{1} {w(i)}L_{i}\). The supremum of \(L_{i}^{(w)}\) is then given by:

$$\displaystyle{ \sup L_{(w)} =\sup _{i}L_{i}^{(w)} =\sup _{ i} \frac{L_{i}} {w(i)}. }$$
(11)

It is easy to verify that (11) is minimized by the weights

$$\displaystyle{ w(i) = \frac{L_{i}} {\overline{L}},\quad \mbox{ so that}\quad \sup L_{(w)} =\sup _{i} \frac{L_{i}} {L_{i}/\overline{L}} = \overline{L}. }$$
(12)

Since μ is invariant to the choice of weights, we find that in the “realizable” regime where \(\sigma ^{2} = 0\), and hence \(\sigma _{(w)}^{2} = 0\), choosing the weights w(i) as in (12) gives linear convergence with a linear dependence on the average conditioning \(\overline{L}/\mu\), and a number of iterations,

$$\displaystyle{k^{(w)} \propto \log (1/\varepsilon )\overline{L}/\mu,}$$

to achieve a target accuracy \(\varepsilon\). This strictly improves over the best possible results with uniform sampling, where the linear dependence is on the uniform conditioning \(\sup L/\mu\) (see [26] for more details).

However, when \(\sigma ^{2} > 0\), we get a potentially much  worse scaling of the second term, by a factor of \(\overline{L}/\inf L\):

$$\displaystyle\begin{array}{rcl} \sigma _{(w)}^{2}& =& \mathbb{E}^{(w)}[\|\nabla f_{ i}^{(w)}(x_{{\ast}})\|_{ 2}^{2}] \leq \frac{\overline{L}} {\inf L}\sigma ^{2}.{}\end{array}$$
(13)

Fortunately, we can easily overcome this factor by sampling from a mixture of the uniform and the fully weighted sampling distributions, referred to as partially biased sampling. Using the weights

$$\displaystyle{w(i) = \frac{1} {2} \frac{L_{i}} {\overline{L}} + \frac{1} {2},}$$

we have

$$\displaystyle{ \sup L_{(w)} =\sup _{i} \frac{1} {\frac{1} {2} + \frac{1} {2} \cdot \frac{L_{i}} {\overline{L}} }L_{i} \leq 2\overline{L} }$$
(14)

and

$$\displaystyle{ \sigma _{(w)}^{2} = \mathbb{E}\left [ \frac{1} {\frac{1} {2} + \frac{1} {2} \cdot \frac{L_{i}} {\overline{L}} }\|\nabla f_{i}(x_{{\ast}})\|_{2}^{2}\right ] \leq 2\sigma ^{2}. }$$
(15)

In this sense, under the assumptions of Theorem 1, partially biased sampling will never be worse in terms of convergence rate than uniform sampling, up to a factor of 2, but can potentially have much better convergence.

Remark 1.

An important example where all of these parameters have explicit forms is the  least squares problem, where

$$\displaystyle\begin{array}{rcl} F(x)& =& \frac{1} {2}\|Ax - b\|_{2}^{2} = \frac{1} {2}\sum _{i=1}^{m}(\langle a_{ i},x\rangle - b_{i})^{2},{}\end{array}$$
(16)

with b an m-dimensional vector, A an m × n matrix with rows a_i, and \(x_{{\ast}} = \mbox{ arg}\min _{x}\frac{1} {2}\|Ax - b\|_{2}^{2}\) the least-squares solution. The Lipschitz constants of the components \(f_{i} = \frac{m} {2} (\langle a_{i},x\rangle - b_{i})^{2}\) are \(L_{i} = m\|a_{i}\|_{2}^{2}\), and the average Lipschitz constant is \(\frac{1} {m}\sum _{i}L_{i} =\| A\|_{F}^{2}\), where \(\|\cdot \|_{F}\) denotes the Frobenius norm. If A is full-rank and overdetermined, then F is strongly convex with strong convexity parameter \(\mu =\| (A^{T}A)^{-1}\|_{2}^{-1}\), so that the average condition number is \(\overline{L}/\mu \! =\!\| A\|_{F}^{2}\|(A^{T}A)^{-1}\|_{2}.\) Moreover, the residual is \(\sigma ^{2} = m\sum _{i}\|a_{i}\|^{2}\vert \langle a_{i},x_{{\ast}}\rangle - b_{i}\vert ^{2}\). Observe the bounds \(\sigma ^{2} \leq m\|A\|_{F}^{2}\sup _{i}\vert \langle a_{i},x_{{\ast}}\rangle - b_{i}\vert ^{2}\) and \(\sigma ^{2} \leq m\sup _{i}\|a_{i}\|^{2}\|Ax_{{\ast}}- b\|_{2}^{2}.\)
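
The quantities in Remark 1 are straightforward to compute numerically; the following short sketch does so for a randomly generated least-squares instance, and also forms partially biased sampling weights expressed as probabilities over the indices. The instance, noise level, and the probability (rather than mean-one) normalization of the weights are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 10
A = 10.0 ** rng.uniform(-1, 1, size=(m, 1)) * rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)   # slightly inconsistent system

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = A @ x_star - b

# Quantities from Remark 1 (convention F(x) = E f_i(x), f_i(x) = (m/2)(<a_i,x> - b_i)^2).
L = m * np.sum(A ** 2, axis=1)                        # L_i = m ||a_i||_2^2
L_bar = L.mean()                                      # equals ||A||_F^2
mu = 1.0 / np.linalg.norm(np.linalg.inv(A.T @ A), 2)  # strong convexity parameter
sigma2 = m * np.sum(np.sum(A ** 2, axis=1) * residual ** 2)   # residual parameter sigma^2

print("uniform conditioning  sup L / mu :", L.max() / mu)
print("average conditioning  L_bar / mu :", L_bar / mu)

# Partially biased sampling: an equal mixture of uniform and fully weighted sampling.
# As probabilities over the indices, w(i) = 1/(2m) + L_i / (2 sum_j L_j).
w = 0.5 / m + 0.5 * L / L.sum()
print("weights sum to one:", np.isclose(w.sum(), 1.0), "  residual sigma^2:", sigma2)
```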

Importance Sampling for SG in other regimes

Theorem 1 is stated for smooth and strongly convex objectives, and is particularly useful in the regime where the residual \(\sigma ^{2}\) is low, and the linear convergence term is dominant. But importance sampling can be incorporated into SG methods also in other regimes, and we now briefly survey some of these possibilities.

Smooth, Not Strongly Convex

When each component  f  i is convex, non-negative, and has an  L  i -Lipschitz gradient, but the objective  Fx) is not necessarily strongly convex, then after

$$\displaystyle{ k = O\left (\frac{(\sup L)\|x_{{\ast}}\|_{2}^{2}} {\varepsilon } \cdot \frac{F(x_{{\ast}})+\varepsilon } {\varepsilon } \right ) }$$
(17)

iterations of SGD with an appropriately chosen step-size, we will have \(F(\overline{x}) \leq F(x_{{\ast}})+\varepsilon\), where \(\overline{x}\) is an appropriate averaging of the k iterates [43]. The relevant quantity determining the iteration complexity is again \(\sup L\). Furthermore, the dependence on the supremum is unavoidable and cannot be replaced with the average Lipschitz constant \(\overline{L}\) [43]: if we sample gradients according to the uniform distribution, we must have a linear dependence on \(\sup L\).

The only quantity in (17) that changes with a re-weighting is \(\sup L\)—all other quantities ( ∥  x  ∥ 2 2,  Fx ), and the sub-optimality \(\varepsilon\)) are invariant to re-weightings. We can therefore replace the dependence on \(\sup L\) with a dependence on \(\sup L_{(w)}\) by using a weighted SGD as in (12). As we already calculated, the optimal weights are given by (12), and using them we have \(\sup L_{(w)} = \overline{L}\). In this case, there is no need for partially biased sampling and we obtain that

$$\displaystyle{ k = O\left (\frac{\overline{L}\|x_{{\ast}}\|_{2}^{2}} {\varepsilon } \cdot \frac{F(x_{{\ast}})+\varepsilon } {\varepsilon } \right ) }$$
(18)

iterations of weighted SGD updates (5) using the weights (12) suffice.

Non-Smooth Objectives

We now turn to non-smooth objectives, where the components  f  i might not be smooth, but each component is  G  i -Lipschitz. Roughly speaking,  G  i is a bound on the first derivative (gradient) of  f  i , while  L  i is a bound on the second derivatives of  f  i . Here, the performance of SGD depends on the second moment \(\overline{G^{2}} = \mathbb{E}[G_{i}^{2}]\). The precise iteration complexity depends on whether the objective is strongly convex or whether  x is bounded, but in either case depends linearly on \(\overline{G^{2}}\).

Using weighted SGD, we get linear dependence on:

$$\displaystyle{ \overline{G_{(w)}^{2}} = \mathbb{E}^{(w)}\left [(G_{ i}^{(w)})^{2}\right ] = \mathbb{E}^{(w)}\left [ \frac{G_{i}^{2}} {w(i)^{2}}\right ] = \mathbb{E}\left [ \frac{G_{i}^{2}} {w(i)}\right ], }$$
(19)

where G_i^{(w)} = G_i ∕ w(i) is the Lipschitz constant of the scaled f_i^{(w)}. This is minimized by the weights \(w(i) = G_{i}/\overline{G}\), where \(\overline{G} = \mathbb{E}[G_{i}]\), yielding \(\overline{G_{(w)}^{2}} = \overline{G}^{2}\). Using importance sampling, we reduce the dependence on \(\overline{G^{2}}\) to a dependence on \(\overline{G}^{2}\). It is helpful to recall that \(\overline{G^{2}} = \overline{G}^{2} + \mbox{ Var}[G_{i}]\). What we save is thus exactly the variance of the Lipschitz constants G_i. For more details, see [46].
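
The identity \(\overline{G^{2}} - \overline{G}^{2} = \mbox{ Var}[G_{i}]\) and the effect of the optimal weights are easy to check numerically, as in the following sketch; the constants G_i are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
G = 10.0 ** rng.uniform(-2, 1, size=1000)     # hypothetical component Lipschitz constants G_i

G2_bar = np.mean(G ** 2)                      # uniform sampling: E[G_i^2] enters the rate
w = G / G.mean()                              # optimal weights w(i) = G_i / G_bar
G2_w = np.mean(G ** 2 / w)                    # weighted sampling: E[G_i^2 / w(i)] = (E[G_i])^2

print(G2_bar, G2_w, G.mean() ** 2, np.var(G)) # G2_bar - G2_w equals Var[G_i]
```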

Importance sampling in random coordinate descent

A related stochastic optimization problem is  randomized coordinate descent, where one minimizes \(F: \mathbb{R}^{n} \rightarrow \mathbb{R}\), not necessarily having the form \(F(x) =\sum _{ i=1}^{m}f_{i}(x)\), but still assumed to be strongly convex, by decomposing its gradient (5) into its  coordinate directions

$$\displaystyle{\nabla F(x) =\sum _{ i=1}^{n}\nabla _{ i}F(x)}$$

and performing the stochastic updates:

  1. 1.

    Choose coordinate i_k ∈ [n] according to the rule \(\mathbf{Prob}[i_{k} = i] = w(i)\), for i ∈ [n]

  2. 2.

    Update \(x_{k+1} = x_{k} -\gamma \frac{1} {w(i_{k})}\nabla _{i_{k}}F(x_{k})\).

The motivation is that computing a coordinate directional derivative can be much simpler than computing either the function value or a directional derivative along an arbitrary direction.

Actually, Theorem 1 can also be applied to this setting; its proof from [26] uses only that

$$\displaystyle{ \nabla F(x) = \mathbb{E}[\nabla f_{i}(x)], }$$
(20)

and the fact that, for any \(x,y \in \mathbb{R}^{n}\),

$$\displaystyle{ \|\nabla f_{i}(x) -\nabla f_{i}(y)\|_{2}^{2} \leq L_{ i}\langle x - y,\nabla f_{i}(x) -\nabla f_{i}(y)\rangle. }$$
(21)

This inequality follows from the assumption that f_i is smooth with Lipschitz continuous gradient, by the so-called co-coercivity lemma; see [26, Lemma A.1]. Note that (20) still holds in the setting of randomized coordinate descent, and (21) holds if \(F: \mathbb{R}^{n} \rightarrow \mathbb{R}\) has component-wise Lipschitz continuous gradient:

$$\displaystyle{ \left \vert \nabla _{i}F(x + he_{i}) -\nabla _{i}F(x)\right \vert \leq L_{i}\vert h\vert,\quad x \in \mathbb{R}^{n},h \in \mathbb{R},i \in [n]. }$$
(22)

Under these assumptions, one may consider importance sampling for random coordinate descent with weights \(w(i) = L_{i}/\sum _{j}L_{j}\); Theorem 1 then gives a linear convergence rate depending on \(\overline{L}/\mu\) as opposed to \(\sup L/\mu\). This is because coordinate descent falls into the realizable regime, as ∇_iF(x_*) = 0 for each i, and hence also \(\sigma ^{2} = \mathbb{E}\|\nabla _{i}F(x_{{\ast}})\|^{2} = 0\). Coordinate descent with importance sampling was considered before SG with importance sampling, originating in the works of [29] and [35]. One may also consider the extension of randomized coordinate descent to randomized block coordinate descent, descending in blocks of coordinates at a time; then the important Lipschitz constants are those associated with the partial gradients of F, as opposed to the component-wise gradients [29].
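
As a small illustration, the following sketch runs the randomized coordinate descent scheme above on a synthetic strongly convex quadratic, comparing uniform coordinate sampling with sampling proportional to the coordinate Lipschitz constants L_i = H_{ii}; the instance, step sizes, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

# Strongly convex quadratic F(x) = 0.5 x^T H x - c^T x, for which the coordinate-wise
# Lipschitz constants in (22) are simply L_i = H_ii.
B = rng.standard_normal((n, n)) * 10.0 ** rng.uniform(-1, 0.5, size=(1, n))
H = B.T @ B + np.eye(n)
c = rng.standard_normal(n)
x_star = np.linalg.solve(H, c)

L = np.diag(H).copy()                 # coordinate Lipschitz constants

def coordinate_descent(probs, gamma, iters=50_000):
    # Draw coordinate i with Prob[i_k = i] = probs[i] and take the step
    # x <- x - gamma * (1/probs[i]) * grad_i F(x) * e_i, as in the scheme above.
    x = np.zeros(n)
    for i in rng.choice(n, size=iters, p=probs):
        x[i] -= gamma * (H[i] @ x - c[i]) / probs[i]
    return x

x_unif = coordinate_descent(np.full(n, 1.0 / n), gamma=1.0 / (n * L.max()))
x_imp  = coordinate_descent(L / L.sum(),         gamma=1.0 / L.sum())
print(np.linalg.norm(x_unif - x_star), np.linalg.norm(x_imp - x_star))
```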

Notes and extensions

Several aspects of importance sampling in stochastic optimization were not covered here, but we point out further results and references.

  1. 1.

    If the Lipschitz constants are not known a priori, then one could still consider doing importance sampling via  rejection sampling, simulating sampling from the weighted distribution; this can be done by accepting samples with probability proportional to \(L_{i}/\sup _{j}L_{j}\). The overall probability of accepting a sample is then \(\overline{L}/\sup L_{i}\), introducing an additional factor of \(\sup L_{i}/\overline{L}\), and thus again obtaining a linear dependence on \(\sup L_{i}\). Thus, if we are only presented with samples from the uniform distribution, and the cost of obtaining the sample dominates the cost of taking the gradient step, we do not gain (but do not lose much either) from rejection sampling. We might still gain from rejection sampling if the cost of operating on a sample (calculating the actual gradient and taking a step according to it) dominates the cost of obtaining it and (a bound on) the Lipschitz constant.

  2. 2.

    All of the convergence results we stated in this section were with respect to the expected value. Nevertheless, all these rates extend to high probability results using Chebyshev’s inequality. See [29] for more details.

  3. 3.

    Recently, several  hybrid full-gradient/stochastic gradient methods have emerged which, as opposed to pure SG as in (5), have the advantage of progressively reducing the variance of the stochastic gradient with the iterations [19, 37, 41, 42], thus allowing convergence to the true minimizer. These algorithms can further be applied to the more general class of composite problems,

    $$\displaystyle{ \mbox{ minimize}_{x\in \mathbb{R}^{n}}\left \{P(x) = F(x) + R(x)\right \}, }$$
    (23)

    where F(x) is the average of many smooth component functions f_i(x) whose gradients have Lipschitz constants L_i as in (5), and R(x) is relatively simple but can be non-differentiable. These algorithms have the added complexity of requiring at least one full pass over the data, all having complexity \(O((m +\sup L/\mu )\log (1/\varepsilon ))\).

    As shown in [45], importance sampling can also be applied in this more general setting to speed up convergence: sampling component functions proportionally to their Lipschitz constants, this complexity bound becomes \(O((m + \overline{L}/\mu )\log (1/\varepsilon ))\).

  4. 4.

    An observation that is important not only for this chapter but also for the entire discussion on importance sampling concerns the computational cost of implementing a random counter: given values \(L_{1},L_{2},\ldots,L_{m}\), efficiently generate random integers \(i\in \{ 1,2,\ldots,m\}\) with probabilities

    $$\displaystyle{ \mathbf{Prob}[i = k] = \frac{L_{k}} {\sum _{j=1}^{m}L_{j}},\quad k = 1,2,\ldots,m. }$$
    (24)

    Using a tree search algorithm [29], such a counter can be implemented with \(O(\log (m))\) operations per draw, using a single uniform random number.
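
A counter with the same O(log m) cost per draw can also be sketched with cumulative sums and binary search, as below; unlike the tree-based counter of [29], this simple variant requires O(m) work to rebuild if the values L_i change. The values L_i are hypothetical.

```python
import bisect
import random
from itertools import accumulate

def make_counter(L):
    """Build a sampler for Prob[i = k] = L[k] / sum(L) using cumulative sums.

    Preprocessing is O(m); each draw needs one uniform random number and a
    binary search, i.e. O(log m) operations, in the spirit of the tree-based
    counter of [29].
    """
    cumulative = list(accumulate(L))          # cumulative[k] = L[0] + ... + L[k]
    total = cumulative[-1]

    def draw():
        u = random.uniform(0.0, total)
        return bisect.bisect_left(cumulative, u)

    return draw

# Hypothetical Lipschitz constants; draw a few indices proportionally to them.
L = [3.0, 1.0, 0.5, 10.0, 0.25]
draw = make_counter(L)
print([draw() for _ in range(10)])
```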

Importance sampling in compressive sensing

Introduction

The emerging area of mathematical signal processing known as  compressive sensing is based on the observation that a signal which allows for an approximately sparse representation in a suitable basis or dictionary can be recovered from relatively few linear measurements via convex optimization, provided these measurements are sufficiently  incoherent with the basis in which the signal is sparse [8, 10, 38]. In this section we will see how importance sampling can be used to enhance the incoherence between measurements and signal basis, again, allowing for recovery from fewer linear measurements.

We illustrate the power of importance sampling through two examples: compressed sensing imaging and polynomial interpolation. In compressed sensing imaging, coherence-based sampling provides a theoretical justification for empirical studies [23, 24] pointing to variable-density sampling strategies for improved MRI compressive imaging. In polynomial interpolation, coherence-based sampling implies that sampling points drawn from the Chebyshev distribution are better suited for the recovery of polynomials and smooth functions than uniformly distributed sampling points, aligning with classical results on Lagrange interpolation [5].

Before continuing, let us fix some notation. A vector \(x \in \mathbb{C}^{N}\) is called s-sparse if \(\|x\|_{0} = \#\{j: x_{j}\neq 0\} \leq s\), and the best s-term approximation of a vector \(x \in \mathbb{C}^{N}\) is the s-sparse vector \(x_{s} \in \mathbb{C}^{N}\) satisfying \(\|x - x_{s}\|_{p} =\min _{u:\|u\|_{0}\leq s}\|x - u\|_{p}\). Clearly, x_s = x if x is s-sparse. Informally, x is called compressible if \(\|x - x_{s}\|\) decays quickly as s increases.

Incoherence in compressive sensing

Here we recall sparse recovery results for structured random sampling schemes corresponding to  bounded orthonormal systems, of which the partial discrete Fourier transform is a special case. We refer the reader to [15] for an expository article including many references.

Definition 1 (Bounded orthonormal system (BOS)).

Let \(\mathcal{D}\) be a measurable subset of \(\mathbb{R}^{d}\).

  • A set of functions \(\{\psi _{j}: \mathcal{D}\rightarrow \mathbb{C},\quad j \in [N]\}\) is called an  orthonormal system with respect to the probability measure  ν if \(\int _{\mathcal{D}}\bar{\psi }_{j}(u)\psi _{k}(u)d\nu (u) =\delta _{jk}\), where  δ  jk denotes the Kronecker delta.

  • Let  μ be a probability measure on \(\mathcal{D}\). A  random sample of the orthonormal system { ψ  j } is the random vector \((\psi _{1}(T),\mathop{\ldots },\psi _{N}(T))\) that results from drawing a sampling point  T from the measure  μ.

  • An orthonormal system is said to be  bounded with bound  K if \(\sup _{j\in [N]}\|\psi _{j}\|_{\infty }\leq K\).

Suppose now that we have an orthonormal system { ψ  j }  j ∈ [ N] and  m random sampling points \(T_{1},T_{2},\ldots,T_{m}\) drawn independently from some probability measure  μ. Here and throughout, we assume that the number of sampling points  m ≪  N. As shown in [15], if the system { ψ  j } is  bounded, and if the probability measure  μ from which we sample points is the orthogonalization measure  ν associated with the system, then the (underdetermined) structured random matrix \(A: \mathbb{C}^{N} \rightarrow \mathbb{C}^{m}\) whose rows are the independent random samples will be well conditioned, satisfying the so-called  restricted isometry property [11] with nearly order-optimal restricted isometry constants with high probability. Consequently, matrices associated with random samples of bounded orthonormal systems have nice sparse recovery properties.

Proposition 1 (Sparse recovery through BOS).

 Consider the matrix \(A \in \mathbb{C}^{m\times N}\)  whose rows are independent random samples of an orthonormal system {ψ j  , j ∈ [N]} with bound \(\sup _{j\in [N]}\|\psi _{j}\|_{\infty } \leq K\)  , drawn from the orthogonalization measure ν associated with the system. If the number of random samples satisfies

$$\displaystyle{ m \gtrsim K^{2}s\log ^{3}(s)\log (N), }$$
(25)

 for some \(s \gtrsim \log (N)\)  , then the following holds with probability exceeding \(1 - N^{-C\log ^{3}(s) }:\)  For each \(x \in \mathbb{C}^{N}\)  , given noisy measurements \(y = Ax + \sqrt{m}\eta\)  with \(\|\eta \|_{2} \leq \varepsilon\)  , the approximation

$$\displaystyle{x^{\#} = \mbox{ arg}\min _{ z\in \mathbb{C}^{N}}\|z\|_{1}\mbox{ subject to }\|Az - y\|_{2} \leq \sqrt{m}\epsilon }$$

 satisfies the error guarantee \(\|x - x^{\#}\|_{2} \lesssim \frac{1} {\sqrt{s}}\|x - x_{s}\|_{1} +\varepsilon.\)

An important special case of such a matrix construction is the subsampled discrete Fourier matrix, constructed by sampling m ≪ N rows uniformly at random from the unitary discrete Fourier matrix \(\varPsi \in \mathbb{C}^{N\times N}\) with entries \(\psi _{j,k} = \frac{1} {\sqrt{N}}e^{2\pi i(j-1)(k-1)/N}\). Indeed, the system of complex exponentials \(\psi _{j}(u) = e^{2\pi i(j-1)u}\), j ∈ [N], is orthonormal with respect to the uniform measure over the discrete set \(\mathcal{D} =\{ 0, \frac{1} {N},\ldots, \frac{N-1} {N} \}\), and is bounded with the optimally small constant K = 1. In the discrete setting, we may speak of a more general procedure for forming matrix constructions adhering to the conditions of Proposition 1: given any two unitary matrices Φ and Ψ, the composite matrix \(\varPsi ^{{\ast}}\varPhi\) is also unitary, and this composite matrix (rescaled by \(\sqrt{N}\)) will have uniformly bounded entries if the orthonormal bases (ϕ_j) and (ψ_k), given by the rows of Φ and Ψ, respectively, are mutually incoherent:

$$\displaystyle\begin{array}{rcl} \mu (\varPhi,\varPsi )&:= \sqrt{N}\sup _{1\leq j,k\leq N}\vert \langle \phi _{j},\psi _{k}\rangle \vert \leq K.&{}\end{array}$$
(26)

Indeed, if Φ and Ψ are mutually incoherent, then the rows of \(B = \sqrt{N}\varPsi ^{{\ast}}\varPhi\) constitute a bounded orthonormal system with respect to the uniform measure on \(\mathcal{D} =\{ 0, \frac{1} {N},\ldots, \frac{N-1} {N} \}\). Proposition 1 then implies a sampling strategy for reconstructing signals \(x \in \mathbb{C}^{N}\) with assumed sparse representation in the basis Ψ, that is, x = Ψb with b ≈ b_s (the s-sparse vector corresponding to its best s-term approximation), from a few linear measurements: form a sensing matrix \(A \in \mathbb{C}^{m\times N}\) by sampling rows i.i.d. uniformly from the incoherent basis Φ, collect noisy measurements \(y = Ax +\eta\) with \(\|\eta \|_{2} \leq \sqrt{m}\epsilon\), and solve the ℓ1 minimization program,

$$\displaystyle{x^{\#} = \mbox{ arg}\min _{ z\in \mathbb{C}^{N}}\|\varPsi ^{{\ast}}z\|_{ 1}\mbox{ subject to }\|Az - y\|_{2} \leq \sqrt{m}\epsilon.}$$

This scenario is referred to as  incoherent sampling.
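
As a small numerical illustration of the mutual incoherence quantity (26), the following sketch computes μ(Φ,Ψ) for a few pairs of bases; the choice of bases is illustrative.

```python
import numpy as np

def mutual_coherence(Phi, Psi):
    # mu(Phi, Psi) = sqrt(N) * max_{j,k} |<phi_j, psi_k>|, as in (26),
    # where the rows of Phi and Psi are the two orthonormal bases.
    N = Phi.shape[0]
    return np.sqrt(N) * np.abs(Phi @ Psi.conj().T).max()

N = 256
F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT: rows are the Fourier basis
I = np.eye(N)                                 # canonical (identity) basis
Q, _ = np.linalg.qr(np.random.default_rng(5).standard_normal((N, N)))  # a "generic" basis

print(mutual_coherence(F, I))                 # = 1: maximally incoherent pair
print(mutual_coherence(Q, I))                 # roughly sqrt(2 log N): still fairly incoherent
print(mutual_coherence(I, I))                 # = sqrt(N): maximally coherent pair
```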

Importance sampling via local coherences

Consider more generally the setting where we aim to compressively sense signals \(x \in \mathbb{C}^{N}\) with assumed sparse representation in the orthonormal basis \(\varPsi \in \mathbb{C}^{N\times N}\), but our sensing matrix \(A \in \mathbb{C}^{m\times N}\) can only consist of rows from some fixed orthonormal basis \(\varPhi \in \mathbb{C}^{N\times N}\) that is not necessarily incoherent with Ψ. In this setting, we ask: Given a fixed sensing basis Φ and sparsity basis Ψ, how should we sample rows of Φ in order to make the resulting system as incoherent as possible? We will answer this question by introducing the concept of local coherence between two bases as described in [21, 32], whereby in the discrete setting the coherences of individual elements of the sensing basis are calculated and used to derive the sampling strategy.

The following result quantifies how regions of the sensing basis that are more coherent with the sparsity basis should be sampled with higher density: they should be given more “importance”. It is essentially a generalization of Theorem 2.1 in [32]; for completeness, we include a short self-contained proof.

Theorem 2 (Sparse recovery via local coherence sampling).

 Consider a measurable set \(\mathcal{D}\)  and a system {ψ j  , j ∈ [N]} that is orthonormal with respect to a measure ν on \(\mathcal{D}\)  which has square-integrable local coherence,

$$\displaystyle{ \sup _{j\in [N]}\vert \psi _{j}(u)\vert \leq \kappa (u),\quad \quad \int _{u\in \mathcal{D}}\vert \kappa (u)\vert ^{2}\nu (u)du = B. }$$
(27)

 We can define the probability measure \(\mu (u) = \frac{1} {B}\kappa ^{2}(u)\nu (u)\)  on \(\mathcal{D}\)  . Draw m sampling points \(T_{1},T_{2},\ldots,T_{m}\)  independently from the measure μ, and consider the matrix \(A \in \mathbb{C}^{m\times N}\)  whose k-th row is the random sample \((\psi _{j}(T_{k}))_{j\in [N]}\). Consider also the diagonal preconditioning matrix \(\mathcal{P}\in \mathbb{C}^{m\times m}\)  with entries \(p_{k,k} = \sqrt{B}/\kappa (T_{k})\). If the number of sampling points

$$\displaystyle{ m \gtrsim B^{2}s\log ^{3}(s)\log (N), }$$
(28)

 for some \(s \gtrsim \log (N)\)  , then the following holds with probability exceeding \(1 - N^{-C\log ^{3}(s) }.\)

 For each \(x \in \mathbb{C}^{N}\)  , given noisy measurements \(y = Ax + \sqrt{m}\eta\)  with \(\|\mathcal{P}\eta \|_{2} \leq \sqrt{m}\varepsilon\)  , the approximation

$$\displaystyle{x^{\#} = \mbox{ arg}\min _{ z\in \mathbb{C}^{N}}\|z\|_{1}\mbox{ subject to }\|\mathcal{P}Az -\mathcal{P}y\|_{2} \leq \sqrt{m}\epsilon }$$

 satisfies the error guarantee

$$\displaystyle{\|x - x^{\#}\|_{ 2} \lesssim \frac{1} {\sqrt{s}}\|x - x_{s}\|_{1} +\varepsilon.}$$

The proof is a simple change-of-measure argument along the lines of the standard importance sampling principle:

Proof.

Consider the functions \(Q_{j}(u) = \frac{\sqrt{B}} {\kappa (u)} \psi _{j}(u)\). The system { Q  j } is bounded with \(\sup _{j\in [N]}\|Q_{j}\|_{\infty }\leq \sqrt{B}\), and this system is orthonormal on \(\mathcal{D}\) with respect to the sampling measure  μ:

$$\displaystyle\begin{array}{rcl} & & \int _{u\in \mathcal{D}}\bar{Q}_{j}(u)Q_{k}(u)\mu (u)du \\ & & =\int _{u\in \mathcal{D}}\left ( \frac{\sqrt{B}} {\kappa (u)}\bar{\psi }_{j}(u)\right )\left ( \frac{\sqrt{B}} {\kappa (u)}\psi _{k}(u)\right )\left ( \frac{1} {B}\kappa ^{2}(u)\nu (u)\right )du \\ & & =\int _{u\in \mathcal{D}}\bar{\psi }_{j}(u)\psi _{k}(u)\nu (u)du =\delta _{jk}. {}\end{array}$$
(29)

Thus we may apply Proposition 1 to the system { Q  j }, noting that the matrix of random samples of the system { Q  j } may be written as \(\mathcal{P}A\).

In the discrete setting where \(\{\phi _{j}\}_{j\in [N]}\) and \(\{\psi _{k}\}_{k\in [N]}\) are the rows of unitary matrices Φ and Ψ, and ν is the uniform measure over the set \(\mathcal{D} =\{ 0, \frac{1} {N},\ldots, \frac{N-1} {N} \}\), the integral in condition (27) reduces to a sum,

$$\displaystyle{ \sup _{k\in [N]}\sqrt{N}\vert \langle \phi _{j},\psi _{k}\rangle \vert \leq \kappa _{j},\quad \frac{1} {N}\sum _{j=1}^{N}\kappa _{ j}^{2} = B. }$$
(30)

This motivates the introduction of the local coherence of an orthonormal basis { ϕ  j }  j = 1  N of \(\mathbb{C}^{N}\) with respect to the orthonormal basis { ψ  k }  k = 1  N of \(\mathbb{C}^{N}\):

Definition 2.

The local coherence of an orthonormal basis { ϕ_j }_{j=1}^N of \(\mathbb{C}^{N}\) with respect to the orthonormal basis { ψ_k }_{k=1}^N of \(\mathbb{C}^{N}\) is the function \(\mu ^{\mathrm{loc}} = (\mu _{j}) \in \mathbb{R}^{N}\) defined coordinate-wise by

$$\displaystyle{\mu _{j} =\sup \limits _{1\leq k\leq N}\sqrt{N}\vert \langle \varphi _{j},\psi _{k}\rangle \vert.}$$

We have the following corollary of Theorem 2.

Corollary 1.

 Consider a pair of orthonormal bases (Φ,Ψ) with local coherences bounded by μ_j ≤ κ_j . Let s ≥ 1, and suppose that

$$\displaystyle{m \gtrsim s\left ( \frac{1} {N}\sum _{j=1}^{N}\kappa _{ j}^{2}\right )\log ^{4}(N).}$$

 Select m (possibly not distinct) rows of Φ independently and identically distributed from the multinomial distribution on \(\{1,2,\ldots,N\}\)  with weights \(c\kappa _{j}^{2}\) (where c normalizes the weights to sum to one) to form the sensing matrix \(A: \mathbb{C}^{N} \rightarrow \mathbb{C}^{m}\)  . Consider also the diagonal preconditioning matrix \(\mathcal{P}\in \mathbb{C}^{m\times m}\)  with entries \(p_{k,k} = \frac{1} {\sqrt{c}\,\kappa _{j_{k}}}\)  , where j_k is the index of the k-th selected row. Then the following holds with probability exceeding \(1 - N^{-C\log ^{3}(s) }:\)  For each \(x \in \mathbb{C}^{N}\)  , given measurements y = Ax + η, with \(\|\mathcal{P}\eta \|_{2} \leq \sqrt{m}\varepsilon\)  , the approximation

$$\displaystyle{x^{\#} = \mbox{ arg}\min _{ u\in \mathbb{C}^{N}}\|\varPsi ^{{\ast}}u\|_{ 1}\mbox{ subject to }\|\mathcal{P}y -\mathcal{P}Au\|_{2} \leq \sqrt{m}\varepsilon }$$

 satisfies the error guarantee \(\|x - x^{\#}\|_{2} \lesssim \frac{1} {\sqrt{s}}\|\varPsi ^{{\ast}}x - (\varPsi ^{{\ast}}x)_{s}\|_{1} +\varepsilon.\)
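
The following sketch illustrates the local coherence computation of Definition 2 and the sampling and preconditioning prescribed by Corollary 1, using the one-dimensional unitary DFT as the sensing basis and the one-dimensional Haar wavelet basis as the sparsity basis; the dimensions and number of samples are illustrative and far too small for the theory to apply literally.

```python
import numpy as np

def haar_matrix(N):
    # Orthonormal 1D Haar wavelet basis (rows), for N a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < N:
        n = H.shape[0]
        H = np.vstack([np.kron(H, [1.0, 1.0]),
                       np.kron(np.eye(n), [1.0, -1.0])]) / np.sqrt(2.0)
    return H

N = 64
Phi = np.fft.fft(np.eye(N)) / np.sqrt(N)   # sensing basis: rows of the unitary DFT
Psi = haar_matrix(N)                       # sparsity basis: rows of the Haar transform

# Local coherences (Definition 2): mu_j = sqrt(N) * max_k |<phi_j, psi_k>|.
mu_loc = np.sqrt(N) * np.abs(Phi @ Psi.conj().T).max(axis=1)
print(mu_loc[:4], mu_loc.min())            # the lowest frequencies carry the largest coherence

# Local coherence sampling (Corollary 1): pick rows of Phi with probability proportional
# to kappa_j^2 (here kappa_j = mu_j), and precondition the k-th sampled row by
# 1/(sqrt(c) kappa_{j_k}).
kappa2 = mu_loc ** 2
probs = kappa2 / kappa2.sum()              # c = 1 / sum_j kappa_j^2
m = 24
rng = np.random.default_rng(6)
rows = rng.choice(N, size=m, p=probs)
A = Phi[rows]
P = np.diag(np.sqrt(kappa2.sum()) / mu_loc[rows])

# The preconditioned composite matrix has entries uniformly bounded by sqrt(B):
B_const = kappa2.mean()
print(np.abs(P @ A @ Psi.conj().T).max(), np.sqrt(B_const))
```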

Remark 2.

Note that the local coherence not only influences the embedding dimension  m, it also influences the sampling measure. Hence a priori, one cannot guarantee the optimal embedding dimension if one only has suboptimal bounds for the local coherence. That is why the sampling measure in Theorem 2 is defined via the (known) upper bounds  κ and \(\|\kappa \|_{2}\) rather than the (usually unknown) exact values  μ  loc and \(\|\mu _{loc}\|_{2}\), showing that local coherence sampling is  robust with respect to the sampling measure: suboptimal bounds still lead to meaningful bounds on the embedding dimension.

We now present two applications where local-coherence sampling enables a sampling scheme with sparse recovery guarantees.

Remark 3.

The \(\log ^{4}(N)\) factor in the required number of measurements m can be reduced to a single \(\log (N)\) factor if one asks not for uniform sparse recovery (of the form “with high probability, this holds for all x”) but rather for a with-high-probability result holding only for a particular x (of the form “for this x, recovery holds with high probability”). See [18] for more details.

Variable-density sampling for compressive sensing MRI

In Magnetic Resonance Imaging, after proper discretization, the unknown image \((x_{j_{1},j_{2}})\) is a two-dimensional array in \(\mathbb{R}^{n\times n}\), and allowable sensing measurements are two-dimensional Fourier transform measurements:

$$\displaystyle{\phi _{k_{1},k_{2}} = \frac{1} {n}\sum _{j_{1},j_{2}}x_{j_{1},j_{2}}e^{2\pi i(k_{1}j_{1}+k_{2}j_{2})/n},\quad - n/2 + 1 \leq k_{ 1},k_{2} \leq n/2.}$$

Natural sparsity domains for images, such as discrete spatial differences, are not incoherent with the Fourier basis.

A number of empirical studies, including the very first papers on compressed sensing MRI, observed that image reconstructions from compressive frequency measurements could be significantly improved by variable-density sampling.

Note that lower frequencies are more coherent with wavelets and step functions than higher frequencies. In [21], the local coherence between the two-dimensional Fourier basis and bivariate Haar wavelet basis was calculated:

Proposition 2.

 The local coherence between frequency \(\phi _{k_{1},k_{2}}\)  and the bivariate Haar wavelet basis Ψ = (ψ I  ) can be bounded by

$$\displaystyle{\mu (\phi _{k_{1},k_{2}},\varPsi ) \lesssim \frac{\sqrt{N}} {(\vert k_{1} + 1\vert ^{2} + \vert k_{2} + 1\vert ^{2})^{1/2}}.}$$

Note that this local coherence is almost square integrable, independently of the discretization size \(N = n^{2}\), as

$$\displaystyle{ \frac{1} {N}\sum _{j=1}^{N}\mu _{ j}^{2} \lesssim \log (n).}$$

Applying Corollary 1 to compressive MRI imaging, we then have

Corollary 2.

 Let \(n \in \mathbb{N}\) and \(N = n^{2}\) . Let Ψ be the bivariate Haar wavelet basis and let \(\varPhi = (\phi _{k_{1},k_{2}})\)  be the two-dimensional discrete Fourier transform. Let s ≥ 1, and suppose that \(m \gtrsim s\log ^{5}(N)\)  . Select m (possibly not distinct) frequencies \((\phi _{k_{1},k_{2}})\)  independently and identically distributed from the multinomial distribution on \(\{1,2,\ldots,N\}\)  with weights proportional to the inverse squared Euclidean distance to the origin, \(\frac{1} {(\vert k_{1}+1\vert ^{2}+\vert k_{2}+1\vert ^{2})}\)  , and form the sensing matrix \(A: \mathbb{C}^{N} \rightarrow \mathbb{C}^{m}\)  , along with the diagonal preconditioning matrix \(\mathcal{D}\in \mathbb{C}^{m\times m}\)  with entries proportional to \((\vert k_{1} + 1\vert ^{2} + \vert k_{2} + 1\vert ^{2})^{1/2}\)  as in Corollary  1  . Then the following holds with probability exceeding \(1 - N^{-C\log ^{3}(s) }:\)  for each image \(x \in \mathbb{C}^{n\times n}\)  , given noisy measurements \(y = Ax +\eta\)  with \(\|\mathcal{D}\eta \|_{2} \leq \epsilon\)  , the approximation

$$\displaystyle{x^{\#} = \mbox{ arg}\min _{ u\in \mathbb{C}^{n\times n}}\|\varPsi ^{{\ast}}u\|_{ 1}\mbox{ subject to }\|\mathcal{D}Au -\mathcal{D}y\|_{2} \leq \epsilon }$$

 satisfies the error guarantee \(\|x - x^{\#}\|_{2} \lesssim \frac{1} {\sqrt{s}}\|\varPsi ^{{\ast}}x - (\varPsi ^{{\ast}}x)_{s}\|_{1} +\varepsilon.\)
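
A minimal sketch of the variable-density frequency sampling in Corollary 2 follows; the grid size and sample budget are illustrative, and the single frequency at which the stated weight would be infinite is guarded by capping the denominator.

```python
import numpy as np

n = 64
m = 1200
rng = np.random.default_rng(7)

# Frequency grid -n/2+1 <= k1, k2 <= n/2, with weights proportional to the inverse
# squared distance to the origin used in Corollary 2 (denominator capped below by 1).
k = np.arange(-n // 2 + 1, n // 2 + 1)
K1, K2 = np.meshgrid(k, k, indexing="ij")
weights = 1.0 / np.maximum(np.abs(K1 + 1) ** 2 + np.abs(K2 + 1) ** 2, 1)
probs = (weights / weights.sum()).ravel()

# Draw m (possibly repeated) frequencies i.i.d. from this variable-density distribution;
# low frequencies, which are most coherent with Haar wavelets, are sampled most often.
idx = rng.choice(n * n, size=m, p=probs)
freqs = np.column_stack([K1.ravel()[idx], K2.ravel()[idx]])
print(freqs[:5])
print("fraction of samples with max(|k1|,|k2|) <= 8:",
      np.mean(np.max(np.abs(freqs), axis=1) <= 8))
```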

Remark 4.

This result was generalized to multidimensional wavelet and Fourier bases (not just two dimensions as considered above), and to any Daubechies wavelet basis in [20].

Remark 5.

One can prove similar guarantees as in Corollary 2 using total variation minimization reconstruction, see [21, 25].

Sparse orthogonal polynomial expansions

Here we consider the problem of recovering a polynomial g from m sample values \(g(x_{1}),g(x_{2}),\ldots,g(x_{m})\), with sampling points \(x_{\ell} \in [-1,1]\) for \(\ell= 1,\ldots,m\). If the number of sampling points is less than or equal to the degree of g, then in general such reconstruction is impossible for dimension reasons. However, the situation becomes tractable if we make a sparsity assumption. In order to introduce a suitable notion of sparsity, we consider the orthonormal basis of Legendre polynomials.

Definition 3.

The (orthonormal) Legendre polynomials \(P_{0},P_{1},\ldots,P_{n},\ldots\) are uniquely determined by the following conditions:

  • P_n(x) is a polynomial of precise degree n in which the coefficient of x^n is positive,

  • the system \(\{P_{n}\}_{n=0}^{\infty }\) is orthonormal with respect to the normalized Lebesgue measure on [−1, 1]: \(\quad \frac{1} {2}\int _{-1}^{1}P_{ n}(x)P_{m}(x)dx =\delta _{n,m},\quad n,m = 0,1,2,\ldots\)

Since the interval [−1, 1] is symmetric, the Legendre polynomials satisfy P_n(x) = (−1)^n P_n(−x). For more information see [44].

An arbitrary real-valued polynomial  g of degree  N − 1 can be expanded in terms of Legendre polynomials,

$$\displaystyle{g(x) =\sum _{ j=0}^{N-1}c_{ j}P_{j}(x),\quad x \in [-1,1]}$$

with coefficient vector \(c \in \mathbb{R}^{N}\). The vector c is s-sparse if \(\|c\|_{0} \leq s\). Given a set of m sampling points \((x_{1},x_{2},\ldots,x_{m})\), the samples y_k = g(x_k), \(k = 1,\ldots,m\), may be expressed concisely in terms of the coefficient vector according to

$$\displaystyle{y =\varPhi c,}$$

where ϕ_{k,j} = P_j(x_k). If the sampling points \(x_{1},\ldots,x_{m}\) are random variables, then the matrix \(\varPhi \in \mathbb{R}^{m\times N}\) is exactly the sampling matrix corresponding to random samples from the Legendre system \(\{P_{j}\}_{j=0}^{N-1}\). This is not a bounded orthonormal system, however, as the Legendre polynomials grow like

$$\displaystyle{\vert P_{n}(x)\vert \leq (n + 1/2)^{1/2},\quad - 1 \leq x \leq 1.}$$

Nevertheless, the Legendre system does have a square-integrable local coherence function. The following classical bound can be found in [44].

Proposition 3.

 For all n > 0 and for all x ∈ [−1,1], |P n  (x)| < κ(x) = 2π −1∕2  (1 − x 2  ) −1∕4  . Here, the constant 2 π −1∕2  cannot be replaced by a smaller one.

Indeed, κ(x) is square integrable, with \(\kappa ^{2}(x)\) proportional to the Chebyshev density \(\pi ^{-1}(1 - x^{2})^{-1/2}\). We arrive at the following result for Legendre polynomial interpolation as a corollary of Theorem 2.

Corollary 3.

 Let \(x_{1},\ldots,x_{m}\)  be chosen independently at random on [−1,1] according to the Chebyshev measure \(\pi ^{-1}(1 - x^{2})^{-1/2}dx\). Consider the matrix \(\varPhi \in \mathbb{R}^{m\times N}\)  with entries \(\phi _{k,j} = P_{j}(x_{k})\), \(j = 0,\ldots,N - 1\), and the diagonal preconditioning matrix \(\mathcal{D}\in \mathbb{R}^{m\times m}\)  with entries \(d_{k,k} = \sqrt{\pi /2}\,(1 - x_{k}^{2})^{1/4}\). If

$$\displaystyle{ m \gtrsim s\log ^{3}(s)\log (N), }$$
(31)

 for some \(s \gtrsim \log (N)\)  , then the following holds with probability exceeding \(1 - N^{-C\log ^{3}(s) }.\)  For each coefficient vector \(c \in \mathbb{R}^{N}\)  , given noisy samples \(y =\varPhi c + \sqrt{m}\eta\)  with \(\|\mathcal{D}\eta \|_{2} \leq \sqrt{m}\varepsilon\)  , the approximation

$$\displaystyle{c^{\#} = \mbox{ arg}\min _{ z\in \mathbb{R}^{N}}\|z\|_{1}\mbox{ subject to }\|\mathcal{D}\varPhi z -\mathcal{D}y\|_{2} \leq \sqrt{m}\varepsilon }$$

 satisfies the error guarantee \(\|c - c^{\#}\|_{2} \lesssim \frac{1} {\sqrt{s}}\|c - c_{s}\|_{1}+\varepsilon\)  , where c_s is the best s-term approximation to c.

In fact, more general theorems exist: the Chebyshev measure is a universal sampling strategy for interpolation with any set of orthogonal polynomials [32]. An extension to the setting of interpolation with spherical harmonics, and more generally, to the eigenfunctions corresponding to smooth compact manifolds, can be found in [6, 32], respectively. For extensive numerical illustrations comparing Chebyshev vs. uniform sampling, also for high-dimensional tensor-product polynomial expansions, we refer the reader to [18].
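
The following sketch draws Chebyshev-distributed sampling points, forms the orthonormal Legendre sampling matrix, and applies the preconditioner of Corollary 3, checking numerically that the preconditioned rows behave like samples of a bounded orthonormal system; the sizes are illustrative.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(8)
N, m = 40, 2000

# Sampling points drawn from the Chebyshev measure pi^{-1} (1-x^2)^{-1/2} dx on [-1,1]:
x = np.cos(np.pi * rng.uniform(size=m))

# Legendre matrix Phi_{k,j} = P_j(x_k), with P_j orthonormal w.r.t. dx/2, i.e.
# P_j = sqrt(2j+1) times the classical Legendre polynomial of degree j.
Phi = legendre.legvander(x, N - 1) * np.sqrt(2 * np.arange(N) + 1)

# Chebyshev preconditioner from Corollary 3: d_{k,k} = sqrt(pi/2) (1 - x_k^2)^{1/4}.
d = np.sqrt(np.pi / 2) * (1 - x ** 2) ** 0.25
DPhi = d[:, None] * Phi

# The rows of D*Phi are samples of a bounded orthonormal system, so the scaled
# empirical Gram matrix concentrates around the identity.
gram = (DPhi.T @ DPhi) / m
print(np.abs(DPhi).max())                         # uniformly bounded entries
print(np.linalg.norm(gram - np.eye(N), 2))        # shrinks toward 0 as m grows relative to N
```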

Structured sparse recovery

Often, the prior of sparsity can be refined, and additional structure of the support set is known. In the MRI example, where one senses signals that are sparse in wavelets with Fourier measurements, the sparsity level will be higher for higher-order wavelets. One may consider sampling strategies based on a more refined notion of local coherence – based not only on \(\mu _{j} =\sup _{1\leq k\leq N}\sqrt{N}\vert \langle \phi _{j},\psi _{k}\rangle \vert \), but also on coherences of sub-blocks \(\mu _{j,B_{k}} =\sup _{k\in B_{k}}\sqrt{N}\vert \langle \phi _{j},\psi _{k}\rangle \vert.\) For more information, we refer the reader to the survey article [1] and the references therein.

In fact, we also have more information about the sparsity structure in the setting of function interpolation. It is well known that the smoothness of a function is reflected in the rate of decay of its Fourier coefficients or orthonormal Legendre polynomial coefficients, and vice versa. Thus, smooth functions have directional sparsity in their orthonormal polynomial expansions: low-order and low-degree polynomials are more likely to contribute to the representation. Another way to account for directional sparsity is in the reconstruction method itself: a more general theory of sparse recovery, which uses weighted ℓ1 minimization as a reconstruction strategy (serving as a weighted sparsity prior) and incorporates importance sampling, can be found in [33].

One of the motivating applications of sparse orthogonal polynomial expansions is the setting of Polynomial Chaos expansions in the area of Uncertainty Quantification (UQ), which involves modeling the output of an expensive system with high-dimensional random inputs as having an approximately sparse expansion in a tensorized orthogonal polynomial basis. As shown in [18], in high dimensions the local coherence sampling strategy depends on how large the dimension is compared to the maximal order of the orthogonal polynomials considered: for higher-order models, Chebyshev sampling is a good strategy; for low-order, high-dimensional problems, uniform sampling outperforms Chebyshev sampling. For a detailed overview and more results, we refer the reader to [18].

Importance sampling in low-rank matrix recovery

Low-rank matrix completion

The task of  low-rank matrix completion concerns the recovery of a low-rank matrix from a subset of its revealed entries, and nuclear norm minimization has emerged as an effective surrogate for this combinatorial problem. In fact, nuclear norm minimization can recover an arbitrary  n ×  n matrix of rank  r from \(\mathcal{O}(nr\log ^{2}(n))\) revealed entries, provided that revealed entries are drawn proportionally to the local row and column coherences (closely related to leverage scores) of the underlying matrix. Matrix completion has been the subject of much recent study due to its application in myriad tasks: collaborative filtering, dimensionality reduction, clustering, non-negative matrix factorization and localization in sensor networks. Clearly, the problem is ill-posed in general; correspondingly, analytical work on the subject has focused on the joint development of algorithms, and sufficient conditions under which such algorithms are able to recover the matrix.

If the true matrix is  M with entries  M  ij , and the set of observed elements is  Ω, this method guesses as the completion the optimum of the convex program:

$$\displaystyle\begin{array}{rcl} & & \min _{X}\quad \quad \|X\|_{{\ast}} \\ & &\qquad \mbox{ s.t.}\quad X_{ij}\ =\ M_{ij}\ \mbox{ for $(i,j) \in \varOmega.$}{}\end{array}$$
(32)

where the “nuclear norm” \(\|\cdot \|_{{\ast}}\) of a matrix is the sum of its singular values. Throughout, we use the standard notation \(f(n) =\varTheta (g(n))\) to mean that cg(n) ≤ f(n) ≤ Cg(n) for some positive constants c, C.

We focus on the setting where matrix entries are revealed from an underlying probability distribution. To introduce the distribution of interest, we first need a definition.

Definition 4.

For an n_1 × n_2 real-valued matrix M of rank r with SVD given by \(U\varSigma V ^{\top }\), the local coherences – μ_i for any row i, and ν_j for any column j – are defined by the following relations

$$\displaystyle\begin{array}{rcl} \left \Vert U^{\top }e_{ i}\right \Vert & =& \sqrt{ \frac{\mu _{i } r} {n_{1}}}\quad,\quad i = 1,\ldots,n_{1} \\ \left \Vert V ^{\top }e_{ j}\right \Vert & =& \sqrt{ \frac{\nu _{j } r} {n_{2}}}\quad,\quad j = 1,\ldots,n_{2}.{}\end{array}$$
(33)

Note that the  μ  i ,  ν  j s are non-negative, and since  U and  V have orthonormal columns we always have \(\sum _{i}\mu _{i}r/n_{1} =\sum _{j}\nu _{j}r/n_{2} = r.\)
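
The local coherences (33) are straightforward to compute from a thin SVD; the following sketch does so for a synthetic low-rank matrix with one deliberately “coherent” row, and forms entry-wise sampling probabilities in the spirit of (34), with the constants and logarithmic factors absorbed into an arbitrary illustrative scale.

```python
import numpy as np

rng = np.random.default_rng(9)
n1, n2, r = 300, 200, 5

# A rank-r matrix with a deliberately "spiky" (coherent) column space.
U0 = rng.standard_normal((n1, r))
U0[0] *= 30.0                                  # make row 0 unusually important
V0 = rng.standard_normal((n2, r))
M = U0 @ V0.T

# Local coherences (33), computed from the thin SVD of M.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, V = U[:, :r], Vt[:r].T
mu = n1 / r * np.sum(U ** 2, axis=1)           # mu_i = (n1/r) ||U^T e_i||^2
nu = n2 / r * np.sum(V ** 2, axis=1)           # nu_j = (n2/r) ||V^T e_j||^2

# Sampling probabilities in the spirit of (34); the constant c_0 and the log^2 factor
# are absorbed into an illustrative scale chosen to give about 5 r (n1 + n2) samples.
scale = 5 * r * (n1 + n2) / np.add.outer(mu, nu).sum()
p = np.minimum(scale * np.add.outer(mu, nu), 1.0)

mask = rng.uniform(size=(n1, n2)) < p          # each entry revealed independently w.p. p_ij
print("coherence of row 0 vs. median row:", mu[0], np.median(mu))
print("expected / actual number of revealed entries:", p.sum(), mask.sum())
```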

The following theorem is from [13].

Theorem 3.

 Let M = (M_{ij}) be an n_1 × n_2 matrix with local coherence parameters {μ_i , ν_j }, and suppose that its entries M_{ij} are observed only over a subset of elements \(\varOmega \subset [n_{1}] \times [n_{2}]\)  . There are universal constants c_0 ,c_1 ,c_2 > 0 such that if each element (i,j) is independently observed with probability p_{ij}, and p_{ij} satisfies

$$\displaystyle\begin{array}{rcl} p_{ij}\ & \geq & \ \min \left \{\ \ c_{0}\frac{(\mu _{i} +\nu _{j})r\log ^{2}(n_{1} + n_{2})} {\min \{n_{1},n_{2}\}} \ \,\ \ 1\ \ \right \}, \\ p_{ij}\ & \geq & \ \frac{1} {\min \{n_{1},n_{2}\}^{10}}, {}\end{array}$$
(34)

 then M is the unique optimal solution to the nuclear norm minimization problem (32)  with probability at least \(1 - c_{1}(n_{1} + n_{2})^{-c_{2}}\)  .

We will refer to the sampling strategy (34) as  local coherence sampling. Note that the expected number of observed entries is \(\sum _{i,j}p_{ij}\), and this satisfies

$$\displaystyle\begin{array}{rcl} \sum _{i,j}p_{ij}& \geq & \max \left \{c_{0}\frac{r\log ^{2}(n_{1} + n_{2})} {\min \{n_{1},n_{2}\}} \sum _{i,j}(\mu _{i} +\nu _{j}),\sum _{i,j} \frac{1} {n^{10}}\right \} {}\\ & =& 2c_{0}\max \left \{n_{1},n_{2}\right \}r\log ^{2}(n_{ 1} + n_{2}), {}\\ \end{array}$$

independent of the coherence, or indeed any other property, of the matrix. Hoeffding’s inequality implies that the actual number of observed entries sharply concentrates around its expectation, leading to the following corollary:

Corollary 4.

 Let M = (M_{ij}) be an n_1 × n_2 matrix with local coherence parameters {μ_i , ν_j }. Draw a subset of its entries by local coherence sampling according to the procedure described in Theorem  3  . There are universal constants c′_1 ,c′_2 > 0 such that the following holds with probability at least \(1 - c'_{1}(n_{1} + n_{2})^{-c'_{2}}\)  : the number m of revealed entries is bounded by

$$\displaystyle{m \leq 3c_{0}\max \left \{n_{1},n_{2}\right \}r\log ^{2}(n_{ 1} + n_{2}),}$$

 and M is the unique optimal solution to the nuclear norm minimization program  (32)  .

 (A) Roughly speaking, the condition given in (34) ensures that entries in important rows/columns (indicated by large local coherences μ_i and ν_j) of the matrix are observed more often. Note that Theorem 3 only stipulates that an inequality relation hold between p_{ij} and \(\left \{\mu _{i},\nu _{j}\right \}\). This allows for some discrepancy between the sampling distribution and the local coherences. It also has the natural interpretation that the more the sampling distribution \(\left \{p_{ij}\right \}\) is “aligned” with the local coherence pattern of the matrix, the fewer observations are needed.

 (B) Sampling based on local coherences provides close to the optimal number of sampled elements required for exact recovery (under any sampling distribution). In particular, recall that the number of degrees of freedom of an n × n matrix of rank r is 2nr(1 − r∕2n). Hence, regardless of how the entries are sampled, a minimum of \(\varTheta (nr)\) entries is required to recover the matrix. Theorem 3 matches this lower bound up to an additional \(O(\log ^{2}(n))\) factor.

 (C) Theorem 3 is from [13] and improves on the first results on matrix completion [7, 9, 17, 34], which assumed uniform sampling and incoherence – i.e., every μ_i ≤ μ_0 and every ν_j ≤ μ_0 – as well as an additional joint incoherence parameter μ_str defined by \(\|UV ^{\top }\|_{\infty } = \sqrt{ \frac{r\mu _{str } } {n_{1}n_{2}}}\). The proof of Theorem 3 involves an analysis based on bounds involving the weighted \(\ell_{\infty,2}\) matrix norm, defined as the maximum of the appropriately weighted row and column norms of the matrix. This differs from previous approaches that use \(\ell_{\infty }\) or unweighted \(\ell_{\infty,2}\) bounds [12, 17]. In some sense, using the weighted \(\ell_{\infty,2}\)-type bounds is natural for the analysis of low-rank matrices, because the rank is a property of the rows and columns of the matrix rather than of its individual entries, and the weighted norm captures the relative importance of the rows/columns.

 (D) If the column space of  M is incoherent with \(\max _{i}\mu _{i} \leq \mu _{0}\) and the row space is arbitrary, then one can randomly pick \(\varTheta (\mu _{0}r\log n)\) rows of  M and observe all their entries, and compute the local coherences of the space spanned by these rows. These parameters will be equal to the  ν  j ’s of  M with high probability. Based on these values, we can perform non-uniform sampling according to (34) and  exactly recover  M. Note that this procedure does not require any prior knowledge about the local coherences of  M. It uses a total of \(\varTheta (\mu _{0}rn\log ^{2}n)\) samples. This was observed in [22].

Theorem 3 has some interesting consequences, discussed in detail in [13] and outlined below.

  • Theorem 3 can be turned on its head, and used to quantify the benefit of  weighted nuclear norm minimization over standard nuclear norm minimization, and provide a strategy for choosing the weights in such problems given non-uniformly distributed samples so as to reduce the sampling complexity of weighted nuclear norm minimization to that of standard nuclear norm minimization. In particular, these results can provide exact recovery guarantees for weighted nuclear norm minimization as introduced in [16, 27, 39], thus providing theoretical justification for its good empirical performance.

  • Numerical evidence suggests that a two-phase adaptive sampling strategy, which assumes no prior knowledge about the local coherences of the underlying matrix M, can perform on par with the optimal sampling strategy in completing coherent matrices, and significantly outperforms uniform sampling. Specifically, [13] considers a two-phase sampling strategy whereby, given a fixed budget of m samples, one first draws a fixed proportion of the samples uniformly at random, and then draws the remaining samples according to the local coherence structure of the resulting sampled matrix.