Abstract
We give lower bounds on the reconstruction error for PCA, k-means clustering, and various sparse coding methods. It is shown that the two objectives of good data approximation and sparsity of the solution are incompatible if the data distribution is evasive in the sense that most of its mass lies away from any low dimensional subspace. We give closure properties and examples of evasive distributions and quantify the extent to which distributions of bounded support and bounded density are evasive.
1 Introduction
Much recent work in machine learning and signal processing has concentrated on the problem of approximating high dimensional data \(x\in \mathbb {R}^{N}\) by sparse linear combinations of the columns of a dictionary matrix Footnote 1 \(D=[d_{1},...,d_{K}]\in \mathbb {R}^{N\times K}\)—see, for example, [3, 4, 6–12] and references therein. For a fixed dictionary D every such linear combination has the form
$$ Dy=\sum _{i=1}^{K}y_{i}d_{i}, $$
where the vector of coefficients y is chosen to be a solution of the optimization problem
$$ \min _{y\in \varLambda }\left\| x-Dy\right\| ^{2}. \qquad \mathrm{(24.1)} $$
Here \(\varLambda \subseteq \mathbb {R}^{K}\) is a fixed regularizing set, which implements constraints on the complexity of the chosen representations. We denote by y(x) such a solution, and it is one inherent objective that the vectors \(y\left( x\right) \) obtained should be sparse, in that the number of their nonzero coefficients is much smaller than the ambient dimension N. If \(y\left( x\right) \) is not sparse itself, it should at least have a close sparse approximation.
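For small K, problem (24.1) with a hard sparsity constraint can be solved exactly by enumerating supports. The following sketch (numpy; the dictionary and signal are illustrative choices, not from the text) makes the role of \(y\left( x\right) \) concrete.

```python
import itertools

import numpy as np

def sparse_code(x, D, s):
    """Brute-force solution of min ||x - D y||^2 over ||y||_0 <= s.

    Exponential in K -- an illustration of the objective, not a practical algorithm.
    """
    N, K = D.shape
    best_y, best_err = np.zeros(K), np.dot(x, x)  # y = 0 is always feasible
    for S in itertools.combinations(range(K), s):
        # Least squares restricted to the columns indexed by S.
        coef, *_ = np.linalg.lstsq(D[:, S], x, rcond=None)
        err = np.sum((x - D[:, S] @ coef) ** 2)
        if err < best_err:
            best_err = err
            best_y = np.zeros(K)
            best_y[list(S)] = coef
    return best_y, best_err

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 6))
x = 0.5 * D[:, 1] - 2.0 * D[:, 4]   # exactly 2-sparse in D
y, err = sparse_code(x, D, 2)
print(np.nonzero(y)[0], err)        # support of the best 2-sparse code, near-zero error
```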
We assume that the data x are distributed at random according to a distribution \(\mu \) on \( \mathbb {R}^{N}\) corresponding to a random variable X taking values in \( \mathbb {R}^{N}\). The reconstruction error \(\left\| X-Dy\left( X\right) \right\| ^{2}\) is then a random variable and its expectation
$$ R\left( D\right) =E\left\| X-Dy\left( X\right) \right\| ^{2} $$
measures the failure of D to match the distribution \(\mu \). Thus, given \( \varLambda \), one wishes to choose D so as to minimize \(R\left( D\right) \).
In this chapter we show that these methods are likely to produce poor results for a large class of seemingly well-behaved distributions on \( \mathbb {R}^{N}\), because the two objectives are incompatible: With high probability the representing vector \(y\left( X\right) \) is either not very sparse (or does not have a good sparse approximation), or the reconstruction error is large. Our negative results are independent of any problem of sample-based estimation and still hold if we have complete knowledge of the distribution \(\mu \).
The “bad” distributions \(\mu \) have the following property of evasiveness: For any low dimensional subspace M, the overwhelming majority of the mass of \(\mu \) is bounded away from M. Below we use the notation \(d\left( x,M\right) =\inf _{z\in M}\left\| x-z\right\| \).
Definition 24.1
Suppose \(\alpha _{k}\) is a nonincreasing sequence of positive real numbers, \(\beta \), \(C>0\). A random variable X with values in \(\mathbb {R}^{N}\) is said to be \(\left( \alpha ,\beta ,C\right) \)-evasive if for every \(k < N \), every k-dimensional subspace M of \( \mathbb {R} ^{N}\) and every \(t\in \left( 0,\alpha _{k}\right) \)
$$ \Pr \left\{ d\left( X,M\right) ^{2}\le \alpha _{k}-t\right\} \le C\exp \left( -\beta Nt^{2}\right) . $$
A probability measure \(\mu \) on \( \mathbb {R}^{N}\) is called \(\left( \alpha ,\beta ,C\right) \)-evasive if the corresponding random variable is.
We give two examples:
Example 24.1
(Noisy generative model) If Y is any random variable in \(\mathbb {R}^{N}\) and Z is a centered isotropic Gaussian with variance \(\sigma ^{2}\) and independent of Y, then the random variable \(X=Y+Z\) is evasive with
$$ \alpha _{k}=\frac{\sigma ^{2}\left( N-k-1-\pi ^{2}\right) }{N},\quad \beta =\frac{1}{2\pi ^{2}\sigma ^{4}},\quad C=2, $$
as will be shown in Sect. 24.3.2. With Y taking values in a union of low dimensional subspaces generated by some potentially unknown dictionary, the random variable X can be viewed as a generative model contaminated by noise. Here we will prove lower bounds in the order of \(\sigma ^{2}\).
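A quick Monte Carlo illustration of this evasiveness (a sketch assuming the noise is normalized so that \(E\left\| Z\right\| ^{2}=\sigma ^{2}\); all sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, k, sigma = 10000, 200, 5, 0.3
# Y lives in the k-dimensional subspace M spanned by the first k coordinate axes.
Y = np.zeros((n, N))
Y[:, :k] = rng.standard_normal((n, k))
# Isotropic Gaussian noise, scaled so that E||Z||^2 = sigma^2.
Z = (sigma / np.sqrt(N)) * rng.standard_normal((n, N))
X = Y + Z
# The distance to M is the norm of the component outside the first k coordinates,
# so d(X, M)^2 concentrates near sigma^2 * (N - k) / N, i.e. of order sigma^2.
d2 = np.sum(X[:, k:] ** 2, axis=1)
print(d2.mean(), d2.min())
```

Even the smallest of the ten thousand sampled squared distances stays bounded away from zero, which is exactly the evasive behavior exploited by the lower bounds.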
Example 24.2
(Bounded support and bounded density) While the previous example is of a special form, this example is more generic. If a distribution \(\mu \) has support in the unit ball \(B_{N}\) of \( \mathbb {R}^{N}\) and a bounded density \(d\mu /d\rho \) with respect to the uniform measure \(\rho \) on \(B_{N}\), then \(\mu \) is evasive with
where \(\left\| .\right\| _{\infty }\) is the essential supremum norm w.r.t. \(\rho \). This will be shown in Theorem 24.3 below.
We come to the announced negative results. Suppose first that in (24.1) a hard sparsity constraint is implemented by the regularizing set
$$ \varLambda _{s}=\left\{ y\in \mathbb {R}^{K}:\left\| y\right\| _{0}\le s\right\} , \qquad \mathrm{(24.2)} $$
where \(\left\| y\right\| _{0}\) is the number of nonzero components of y and s is any integer \(s\le K\). An easy union bound then gives the following result:
Theorem 24.1
Let \(D=[d_{1},...,d_{K}]\in \mathbb {R}^{N\times K}\) be any dictionary and suppose that X is \(\left( \alpha ,\beta ,C\right) \)-evasive. Then for any integer \(s\le K\) and \(t\in \left( 0,\alpha _{s}\right) \)
$$ \Pr \left\{ \min _{y\in \varLambda _{s}}\left\| X-Dy\right\| ^{2}\le \alpha _{s}-t\right\} \le CK^{s}\exp \left( -\beta Nt^{2}\right) . \qquad \mathrm{(24.3)} $$
If \(s\ln K\ll N\) the reconstruction error is bounded away from zero with high probability.
We might hope to improve this situation by requiring the encoding vectors y to be sparse only in some approximate sense. The next result holds for all vectors \(y\in \mathbb {R}^{K}\), sparse or not, and exhibits a tradeoff between the quality of two approximations: the approximation of the data by Dy and the \(\ell _{1}\)-approximation of y by its nearest vector of prescribed sparsity. For \( y=(y_{1},\dots ,y_{K})\in \mathbb {R}^{K}\) and \(s<K\) let \(y^{s}\) denote the s-sparse approximation of y, obtained by setting all components \(y_{i}\) equal to zero except for the s indices where \(\left| y_{i}\right| \) is largest.
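Computing \(y^{s}\) is straightforward; a minimal sketch (numpy, with illustrative values):

```python
import numpy as np

def sparse_approx(y, s):
    """Keep the s largest-magnitude entries of y and zero out the rest."""
    ys = np.zeros_like(y)
    keep = np.argsort(np.abs(y))[-s:]   # indices of the s largest |y_i|
    ys[keep] = y[keep]
    return ys

y = np.array([0.1, -3.0, 0.0, 2.0, -0.2])
ys = sparse_approx(y, 2)
print(ys)                        # [ 0. -3.  0.  2.  0.]
print(np.sum(np.abs(y - ys)))    # l1 approximation error, ~0.3
```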
Theorem 24.2
Let D be any dictionary with \(\left\| d_{i}\right\| =\left\| De_{i}\right\| \le B\) and suppose that X is \(\left( \alpha ,\beta ,C\right) \)-evasive. Then for every \(\delta \in \left( 0,1\right) \) we have with probability at least \(1-\delta \) for every \( y\in \mathbb {R}^{K}\) and every \(s\in \left\{ 1,...,K\right\} \) that
In many applications we can assume \(B=1\). So if \(s\ln K\ll N\) and \( \left\| y-y^{s}\right\| _{1}\) is small (so that y has a close s-sparse approximation) then the reconstruction error is of order \(\alpha _{s}\).
Below we apply these results to PCA, K-means clustering and sparse coding, and delimit the class of distributions to which these methods of unsupervised learning can be successfully applied.
2 Applications and Examples
The framework described in the introduction is general enough to capture many approaches to unsupervised learning.
2.1 PCA
In problem (24.1), if \(\varLambda \) is all of \(\mathbb {R}^{K}\) with \(K=s\ll N\), then an optimal D is found to be an isometry from \(\mathbb {R}^{K}\) onto a K-dimensional subspace capturing maximal variance of X (the span of the top K eigenvectors of its covariance). The resulting method is PCA, and trivially every representing vector is s-sparse, namely y(x) has at most \(s=K\) nonzero components.
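In this case the expected reconstruction error equals the sum of the discarded eigenvalues of the covariance, which can be checked on synthetic data (a numpy sketch; the data model and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, n = 20, 3, 5000
# Synthetic data with a decaying spectrum.
X = rng.standard_normal((n, N)) @ np.diag(np.linspace(2.0, 0.1, N))
C = X.T @ X / n                  # (uncentered) empirical covariance
w, V = np.linalg.eigh(C)         # eigenvalues in ascending order
D = V[:, -K:]                    # top-K eigenvectors as dictionary columns
Y = X @ D                        # y(x): orthogonal projection coefficients
R = np.mean(np.sum((X - Y @ D.T) ** 2, axis=1))
print(R, np.sum(w[:-K]))         # reconstruction error = sum of discarded eigenvalues
```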
We could apply Theorem 24.1, but this would incur a superfluous term \(K\ln K\) in the exponent of the bound. Instead, by directly appealing to the definition (since \(Dy\left( X\right) \) always lies in the at most K-dimensional span of the columns of D), we find that for \(\left( \alpha ,\beta ,C\right) \)-evasive X and any dictionary D
$$ \Pr \left\{ \left\| X-Dy\left( X\right) \right\| ^{2}\le \alpha _{K}-t\right\} \le C\exp \left( -\beta Nt^{2}\right) ,\quad t\in \left( 0,\alpha _{K}\right) . $$
An illustration of the evasiveness of bounded densities (Example 24.2 above) is the following: Suppose we do PCA in one thousand dimensions, and we know that the data distribution is contained in the unit ball. If we find a 100-dimensional subspace which achieves an expected reconstruction error of \(\approx 0.1\), then the supremum of the distribution density (if such exists, and relative to the uniform measure on the ball) must be at least of the order of \(10^{45}\). The supremum relative to the Lebesgue measure must be at least \(10^{45}/V_{1000}\approx 10^{931}\), where \(V_{N}\) is the volume of the unit ball in \(\mathbb {R}^{N}\). To derive this we use \(\left( \alpha _{K}-t\right) \left( 1-C\exp \left( -\beta Nt^{2}\right) \right) \) as a simple lower bound on the expected reconstruction error with \(t=0.05\), \(N=1000\), \(K=100\), \(\beta =1\), \(C=1\), equate the bound to 0.1, and solve for the bound on the density.
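The first step of this calculation can be checked directly (plain Python; only the numbers quoted in the text are used — converting the resulting \(\alpha _{100}\) into the density bound additionally needs the formula of Theorem 24.3):

```python
import math

# Parameters from the text: t = 0.05, N = 1000, beta = 1, C = 1.
t, N, beta, C = 0.05, 1000, 1.0, 1.0
# With probability at least 1 - C*exp(-beta*N*t^2) the error exceeds alpha_K - t,
# so (alpha_K - t) * tail lower-bounds the expected reconstruction error.
tail = 1.0 - C * math.exp(-beta * N * t**2)
# Equate this lower bound to the observed error 0.1 and solve for alpha_K:
alpha_K = 0.1 / tail + t
print(round(alpha_K, 3))   # alpha_100 can be at most about 0.159
```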
2.2 K-means Clustering
At the other extreme from PCA, if \(\varLambda =\left\{ e_{1},...,e_{K}\right\} \) is the set of standard basis vectors of \(\mathbb {R}^{K}\), then the method becomes K-means clustering or vector quantization, and the optimal dictionary atoms \(d_{1},...,d_{K}\) are just the optimal centers. In this case the complexity constraint can be seen as a maximal sparsity requirement, as every member y of \(\varLambda \) satisfies \( \left\| y\right\| _{0}=1\), but we may now allow \(K>N\).
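The encoding step is then just nearest-center assignment; a minimal sketch (numpy, with toy centers chosen for illustration):

```python
import numpy as np

def kmeans_reconstruction(x, D):
    """With Lambda = {e_1, ..., e_K}, the code y(x) selects the nearest column of D."""
    errs = np.sum((D - x[:, None]) ** 2, axis=0)  # squared distance to each center d_i
    i = np.argmin(errs)
    y = np.zeros(D.shape[1])
    y[i] = 1.0                                    # y = e_i, so ||y||_0 = 1
    return y, errs[i]

D = np.array([[0.0, 1.0],
              [0.0, 1.0]])                        # two centers: the origin and (1, 1)
y, err = kmeans_reconstruction(np.array([0.9, 1.2]), D)
print(np.argmax(y), err)                          # chooses center 1, error ~0.05
```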
With \(\varLambda _{s}\) defined as in (24.2) we find \( \left\{ e_{1},...,e_{K}\right\} \subseteq \varLambda _{1}\), so appealing to Theorem 24.1 we find for \(\left( \alpha ,\beta ,C\right) \)-evasive X and any dictionary D
$$ \Pr \left\{ \min _{1\le i\le K}\left\| X-d_{i}\right\| ^{2}\le \alpha _{1}-t\right\} \le CK\exp \left( -\beta Nt^{2}\right) ,\quad t\in \left( 0,\alpha _{1}\right) . $$
Of course there is a slight giveaway here, because \(\varLambda _{1}\) is somewhat more expressive than \(\left\{ e_{1},...,e_{K}\right\} \).
2.3 Sparse Coding Methods
To make \(\varLambda \) more expressive we can relax the extreme sparsity constraint, setting \(\varLambda =\varLambda _{s}\) with \(1\le s\ll N\). This is the situation directly addressed by Theorem 24.1, which immediately gives a lower error bound. The corresponding method is not very practical, however, because of the unwieldy nature of the \(\ell _{0}\)-function.
The alternative is to replace (24.1) with the following optimization problem
$$ \min _{y\in \mathbb {R}^{K}}\left\| x-Dy\right\| ^{2}+\gamma \left\| y\right\| _{1}, \qquad \mathrm{(24.4)} $$
where \(\gamma \) is some positive constant, thus encouraging sparsity through the use of the \(\ell _{1}\)-norm regularizer. A large body of work has been dedicated to the study of this and related methods, the success of which depends on different coherence properties of D, see [1–3] and references therein. The search for an optimal D in this case corresponds to the sparse coding method proposed by Olshausen and Field [10], which was originally motivated by neurophysiological studies of the mammalian visual system.
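Problem (24.4) can be solved by proximal gradient descent; the following ISTA sketch (numpy; the step count, the value of \(\gamma \), and the synthetic data are illustrative assumptions) shows the \(\ell _{1}\) regularizer producing sparse codes.

```python
import numpy as np

def ista(x, D, gamma, steps=1000):
    """Proximal gradient (ISTA) for min_y ||x - D y||^2 + gamma * ||y||_1."""
    L = 2 * np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the smooth part
    y = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = 2 * D.T @ (D @ y - x)
        z = y - grad / L                   # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)  # soft threshold
    return y

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 100)) / np.sqrt(50)   # roughly unit-norm columns
y_true = np.zeros(100)
y_true[[3, 30, 70]] = [1.0, -2.0, 1.5]
x = D @ y_true
y = ista(x, D, gamma=0.05)
print(np.sum(np.abs(y) > 1e-3))    # only a few nonzero coefficients survive
```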
A similar approach uses the initial formulation (24.1) and takes \(\varLambda \) to be a multiple of the \(\ell _{1}\)-unit ball. It relates to (24.4) as Ivanov regularization relates to Tikhonov regularization.
Another example in this suite of methods is nonnegative matrix factorization, as proposed by Lee and Seung [6], where the \( d_{i}\) are required to be in the positive orthant of \( \mathbb {R}^{N}\).
Theorem 24.2 immediately applies to all these cases and shows that for evasive distributions the requirements of good data approximation and approximate sparsity are incompatible.
3 Proofs
We review our notation and then prove the announced results.
For every vector \(y\in \mathbb {R}^{K}\), we let \(\Vert y\Vert _{0}\) denote the number of nonzero components of y. We say that y is s-sparse if \( \Vert y\Vert _{0}=s\). We denote by \(y^{s}\) an s-sparse vector which is nearest to y according to the \(\ell _{1}\) norm. For every linear subspace M of \(\mathbb {R}^{N}\), we let \(P_{M}\) be the corresponding projection matrix and define \(d(x,M)=\inf _{z\in M}\Vert x-z\Vert \), namely the distance of x to the linear subspace M. Note that \(d(x,M)=\Vert P_{M^{\perp }}x\Vert \), where \(M^{\perp }\) is the orthogonal complement of M. We denote by \(\Vert \cdot \Vert \) the \(\ell _{2}\) norm of a vector and by \( |||\cdot |||\) the operator norm of a matrix. For every \(n\in \mathbb {N}\), we let \(B_{n}\) be the unit ball in \(\mathbb {R}^{n}\) and let \(V_{n}\) be its volume.
If \(\nu \) and \(\rho \) are measures on the same space and \(\nu \left( A\right) =0\) for every A satisfying \(\rho \left( A\right) =0\), then \(\nu \) is called absolutely continuous w.r.t. \(\rho \), and there exists a measurable density function \(d\nu /d\rho \), called the Radon-Nikodym derivative, such that \(d\nu =\left( d\nu /d\rho \right) d\rho \).
3.1 Limitations of Sparse Coding
We prove Theorems 24.1 and 24.2.
Proof
(Proof of Theorem 24.1) For \(S\subseteq \left\{ 1,...,K\right\} \) let \(M_{S}\) denote the subspace spanned by the \(d_{i}\) with \(i\in S\). In the event on the left-hand side of (24.3) there is some subset \(S\subseteq \left\{ 1,...,K\right\} \) with cardinality \(\left| S\right| \le s\) such that \(d\left( X,M_{S}\right) ^{2}\le \alpha _{s}-t\). The dimension of \(M_{S}\) is at most s, so we get from a union bound that
$$ \Pr \left\{ \exists S:d\left( X,M_{S}\right) ^{2}\le \alpha _{s}-t\right\} \le \sum _{S:\left| S\right| =s}\Pr \left\{ d\left( X,M_{S}\right) ^{2}\le \alpha _{s}-t\right\} \le \left( \begin{array}{c}K\\ s\end{array}\right) C\exp \left( -\beta Nt^{2}\right) \le CK^{s}\exp \left( -\beta Nt^{2}\right) . $$
\(\square \)
Proof
(Proof of Theorem 24.2) Denote
For any \(s\in \left\{ 1,...,K\right\} \), \(x\in \mathbb {R}^{N}\) and \(y\in \mathbb {R}^{K}\) we have the following sequence of implications:
(24.5)\(\implies \)(24.6) follows from \(\left( a+b\right) ^{2}\le 2a^{2}+2b^{2}\), (24.6)\(\implies \)(24.7) from \(\left\| Dy\right\| \le B\left\| y\right\| _{1}\), because of the bound on \( \left\| d_{i}\right\| \), and (24.7)\(\implies \)(24.8) from the triangle inequality. Thus
\(\square \)
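The estimate encoded by the implications (24.5)–(24.8) can be sketched as follows, with \(M_{S}\) denoting the span of the columns of D on the support of \(y^{s}\) (a reconstruction from the steps named in the proof):

```latex
d(X, M_S)^2 \;\le\; \|X - D y^s\|^2
  \;\le\; \bigl( \|X - Dy\| + \|D(y - y^s)\| \bigr)^2   % triangle inequality
  \;\le\; 2\|X - Dy\|^2 + 2\|D(y - y^s)\|^2             % (a+b)^2 <= 2a^2 + 2b^2
  \;\le\; 2\|X - Dy\|^2 + 2B^2\|y - y^s\|_1^2.          % \|Dz\| <= B\|z\|_1
```

Evasiveness bounds \(d\left( X,M_{S}\right) ^{2}\) from below with high probability, so at least one of \(\left\| X-Dy\right\| ^{2}\) and \(B^{2}\left\| y-y^{s}\right\| _{1}^{2}\) must be large.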
3.2 Evasive Distributions
The parameters \(\left( \alpha ,\beta ,C\right) \) of an evasive distribution transform as follows under nonsingular linear maps, scaling, translation and convolution.
Proposition 24.1
Let X be \(\left( \alpha ,\beta ,C\right) \)-evasive with values in \(\mathbb {R}^{N}\). Then
-
(i)
AX is \(\left( |||A^{-1}|||^{-2}\alpha ,|||A^{-1}|||^{4}\beta ,C\right) \)-evasive for every nonsingular \(N\times N\) matrix A;
-
(ii)
cX is \(\left( c^{2}\alpha ,c^{-4}\beta ,C\right) \)-evasive for every \(c\in \mathbb {R}\), \(c\ne 0\);
-
(iii)
\(X+z\) is \(\left( \alpha ^{\prime },\beta ,C\right) \)-evasive with \(\alpha _{k}^{\prime }=\alpha _{k+1}\), for every \(z\in \mathbb {R}^{N}\);
-
(iv)
\(X+Z\) is \(\left( \alpha ^{\prime },\beta ,C\right) \)-evasive with \(\alpha _{k}^{\prime }=\alpha _{k+1}\), for every random variable Z independent of X.
Proof
If A is nonsingular and M is any k-dimensional subspace of \(\mathbb {R}^{N}\) then for \(z\in M\)
which shows that \(d\left( AX,M\right) \ge |||A^{-1}|||^{-1}d\left( X,A^{-1}M\right) \). We therefore have for \(t\in \left( 0,|||A^{-1}|||^{-2}\alpha _{k}\right) \) that
since \(A^{-1}M\) is also k-dimensional. (ii) is just (i) applied to \(A=cI\). (iii) follows from
and the observation that the dimension of Span\(\left( M,z\right) \) is at most \(\dim M+1\). Finally (iv) follows from (iii) by first conditioning on Z:
\(\square \)
Next we show that the normalized isotropic Gaussian in \(\mathbb {R}^{N}\) is evasive.
Proposition 24.2
Let X be an isotropic Gaussian random variable with values in \(\mathbb {R}^{N}\) and \(E\left\| X\right\| ^{2}=1\). Then for any k-dimensional subspace M we have, for \(t\in \left( 0,\left( N-k-\pi ^{2}\right) /N\right) \),
$$ \Pr \left\{ d\left( X,M\right) ^{2}\le \frac{N-k-\pi ^{2}}{N}-t\right\} \le 2\exp \left( \frac{-Nt^{2}}{2\pi ^{2}}\right) . $$
Proof
For any k-dimensional subspace M we consider the Gaussian random variable \(P_{M}X\) and find
We also note the Gaussian concentration property for the norm [5]
which we will use repeatedly. For a bound on the variance of the norm we first use it together with integration by parts to get
This implies that \(E\left[ \left\| P_{M}X\right\| ^{2}\right] \le \left( E\left\| P_{M}X\right\| \right) ^{2}+\pi ^{2}/N\), and hence
Observe that the latter probability is nonzero only if \(\left\| P_{M}X\right\| \le E\left\| P_{M}X\right\| \), and that by Jensen’s inequality and (24.9) \(E\left\| P_{M}X\right\| \le \sqrt{k/N}\le 1\). Using (24.10) again we therefore obtain
and applying this inequality to the orthogonal complement \(M^{\perp }\) instead of M gives the conclusion. \(\square \)
The isotropic Gaussian is thus evasive with \(\alpha _{k}=\frac{N-k-\pi ^{2}}{ N},\beta =1/2\pi ^{2},C=2\). Using Proposition 24.1 (ii) and (iv) with \(c=\sigma \) and addition of an appropriate independent random variable Y proves the claim about noisy generative models made in the introduction.
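These constants can be sanity-checked by simulation (a numpy sketch; the dimension, sample count and value of t are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, k = 10000, 400, 40
# Isotropic Gaussian normalized so that E||X||^2 = 1.
X = rng.standard_normal((n, N)) / np.sqrt(N)
# For M = span of the first k coordinate axes, d(X, M)^2 = ||P_{M^perp} X||^2.
d2 = np.sum(X[:, k:] ** 2, axis=1)
alpha_k = (N - k - np.pi**2) / N
t = 0.3
frac = (d2 <= alpha_k - t).mean()                 # empirical evasion probability
bound = 2 * np.exp(-N * t**2 / (2 * np.pi**2))    # C * exp(-beta * N * t^2)
print(d2.mean(), frac, bound)   # mean near (N - k)/N = 0.9; frac well below the bound
```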
We now show that evasiveness is a generic behavior in high dimensions, if the distribution in question has a bounded support and a bounded density.
Theorem 24.3
Let the random variable X be distributed as \(\mu \) in \(\mathbb {R}^{N}\), where \(\mu \) is absolutely continuous w.r.t. the uniform measure \( \rho \) on the unit ball \(B_{N}\) of \(\mathbb {R}^{N}\). For every k let
Then for every k-dimensional subspace M we have, for \(t\in \left( 0,\alpha _{k}\right) \),
$$ \Pr \left\{ d\left( X,M\right) ^{2}\le \alpha _{k}-t\right\} \le \exp \left( -Nt^{2}\right) . $$
For applications of the properties of evasive distributions it is crucial that the numbers \(\alpha _{k}\) be reasonably large and decrease slowly. Figure 24.1 plots the decay of the \(\alpha _{k}\) when \(N=100\) and \(\left\| d\mu /d\rho \right\| _{\infty }=1\) (corresponding to the uniform distribution on the ball \(B_{100}\)) or \(\left\| d\mu /d\rho \right\| _{\infty }=10^{10}\) respectively.
To prove Theorem 24.3 we need the following technical lemma. Recall that we use \(V_{n}\) to denote the volume of the unit ball in \(\mathbb {R}^{n}\).
Lemma 24.1
For every \(N,k\in \mathbb {N}\) and \(1\le k<N\) we have
Proof
For simplicity we only prove the case where k and N are even. The formula \(V_{2m}=\pi ^{m}/m!\) for the volume of the unit ball in \(\mathbb {R}^{2m}\)
shows that
$$ \frac{V_{k}V_{N-k}}{V_{N}}=\left( \begin{array}{c}N/2\\ k/2\end{array}\right) \le \left( \frac{eN}{k}\right) ^{k/2}, $$
where the last inequality is a well-known bound on the binomial coefficients. The result follows. \(\square \)
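The even-dimensional volume identity used here is easy to verify numerically (plain Python; the values of N and k are arbitrary even choices):

```python
import math

def ball_volume(n):
    """Volume of the unit ball in R^n: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

# For even N and k, V_k * V_{N-k} / V_N equals the binomial coefficient C(N/2, k/2).
N, k = 20, 8
ratio = ball_volume(k) * ball_volume(N - k) / ball_volume(N)
print(ratio, math.comb(N // 2, k // 2))   # agree up to rounding
```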
Proof
(Proof of Theorem 24.3) First we work relative to Lebesgue measure \(\lambda \). Let
Fix a k-dimensional subspace M. We prove the bound by considering the worst possible density which maximizes the probability in (24.11), subject to the constraint that \(\left\| d\mu /d\lambda \right\| _{\infty }=a\) and that \(\mu \) be supported in the unit ball. Relaxing the constraint on the support of \(\mu \) from the unit ball to a larger set will only increase the probability. We can therefore compute a bound on the probability by considering a distribution \(\mu ^{\prime }\) which maximizes it, subject to the constraint that \(\left\| d\mu ^{\prime }/d\lambda \right\| _{\infty }=a\) and that \(\mu ^{\prime }\) is supported in the cylinder \(\left( M\cap B_{N}\right) \times M^{\perp }\), which contains the unit ball \(B_{N}\). Clearly a solution to this optimization problem is given by the density
where \(r_{\max }\) is determined from the normalization requirement on \(\mu ^{\prime }\). This density \(d\mu ^{\prime }/d\lambda \) has the maximal value a on a slab of thickness \(2r_{\max }\), parallel and symmetric to \(M\cap B_{N}\) and it is zero elsewhere. If \(V_{n}\) denotes the volume of the unit ball in \( \mathbb {R} ^{n}\) the volume of this slab is \(V_{k}V_{N-k}r_{\max }^{N-k}\), from which we find
A similar computation for the volume of an analogous slab of thickness \(2\sqrt{\alpha _{k}-t}\) gives
where the probability is computed according to \(\mu ^{\prime }\). Now we have to show that this is bounded by \(e^{-Nt^{2}}\) for \(t\in \left( 0,\alpha _{k}\right) \).
We get from the lemma that
The last step follows from
Thus
and substitution in (24.12) gives the conclusion. \(\square \)
Notes
- 1.
Throughout the chapter, with some abuse of notation we use D to denote both the dictionary matrix and the dictionary \(D=\{d_{1},\dots ,d_{K}\}\subseteq \mathbb {R}^{N}\).
References
1. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28(3), 253–263 (2008)
2. Candès, E.J.: The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique 346(9), 589–592 (2008)
3. Elad, M., Aharon, M., Bruckstein, A.M.: On the uniqueness of overcomplete dictionaries and a practical way to retrieve them. Linear Algebra Appl. 416(1), 48–67 (2006)
4. Gribonval, R., Schnass, K.: Dictionary identification: sparse matrix-factorization via \(\ell _1\)-minimization. IEEE Trans. Inf. Theory 56(7), 3523–3539 (2010)
5. Ledoux, M., Talagrand, M.: Probability in Banach Spaces. Springer, Berlin (1991)
6. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
7. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60 (2010)
8. Maurer, A., Pontil, M.: K-dimensional coding schemes in Hilbert spaces. IEEE Trans. Inf. Theory 56(11), 5839–5846 (2010)
9. Maurer, A., Pontil, M., Romera-Paredes, B.: Sparse coding for multitask and transfer learning. In: Proceedings of the 30th International Conference on Machine Learning, pp. 343–351 (2013)
10. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583), 607–609 (1996)
11. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis. Res. 37(23), 3311–3325 (1997)
12. Ranzato, M.A., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: Scholkopf, B., Platt, J.C., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, pp. 1137–1144. MIT Press, Cambridge (2007)
© 2015 Springer International Publishing Switzerland
Maurer, A., Pontil, M., Baldassarre, L. (2015). Lower Bounds for Sparse Coding. In: Vovk, V., Papadopoulos, H., Gammerman, A. (eds) Measures of Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-21852-6_24
Print ISBN: 978-3-319-21851-9
Online ISBN: 978-3-319-21852-6