1 Introduction

A blessing of dimensionality often ascribed to data sampled from genuinely high dimensional probability distributions is that pairs (and even arbitrary compact subsets) of points may be easily separated from one another with high probability [2, 4, 5, 6, 7, 9, 13]. Such a property is naturally highly appealing for Machine Learning and Artificial Intelligence, since it suggests that if sufficiently many attributes can be obtained for each data point, then classification is a significantly easier task.

Fig. 1. Two unit balls separated by distance \(\epsilon \), and the optimal classifier (dotted) separating the two.

However, although this provides a useful rule of thumb, it is far from a complete description of the behaviour which may be expected of high dimensional data, and a simple experiment shows that the precise relationship between data dimension and classification performance is more subtle (see also [8], Theorem 5 and Corollary 2). Suppose that data are sampled from two classes, each described by a uniform distribution in a unit ball in \(\mathbb {R}^d\), and that the centres of these balls are at distance \(\epsilon \ge 0\) from one another, as shown in Fig. 1. The classifier which offers the optimal (balanced) accuracy in this case is given by the hyperplane which is normal to the vector connecting the two centres and positioned half way between them. In Fig. 2 we plot the accuracy of this classifier as a function of the distance separating the two centres for data sampled from various different ambient dimensions d. The insight behind the blessing of dimensionality described above is immediately clear: when the data are sampled in high dimensions, for values of \(\epsilon \) greater than some threshold value \(\epsilon _0(d)\) depending on the ambient dimension d, the accuracy of this simple linear classifier is virtually 100%. Yet, what this simplified viewpoint misses is that, for \(\epsilon < \epsilon _0(d)\), the probability of correctly classifying a given point sharply drops to close to 50%, demonstrating that raw dimensionality alone is no panacea for data classification. On the other hand, data sampled even in 1 dimension may be accurately classified when the centre separation \(\epsilon \) is sufficiently large: for \(\epsilon \ge 2\) (when the two unit balls are disjoint), the two data sets are fully separable in any dimension.

What this simple thought experiment demonstrates is a fact which is not taken into account by previous work, such as [12]:

Determining whether data distributions are separable from each other must depend on a relative property of the two, and even genuine high dimensionality alone is neither a necessary nor a sufficient condition for data separability.

Fig. 2. Accuracy of the best linear classifier separating data uniformly sampled from two balls with unit radius and centres in \(\mathbb {R}^n\) separated by distance \(\epsilon \), for different dimensions n.

To lay the foundations of our approach, we propose the new concept of the intrinsic dimension of a data distribution, based directly on the separability properties of sampled data points.

Definition 1

(Intrinsic dimension). We say that data sampled from a distribution \(\mathcal {D}\) on \(\mathbb {R}^d\) has intrinsic dimension \(n(\mathcal {D}) \in \mathbb {R}\) with respect to a centre \(c \in \mathbb {R}^d\) if

$$\begin{aligned} P(x, y \sim \mathcal {D} : (x - y, y - c) \ge 0) = \frac{1}{2^{n(\mathcal {D}) + 1}}. \end{aligned}$$
(1)

This definition is designed in such a way that the rule of thumb in the blessing of dimensionality described above becomes a law of high intrinsic dimension: points sampled from a distribution with high intrinsic dimension are highly separable. The definition is calibrated so that the uniform distribution \(\mathcal {U}(\mathbb {B}_d)\) on a d-dimensional unit ball \(\mathbb {B}_d\) satisfies \(n(\mathcal {U}(\mathbb {B}_d)) = d\) (see Theorem 1), although alternative normalisations are possible, and by symmetry \(n(\mathcal {D}) \ge 0\) for all distributions \(\mathcal {D}\). For \(c=0\), the expression \((x - y, y - c) \ge 0\) in the left-hand side of (1) is simply a statement that x and y are Fisher-separable [8].
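Since (1) is a single scalar functional of \(\mathcal {D}\), the intrinsic dimension can be approximated directly from samples by a Monte Carlo estimate of the left-hand side. The following Python sketch (our own illustrative code, not part of any library; names are of our choosing) estimates \(n(\mathcal {D})\) from a sample matrix and checks the calibration \(n(\mathcal {U}(\mathbb {B}_d)) = d\) stated above.

```python
import numpy as np

def intrinsic_dimension(samples, centre, n_pairs=200_000, seed=0):
    """Monte Carlo estimate of the intrinsic dimension of Definition 1.

    Estimates p = P(x, y ~ D : (x - y, y - c) >= 0) from random pairs of
    rows of `samples` and returns n = -log2(p) - 1, so that p = 2**-(n + 1).
    """
    rng = np.random.default_rng(seed)
    m = samples.shape[0]
    i = rng.integers(0, m, size=n_pairs)
    j = rng.integers(0, m, size=n_pairs)
    keep = i != j                                    # drop degenerate pairs x = y
    x, y = samples[i[keep]], samples[j[keep]]
    p = np.mean(np.sum((x - y) * (y - centre), axis=1) >= 0)
    return -np.log2(p) - 1.0

# Calibration check (Theorem 1): for the uniform distribution on the unit
# ball in R^d the estimate should be close to d.
d, m = 10, 100_000
rng = np.random.default_rng(1)
z = rng.standard_normal((m, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)
ball = z * rng.random((m, 1)) ** (1.0 / d)           # uniform points in the unit ball
print(intrinsic_dimension(ball, centre=np.zeros(d)))  # approximately 10
```

With d = 10 the target probability is \(2^{-11}\), so a few hundred thousand pairs are needed for a stable estimate; higher dimensions require correspondingly more pairs.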

Based on the same principle, we further introduce the concept of the relative intrinsic dimension of two data distributions, which directly describes the ease of separating data distributions.

Definition 2

(Relative intrinsic dimension). We say that data sampled from a distribution \(\mathcal {D}\) on \(\mathbb {R}^d\) has relative intrinsic dimension \(n(\mathcal {D}, \mathcal {D}^{\prime }) \in \mathbb {R}\) to data sampled from a distribution \(\mathcal {D}^{\prime }\) on \(\mathbb {R}^d\), with respect to a centre \(c \in \mathbb {R}^d\), if

$$\begin{aligned} P(x \sim \mathcal {D}^{\prime }, y \sim \mathcal {D} : (x - y, y - c) \ge 0) = \frac{1}{2^{n(\mathcal {D}, \mathcal {D}^{\prime }) + 1}}. \end{aligned}$$
(2)

The relative intrinsic dimension is not symmetric, and satisfies \(n(\mathcal {D}, \mathcal {D}^{\prime }) \ge -1\), with negative values indicating that \(\mathcal {D}\) has lower intrinsic dimension than \(\mathcal {D}^{\prime }\). Points sampled from a distribution with a high relative intrinsic dimension may easily be separated from the distribution it is measured against, while a low relative intrinsic dimension indicates that such separation is unlikely.

To illustrate this, consider our previous experiment as an example and let \(X = \mathcal {U}(B_1)\) and \(Y = \mathcal {U}(B_2)\), where \(B_1=\mathbb {B}_d(1,c_1) \subset \mathbb {R}^d\) and \(B_2=\mathbb {B}_d(1,c_2) \subset \mathbb {R}^d\) are the unit balls centered at \(c_1\) and \(c_2\) respectively, and pick the centre \(c = c_1\). When \(\epsilon = \Vert c_1 - c_2\Vert \ge 2\) (the case when the data distributions are completely separable), we have \(n(Y, X) = \infty \). This implies that points y sampled from Y can be separated from points x sampled from X with certainty. The relative intrinsic dimension \(n(X, Y)\) is an increasing function of the dimension of the ambient space in which the data are sampled, with \(n(X, Y) = 0\) in 1 dimension, implying that it becomes easier to separate points in X from points in Y as the dimension increases. These values of the relative intrinsic dimensions suggest that points from Y can easily be separated from points in X by hyperplanes normal to \(y - c_1\), while hyperplanes normal to \(x - c_1\) do not separate X from Y.

Although the asymmetry may be slightly surprising at first, it simply reflects the asymmetric choice of centre \(c = c_1\), which is located at the heart of the X distribution. The relative intrinsic dimensions described above would be reversed for \(c = c_2\) and would be equal for \(c = \frac{1}{2}(c_1 + c_2)\). A justification for this definition of relative intrinsic dimension is given by Theorem 2, where it is shown (in a slightly generalised setting) that these concepts of intrinsic dimension provide upper and lower bounds on classifier accuracy, indicating that high (relative) intrinsic dimension is indeed necessary and sufficient for learning.
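The same Monte Carlo approach applies to the relative intrinsic dimension of Definition 2. The sketch below (again illustrative code with names of our own choosing) reproduces the two-ball example for a separation \(\epsilon \ge 2\), where \(n(Y, X)\) is infinite while \(n(X, Y)\) remains finite.

```python
import numpy as np

def sample_ball(n, d, centre, rng):
    """Draw n points uniformly from the unit ball in R^d centred at `centre`."""
    z = rng.standard_normal((n, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return centre + z * rng.random((n, 1)) ** (1.0 / d)

def relative_intrinsic_dimension(samples_D, samples_Dprime, centre):
    """Monte Carlo estimate of n(D, D') from Definition 2, pairing row i of
    samples_Dprime (x ~ D') with row i of samples_D (y ~ D)."""
    x, y = samples_Dprime, samples_D
    p = np.mean(np.sum((x - y) * (y - centre), axis=1) >= 0)
    return -np.log2(p) - 1.0 if p > 0 else np.inf

d, eps, n = 10, 2.5, 500_000
rng = np.random.default_rng(0)
c1, c2 = np.zeros(d), np.r_[eps, np.zeros(d - 1)]
X = sample_ball(n, d, c1, rng)                    # X = U(B_d(1, c1))
Y = sample_ball(n, d, c2, rng)                    # Y = U(B_d(1, c2))
print(relative_intrinsic_dimension(Y, X, c1))     # inf: y ~ Y is always separable from X
print(relative_intrinsic_dimension(X, Y, c1))     # finite: x ~ X is harder to separate from Y
```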

There is a rich history of alternative characterisations of the dimension of a data set, with each contribution typically aimed at solving a particular problem. For example, conventional Principal Component Analysis aims to detect the number of independent attributes which are actually required to represent the data, leading to compressed representations of the same data. However, as discussed above, the representational dimension of a data set does not necessarily give an indication of how easy it is to learn from. Several other notions of dimensionality are captured in the scikit-dimension library [3]. Perhaps the most similar notion of dimension to that which we propose here is the Fisher Separability Dimension [1], which is also based on the separability properties of data yet first requires a whitening step to normalise the data covariance to an identity matrix. This whitening step has both advantages and disadvantages: although it brings invariance to the choice and scaling of the basis, it disrupts the intrinsic geometry of the data. The Fisher Separability Dimension also does not address the important question of the relative dimension of data distributions and samples, which we argue is a concept fundamental to learning.

Our approach may appear reminiscent of Kernel Embeddings, through which nonlinear kernels are used to embed families of data distributions into a Hilbert space structure [11]. Although Kernel Embeddings and our work are motivated by very different classes of problems, the common fundamental focus is on understanding the properties of a data distribution through the evaluation of (nonlinear) functionals of the distribution. Here we demonstrate how a single, targeted property appears to encode important information about the separability properties of data.

An interesting question which arises from this work is how well the (relative) intrinsic dimension can be estimated from data samples directly. If it can be, then this could provide a new tool for selecting appropriate feature mappings for data and shine a new light on the training of neural networks. We briefly investigate this in Sect. 4, where we show that high order polynomial feature maps can actually be detrimental to the separability of data.

2 Separability of Uniformly Distributed Data

We investigate the separability properties of data sampled from a uniform distribution in the unit ball in various dimensions. This provides the basis for our definition of intrinsic dimension.

To simplify the presentation of our results, we introduce the following geometric quantities related to spheres in high dimensions. The volume of a ball with radius r in d dimensions is denoted by

$$ V_{d}^{{\text {ball}}}(r) = \frac{\pi ^{d/2}r^d}{\Gamma (\frac{d}{2} + 1)}, $$

and the surface area of the same ball is denoted by

$$ S_{d}^{{\text {ball}}}(r) = \frac{d\pi ^{d/2}r^{d-1}}{\Gamma (\frac{d}{2} + 1)}. $$

Similarly, the volume of the spherical cap with height h of the same sphere (i.e. the set of points \(\{x \in \mathbb {R}^d :\Vert x\Vert \le r \,\,\text {and}\,\, x_0 \ge r - h\}\)) is given by \( V_{d}^{{\text {cap}}}(r, h) = V_{d}^{{\text {ball}}}(r) W_d^{{\text {cap}}}(r, h), \) where

$$ W_d^{{\text {cap}}}(r, h) = {\left\{ \begin{array}{ll} 0 &{} \text {for } h \le 0, \\ \frac{1}{2} I_{(2rh - h^2) / r^2}(\frac{d + 1}{2}, \frac{1}{2}) &{} \text {for } 0< h \le r, \\ 1 - W_d^{{\text {cap}}}(r, 2r - h) &{} \text {for } r< h \le 2r, \\ 1 &{} \text {for } 2r < h, \end{array}\right. } $$

represents the fraction of the volume of the unit ball contained in the spherical cap. The function \(I_x(a, b) = B(a, b)^{-1} \int _0^x t^{a-1}(1-t)^{b-1} dt\) denotes the regularised incomplete beta function, where \(B(a, b) = B(1; a, b) = \frac{\Gamma (a)\Gamma (b)}{\Gamma (a + b)}\) is the standard beta function.
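For later numerical use, \(W_d^{{\text {cap}}}\) is straightforward to evaluate with a library implementation of the regularised incomplete beta function. The following sketch is one possible implementation, assuming SciPy, whose `scipy.special.betainc(a, b, x)` computes \(I_x(a, b)\).

```python
from scipy.special import betainc

def W_cap(d, r, h):
    """Fraction of the volume of a d-ball of radius r lying in a spherical cap
    of height h, following the piecewise formula above."""
    if h <= 0:
        return 0.0
    if h <= r:
        return 0.5 * betainc((d + 1) / 2, 0.5, (2 * r * h - h * h) / r**2)
    if h <= 2 * r:
        return 1.0 - W_cap(d, r, 2 * r - h)
    return 1.0

print(W_cap(5, 1.0, 1.0))   # a half-ball is a cap of height r, so this prints 0.5
```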

Fig. 3. The behaviour of \(f_{\theta }(d)\), formally extended to non-integer values of d, for various values of \(\theta \). The function is only invertible for \(-1 \le \theta \le 0\), and we note the asymptote of \(\frac{1}{2}\) as \(d \rightarrow 0\) when \(\theta = 0\) and as \(d \rightarrow \infty \) when \(\theta = -1\).

Theorem 1

(Separability of uniformly sampled points). Let \(\theta \in \mathbb {R}\), let d be a positive integer, and suppose that \(x, y \sim \mathcal {U}(\mathbb {B}_d(1, c))\). Define

$$\begin{aligned} R_{\theta }(t) = \max \Big \{ \frac{t^2}{4} - \theta , 0 \Big \}^{\frac{1}{2}}, \, a_{\theta }(t) = \frac{1 - R_{\theta }^2(t)}{t} - \frac{t}{4}, \end{aligned}$$
(3)

and

$$\begin{aligned} b_{\theta }(t) = 1 - a_{\theta }(t) - \frac{t}{2}, \end{aligned}$$
(4)

and let

$$\begin{aligned} f_{\theta }(d)&= \int _{0}^1 dt^{d-1} \big (W_{d}^{{\text {cap}}}(1, b_{\theta }(t)) + R_{\theta }^d(t) W_{d}^{{\text {cap}}}(R_{\theta }(t), R_{\theta }(t) + a_{\theta }(t)) \big ) dt. \end{aligned}$$
(5)

Then

$$\begin{aligned} P(x, y : (y - x, x - c) \ge \theta ) = f_{\theta }(d), \end{aligned}$$
(6)

and, in particular,

$$\begin{aligned} P(x, y : (y - x, x - c) \ge 0) = \frac{1}{2^{d+1}}. \end{aligned}$$
(7)

Furthermore, \(f_{\theta }\) may be simplified in the following cases as

$$\begin{aligned} f_{\theta }(d) = {\left\{ \begin{array}{ll} 1 &{}\text { for } \theta \le -2, \\ \frac{1}{2^{d + 1}} &{}\text { for } \theta = 0, \\ \int _{2\theta ^{1/2}}^1 dt^{d-1} \Big ( \frac{t^2}{4} - \theta \Big )^{d/2} dt &{}\text { for } 0< \theta < \frac{1}{4}, \\ 0 &{}\text { for } \frac{1}{4}\le \theta . \end{array}\right. } \end{aligned}$$
(8)

and \(f_{\theta }(d) \ge \frac{1}{2}\) for \(\theta \le -1\).

Fig. 4. The shaded area is the volume computed in the proof of Theorem 1. The two different shading colours indicate the two spherical caps used in the proof.

Proof

Without loss of generality, we suppose that \(c = 0\), and consider points \(x, y \sim \mathcal {U}(\mathbb {B}_d)\). Rearranging terms, we observe that

$$ (y - x, x) = \frac{1}{4}\Vert y\Vert ^2 - \Vert x - \frac{y}{2}\Vert ^2, $$

and therefore, for fixed y, the set of x satisfying \((y - x, x - c) \ge \theta \) may be similarly described as those points x contained within the ball

$$ \Vert x - \frac{y}{2}\Vert ^2 \le R_{\theta }^2(\Vert y\Vert ) = \max \Big \{ \frac{1}{4}\Vert y\Vert ^2 - \theta , 0 \Big \}. $$

Combining this with the condition that \(x \in \mathbb {B}_d(1, 0)\), we find that x belongs to the intersection of the balls

$$\begin{aligned} \{ x \in \mathbb {R}^d :\Vert x\Vert \le 1 \} \cap \Big \{ x \in \mathbb {R}^d :\Vert x - \frac{y}{2}\Vert ^2 \le R_{\theta }^2(\Vert y\Vert ) \Big \}. \end{aligned}$$
(9)

This may be expressed as the union of two spherical caps, as depicted in Fig. 4. Comparing the triangles Opq and \(\frac{y}{2}, p, q\) shows that the lengths a and b in the figure are exactly those defined in (3) and (4) with \(t = \Vert y\Vert \). Since y only appears through its norm, we deduce that

$$\begin{aligned} P(x :\,&(y - x, x) \ge \theta \,|\,\Vert y\Vert ) = P(x :(y - x, x) \ge \theta \,|\,y) \\ {}&= \frac{V_{d}^{{\text {cap}}}(R_{\theta }(\Vert y\Vert ), R_{\theta }(\Vert y\Vert ) + a_{\theta }(\Vert y\Vert )) + V_{d}^{{\text {cap}}}(1, b_{\theta }(\Vert y\Vert ))}{V_{d}^{{\text {ball}}}(1)}, \end{aligned}$$

The result (6) follows by applying the law of total probability, which implies

$$\begin{aligned}&P(x, y :(y - x, x) \ge \theta ) = \int _{0}^1 P(x : (y - x, x) \ge \theta \,|\,\Vert y\Vert = t) p_{\Vert y\Vert }(t) dt, \end{aligned}$$

where \(p_{\Vert y\Vert }(t) = \frac{S_{d}^{{\text {ball}}}(t)}{V_{d}^{{\text {ball}}}(1)}\) is the density associated with \(\Vert y\Vert \) for \(y \sim \mathcal {U}(\mathbb {B}_d)\).

When \(\theta \ge 0\), the ball centered at \(\frac{y}{2}\) is entirely contained within \(\mathbb {B}_d\), and so

$$\begin{aligned} P(x, y :(y - x, x) \ge \theta )&= \int _{0}^1 \frac{S_{d}^{{\text {ball}}}(t) V_{d}^{{\text {ball}}}(R_{\theta }(t))}{(V_{d}^{{\text {ball}}}(1))^2} dt \\ {}&= \int _0^1 dt^{d-1} \max \Big \{ \frac{t^2}{4} - \theta , 0 \Big \}^{d/2} dt. \end{aligned}$$

Since the integrand is zero for \(t \le 2\theta ^{1/2}\), for \(\theta \in (0, \frac{1}{4})\) we have

$$ P(x, y :(y - x, x) \ge \theta ) = \int _{2\theta ^{1/2}}^1 dt^{d-1} \Big ( \frac{t^2}{4} - \theta \Big )^{d/2} dt. $$

Moreover, \(P(x, y :(y - x, x) \ge \theta ) = 0\) for \(\theta \ge \frac{1}{4}\), and in the simplest case of \(\theta = 0\)

$$ P(x, y :(y - x, x) \ge 0) = \frac{d}{2^d} \int _0^1 t^{2d - 1} dt = \frac{1}{2^{d + 1}}. $$

On the other hand, for \(\theta \le -2\) we have \(R_{\theta }(t) \ge 1 + \frac{1}{2}t\) for all t, implying that the intersection (9) is the entirety of \(\mathbb {B}_d\), and hence

$$ P \Big ( x, y :(y - x, x) \ge \theta \Big ) = 1. $$

   \(\square \)

The behaviour of \(f_{\theta }(d)\) is illustrated in Fig. 3 for various values of the separation threshold \(\theta \). Heuristically, we observe the following limiting behaviour:

$$ \lim _{d \rightarrow \infty } f_{\theta }(d) = {\left\{ \begin{array}{ll} 1 &{}\text { for } \theta < -1, \\ \frac{1}{2} &{}\text { for } \theta = -1, \\ 0 &{}\text { for } \theta > -1, \end{array}\right. } $$

which may be explained by the fact that when \(\theta = -1\), the surfaces of the ball \(\mathbb {B}_d\) and the ball centered at \(\frac{y}{2}\) meet exactly at an equator of \(\mathbb {B}_d\). The phenomenon of waist concentration (see [10], for example) implies that in high dimensions the volume of \(\mathbb {B}_d\) is concentrated around its surface and around this equator, implying that this is the threshold value of \(\theta \) at which the intersection of the two balls contains slightly more than half the volume of \(\mathbb {B}_d\).

What these results suggest is that for any value of \(\theta \in [-1, 0]\), the function \(f_{\theta }(d)\) is an invertible function of d, and hence could be used as the basis of a definition of intrinsic dimension. In Definition 1 we use the behaviour at \(\theta = 0\) to define our indicative notion of intrinsic dimension simply because it obviates the need to couple the scaling of the support of the distribution and the scaling of \(\theta \).
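The function \(f_{\theta }(d)\) is straightforward to evaluate numerically from (5). The sketch below (our own code, assuming SciPy, and repeating the \(W_d^{{\text {cap}}}\) helper so that it is self-contained) evaluates the integral by one-dimensional quadrature and checks the special case (7).

```python
import numpy as np
from scipy.special import betainc
from scipy.integrate import quad

def W_cap(d, r, h):
    # Volume fraction of a d-ball of radius r inside a cap of height h.
    if h <= 0:
        return 0.0
    if h <= r:
        return 0.5 * betainc((d + 1) / 2, 0.5, (2 * r * h - h * h) / r**2)
    if h <= 2 * r:
        return 1.0 - W_cap(d, r, 2 * r - h)
    return 1.0

def f(theta, d):
    """Numerical evaluation of f_theta(d) in (5) by quadrature over t = ||y||."""
    def integrand(t):
        R = np.sqrt(max(t * t / 4 - theta, 0.0))
        a = (1 - R * R) / t - t / 4
        b = 1 - a - t / 2
        return d * t ** (d - 1) * (W_cap(d, 1.0, b) + R**d * W_cap(d, R, R + a))
    return quad(integrand, 0.0, 1.0)[0]

for d in (1, 2, 5, 10):
    print(d, f(0.0, d), 1 / 2 ** (d + 1))    # the last two columns agree, as in (7)
```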

3 Few Shot Learning Is Dependent on Separability

We now consider the scenario of standard binary data classification, and show that the probability of successfully learning to classify data is intrinsically linked to the notion of relative intrinsic dimension. We focus on the case of learning from small data sets, as in this case the link is particularly clear to demonstrate.

Mathematically, we suppose that X and Y are (unknown) probability distributions on a d-dimensional vector space \(\mathbb {R}^d\), and we have a sample \(\{y_{i}\}_{i=1}^{k}\) of k training points sampled from Y and a sample \(\{x_{i}\}_{i=1}^{m}\) of m training points sampled from X.

Since the problem setup is symmetric in the roles of X and Y, we only analyse the influence of training data sampled from Y. The role of the data sampled from X (alongside any possible prior knowledge of the data distributions) is incorporated through an arbitrary but fixed point \(c \in \mathbb {R}^d\) in the data space.

We consider the following linear classifier to assign the label \(\ell _{X}\) to data sampled from X and the label \(\ell _{Y}\) to data sampled from Y:

$$\begin{aligned} F_{\theta }(z) = {\left\{ \begin{array}{ll} \ell _{Y} &{}\text {if } L(z) \ge \theta , \\ \ell _{X} &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$
(10)

where \(L(z) = \frac{1}{k} \sum _{i = 1}^{k} (z - y_i, y_i - c)\). In practice, the value of the threshold \(\theta \) to be used in the classifier may be determined from the training data \(\{y_i\}_{i=1}^{k}\) and \(\{x_i\}_{i=1}^{m}\), although here we consider it to be a free parameter of the classifier.
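For concreteness, a minimal implementation of the classifier (10) might look as follows (a sketch in Python; the names are our own, and the threshold \(\theta \) is left as a free parameter, as in the text).

```python
import numpy as np

def make_classifier(train_Y, c, theta=0.0, label_Y="Y", label_X="X"):
    """Build F_theta from (10): assign label_Y to z whenever
    L(z) = (1/k) * sum_i <z - y_i, y_i - c> is at least theta."""
    train_Y, c = np.asarray(train_Y), np.asarray(c)

    def F(z):
        L = np.mean(np.sum((np.asarray(z) - train_Y) * (train_Y - c), axis=1))
        return label_Y if L >= theta else label_X

    return F
```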

Remark 1

(Comparison with similar classifiers). The classifier (10) may equivalently be expressed in the form of the common Fisher discriminant with a slightly different threshold, viz.

$$ F_{\theta }(z) = {\left\{ \begin{array}{ll} \ell _{Y} &{}\text {if } (z - \mu , \mu - c) \ge \theta + \Theta , \\ \ell _{X} &{}\text {otherwise}, \end{array}\right. } $$

where \(\mu = \frac{1}{k}\sum _{i=1}^{k} y_i\) and \(\Theta = \frac{1}{k}\sum _{i=1}^{k} \Vert y_i \Vert ^2- \Vert \mu \Vert ^2\). Since the offset \(\Theta \) to the threshold \(\theta \) depends only on the same training data as \(\theta \), it is clear that the classifier we study is simply a Fisher discriminant. However, we choose to write the classifier in the form (10) because it simplifies some of the forthcoming analysis.
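The equivalence claimed in Remark 1 follows from a short expansion of the inner product: writing \(\mu = \frac{1}{k}\sum _{i=1}^{k} y_i\),

$$\begin{aligned} L(z)&= \frac{1}{k}\sum _{i=1}^{k}\big ((z, y_i) - (z, c) - \Vert y_i\Vert ^2 + (y_i, c)\big ) \\&= (z - \mu , \mu - c) + \Vert \mu \Vert ^2 - \frac{1}{k}\sum _{i=1}^{k}\Vert y_i\Vert ^2 = (z - \mu , \mu - c) - \Theta , \end{aligned}$$

so that \(L(z) \ge \theta \) holds exactly when \((z - \mu , \mu - c) \ge \theta + \Theta \).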

This classifier will successfully learn to classify data from the two classes when both

$$ P(F_{\theta }(y) = \ell _{Y}) = P(L(y) \ge \theta ) $$

is large (where the probability is taken with respect to the evaluation point \(y \sim Y\) and the training data \(\{y_i \sim Y\}_{i=1}^{k}\)), and

$$ P(F_{\theta }(x) = \ell _{X}) = P(L(x) < \theta ) $$

is also large (where the probability is taken with respect to the evaluation point \(x \sim X\) and the training data \(\{y_i \sim Y\}_{i=1}^{k}\)). We now show that both of these probabilities can be bounded from above and below by the probability of being able to separate pairs of data points by margin \(\theta \). Corollary 1 to this theorem then shows how this simply reduces to upper and lower bounds dependent on the (relative) intrinsic dimension of Y and X when \(\theta = 0\).

Theorem 2

(Pairwise separability and learning). Let \(\theta \in \mathbb {R}\) and define

$$p_\theta (Y, X) = P(x \sim X, y \sim Y : (x - y, y - c) \ge \theta ),$$

and let \(p_{\theta }(Y) = p_{\theta }(Y, Y)\). Then, the probability (with respect to the training sample \(\{y_i \sim Y\}_{i=1}^{k}\) and the evaluation point \(y \sim Y\)) of successfully learning the class Y is bounded by

$$\begin{aligned} p_\theta ^k(Y) \le P(F_{\theta }(y) = \ell _{Y}) \le 1 - (1 - p_\theta (Y))^k, \end{aligned}$$
(11)

and the probability (with respect to the training sample \(\{y_i \sim Y\}_{i=1}^{k}\) and the evaluation point \(x \sim X\)) of successfully learning the class X is bounded by

$$\begin{aligned} (1 - p_\theta (Y, X))^{k} \le P(F_{\theta }(x) = \ell _{X}) \le 1 - p_\theta ^{k}(Y, X). \end{aligned}$$
(12)

Proof

Let E be the event that \(F_{\theta }(y) = \ell _{Y}\) for \(y \sim Y\). By definition, this occurs when y and \(\{y_i\}_{i=1}^{k}\) are such that \(\sum _{i=1}^{k} (y - y_i, y_i - c) \ge k \theta \). For each \(1 \le i \le k\), let \(A_i\) denote the event that \((y - y_i, y_i - c) \ge \theta \). Then, \(\bigwedge _{i=1}^{k} A_i \Rightarrow E\) and so \( P(E) \ge P(\bigwedge _{i=1}^{k} A_i). \) We may further expand this using the law of total probability as

$$\begin{aligned} P \Big ( \bigwedge _{i=1}^{k} A_i \Big ) = \int _{\mathbb {R}^d} P \Big ( \bigwedge _{i=1}^{k} (y - y_i, y_i - c) \ge \theta \,|\,y \Big ) p(y) dy. \end{aligned}$$
(13)

Since the \(\{y_i\}_{i=1}^{k}\) are independently sampled and identically distributed, it follows that the conditional probability satisfies

$$\begin{aligned}&P \Big ( \{y_i \sim Y\}_{i=1}^{k} :\! \bigwedge _{i=1}^{k} (y - y_i, y_i - c) \ge \theta \,|\,y \Big ) \!= \! P(y^{\prime } \sim Y :(y - y^{\prime }, y^{\prime } - c) \ge \theta \,|\,y)^{k}. \end{aligned}$$

Substituting this into (13) shows that \(P \big ( \bigwedge _{i=1}^{k} A_i \big ) = \mathbb {E}_{Y} \big [ \big ( P(y^{\prime } \sim Y :(y - y^{\prime }, y^{\prime } - c) \ge \theta \,|\,y) \big )^{k} \big ]\), where the expectation is taken with respect to y. For a random variable Z and a convex function g, Jensen’s inequality asserts that \(\mathbb {E}[g(Z)] \ge g(\mathbb {E}[Z])\). Applying this here (since the function \(g(x) = x^k\) is convex for \(k \ge 1\)), we find that

$$\begin{aligned} P \Big ( \bigwedge _{i=1}^{k} A_i \Big )&\ge \big ( \mathbb {E}_{Y} [ P(y^{\prime } :(y - y^{\prime }, y^{\prime } - c) \ge \theta \,|\,y) ] \big )^{k} \\ {}&= \big ( P(y, y^{\prime } : (y - y^{\prime }, y^{\prime } - c) \ge \theta ) \big )^{k}. \end{aligned}$$

Consequently, we deduce the lower bound of (11). The upper bound follows by arguing similarly and using the fact that \(\bigwedge _{i=1}^{k} {\text {not}} A_i \Rightarrow {\text {not}} E\), from which it follows that \(P(E) \le 1 - P(\bigwedge _{i=1}^{k} {\text {not}} A_i)\). An analogous argument shows the result (12).    \(\square \)

An immediate consequence of this theorem is that when \(\theta = 0\), the probability of successfully learning can be bounded from both above and below using the (relative) intrinsic dimension of the data distributions.

Corollary 1

(Intrinsic dimension and learning). The probability (with respect to the training sample \(\{y_i \sim Y\}_{i=1}^{k}\) and the evaluation point \(y \sim Y\)) of successfully learning the class Y is bounded by

$$\begin{aligned} \frac{1}{2^{k(n(Y) + 1)}} \le P(F_{0}(y) = \ell _{Y}) \le 1 - \Big (1 - \frac{1}{2^{n(Y) + 1}}\Big )^k, \end{aligned}$$
(14)

and the probability (with respect to the training sample \(\{y_i \sim Y\}_{i=1}^{k}\) and the evaluation point \(x \sim X\)) of successfully learning the class X is bounded by

$$\begin{aligned} \Big (1 - \frac{1}{2^{n(Y, X) + 1}}\Big )^{k} \le P(F_{0}(x) = \ell _{X}) \le 1 - \frac{1}{2^{k(n(Y, X) + 1)}}. \end{aligned}$$

We note that the best lower bound which can be shown by (14) is \(\frac{1}{2}\), due to the fact that the classifier with \(\theta = 0\) will pass through the centre of the Y distribution. Despite this, Corollary 1 shows that the intrinsic dimension of Y is sufficient to know whether the probability of correctly learning the class Y is less than \(\frac{1}{2}\). Arguing symmetrically, a more refined analysis taking greater account of the training set \(\{x_i\}_{i=1}^{m}\) could instead show a version of the bound (14) which depends on the relative intrinsic dimension \(n(X, Y)\).

These bounds are tuned to the case when the size k of the training set sampled from Y is small: the upper and lower bounds separate from each other as k grows, and alternative arguments would be required to obtain sharp bounds in the case of large k. However, even for large values of k, if the (relative) intrinsic dimension of the data distributions is sufficiently large or small, the bounds above will provide tight guarantees on the success of learning.
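The bounds of Corollary 1 are easy to probe empirically. The sketch below (illustrative code, reusing the two-ball setup of Sect. 1 with \(c = c_1\); all names are ours) estimates \(P(F_0(y) = \ell _Y)\) and \(P(F_0(x) = \ell _X)\) over repeated draws of the training set and prints them alongside the bounds computed from Monte Carlo estimates of \(p_0(Y)\) and \(p_0(Y, X)\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, eps, trials = 10, 3, 1.0, 20_000
c1, c2 = np.zeros(d), np.r_[eps, np.zeros(d - 1)]

def ball(n, centre):
    # n points drawn uniformly from the unit ball centred at `centre`
    z = rng.standard_normal((n, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return centre + z * rng.random((n, 1)) ** (1.0 / d)

def L(z, train_Y, c):
    return np.mean(np.sum((z - train_Y) * (train_Y - c), axis=1))

hits_Y = hits_X = 0
for _ in range(trials):
    train_Y = ball(k, c2)                              # k training points from Y
    hits_Y += L(ball(1, c2)[0], train_Y, c1) >= 0      # event F_0(y) = l_Y
    hits_X += L(ball(1, c1)[0], train_Y, c1) < 0       # event F_0(x) = l_X

# Monte Carlo estimates of p_0(Y) and p_0(Y, X) with respect to c = c1.
m = 400_000
y1, y2 = ball(m, c2), ball(m, c2)
pY = np.mean(np.sum((y1 - y2) * (y2 - c1), axis=1) >= 0)
x1, y3 = ball(m, c1), ball(m, c2)
pYX = np.mean(np.sum((x1 - y3) * (y3 - c1), axis=1) >= 0)

print("P(F_0(y)=l_Y):", hits_Y / trials, "bounds:", pY**k, 1 - (1 - pY)**k)
print("P(F_0(x)=l_X):", hits_X / trials, "bounds:", (1 - pYX)**k, 1 - pYX**k)
```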

4 Learning with Polynomial Kernels

As an application of our proposed notion of intrinsic dimension, we use it to find the optimal polynomial kernel for a classification problem — i.e. the degree of the polynomial feature map in which two data sets become easiest to separate.

For fixed bias \(b > 1\) and polynomial degree \(k \ge 0\), let the polynomial kernel \(\kappa : \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\) be given by

$$\begin{aligned} \kappa (x, y) = (b^2 + x \cdot y)^k. \end{aligned}$$
(15)

There exists a polynomial feature map \(\phi : \mathbb {R}^d \rightarrow \mathbb {R}^N\), where \(N = \left( {\begin{array}{c}d + k\\ k\end{array}}\right) \), such that \(\kappa (x, y) = (\phi (x), \phi (y))\) (see [12], for example, for details).
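One explicit choice of such a feature map indexes the coordinates of \(\phi (x)\) by multi-indices of total degree k over the \(d+1\) "variables" \((b, x_1, \ldots , x_d)\), with multinomial square-root weights; any other choice differs only by an orthogonal transformation and leaves all inner products unchanged. The sketch below (our own illustrative construction, not a library routine) verifies the identity \(\kappa (x, y) = (\phi (x), \phi (y))\) and the count \(N = \left( {\begin{array}{c}d + k\\ k\end{array}}\right) \) numerically.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import factorial, prod, comb

def poly_feature_map(x, k, b):
    """One explicit feature map for the kernel (b^2 + x.y)^k: the component for
    the multi-index (a_0, ..., a_d) of total degree k is
    sqrt(k! / (a_0! ... a_d!)) * b**a_0 * prod_i x_i**a_i."""
    d = len(x)
    xb = np.concatenate(([b], x))        # treat b as an extra zeroth coordinate
    feats = []
    for idx in combinations_with_replacement(range(d + 1), k):
        a = np.bincount(idx, minlength=d + 1)                  # the multi-index
        coeff = np.sqrt(factorial(k) / prod(factorial(ai) for ai in a))
        feats.append(coeff * np.prod(xb ** a))
    return np.array(feats)

rng = np.random.default_rng(0)
d, k, b = 3, 4, 1.5
x, y = rng.standard_normal(d), rng.standard_normal(d)
phi_x, phi_y = poly_feature_map(x, k, b), poly_feature_map(y, k, b)
print(len(phi_x), comb(d + k, k))              # both equal N = C(d + k, k)
print(phi_x @ phi_y, (b**2 + x @ y) ** k)      # the two values agree
```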

Consider

$$ P(x, y \sim \mathcal {U}(\mathbb {B}_d) : (\phi (x) - \phi (y), \phi (y) - c) \ge \theta ), $$

where \(c = \frac{1}{V_{d}^{{\text {ball}}}(1)} \int _{\mathbb {B}_d} \phi (z) dz\) is the mean of the data distribution in feature space. Then, expanding the inner product,

$$\begin{aligned}&(\phi (x) - \phi (y), \phi (y) - c) = \kappa (x, y) - \kappa (y, y) + \int _{\mathbb {B}_d} \frac{\kappa (y, z) - \kappa (x, z)}{V_{d}^{{\text {ball}}}(1)} dz \\\qquad&= (b^2 + x \cdot y)^k - (b^2 + \Vert y\Vert ^2)^k + \int _{\mathbb {B}_d} \frac{ (b^2 + y \cdot z)^k - (b^2 + x \cdot z)^k }{V_{d}^{{\text {ball}}}(1)} dz. \end{aligned}$$

Exploiting the spherical symmetry of \(\mathcal {U}(\mathbb {B}_d)\), we have

$$\begin{aligned}&\frac{1}{V_{d}^{{\text {ball}}}(1)} \int _{\mathbb {B}_d} (b^2 + x \cdot z)^k dz = \int _{-1}^{1} \frac{V_{d-1}^{{\text {ball}}}((1 - t^2)^{1/2})}{V_{d}^{{\text {ball}}}(1)} (b^2 + t\Vert x\Vert )^k dt = q(\Vert x\Vert ), \end{aligned}$$

for \(b \ge 1\), where \(q : [0, 1] \rightarrow \mathbb {R}\) is given by \( q(\Vert x\Vert ) := b^{2k} {}_2F_1\Big ( \frac{1 - k}{2}, -\frac{k}{2}; \frac{d}{2} + 1; \frac{\Vert x\Vert ^2}{b^4} \Big ), \) with \({}_2F_1\) denoting the hypergeometric function. Therefore \((\phi (x) - \phi (y), \phi (y) - c) \ge \theta \) if and only if

$$\begin{aligned} \cos (\beta (x, y))&\ge Q(\Vert x\Vert , \Vert y\Vert ) \end{aligned}$$

where \(\beta (x, y) = \arccos (\frac{(x, y)}{\Vert x\Vert \Vert y\Vert })\) denotes the angle between x and y, and

$$\begin{aligned} Q(s, t) := (st)^{-1} \Big ( \big ( \theta + (b^2 + t^2)^k + q(s) - q(t) \big )^{1/k} - b^2 \Big ). \end{aligned}$$

Geometric arguments show that for any \(\alpha \in [-1, 1]\),

$$ P(x, y \sim \mathcal {U}(\mathbb {B}_d) :\cos (\beta (x, y)) \ge \alpha \,|\,\Vert x\Vert , \Vert y\Vert ) = T_{d}^{{\text {cap}}}(\alpha ) $$

where \(T_{d}^{{\text {cap}}}(\alpha )\) denotes the proportion of the surface area of a unit sphere which falls within a spherical cap with opening angle \(\arccos (\alpha )\), given for \(d > 1\) by

$$ T_{d}^{{\text {cap}}}(\alpha ) = {\left\{ \begin{array}{ll} 0, &{}\alpha > 1, \\ \frac{1}{2}I_{(\sin (\arccos (\alpha )))^2}\Big ( \frac{d - 1}{2}, \frac{1}{2} \Big ), &{}\alpha \in [0, 1], \\ 1 - T_{d}^{{\text {cap}}}(-\alpha ), &{}\alpha \in (-1, 0), \\ 1, &{}\alpha \le -1, \end{array}\right. } $$

where \(I_x(a, b)\) is the regularised incomplete beta function, and for \(d = 1\) by

$$ T_{1}^{{\text {cap}}}(\alpha ) = {\left\{ \begin{array}{ll} 0 &{} \text {for } \alpha > 1, \\ \frac{1}{2} &{} \text {for } \alpha \in (-1, 1], \\ 1 &{} \text {for } \alpha \le -1. \end{array}\right. } $$

Let E be the event that \(x, y \sim \mathcal {U}(\mathbb {B}_d)\) are such that \(\cos (\beta ) \ge Q(\Vert x\Vert , \Vert y\Vert )\). Then, by the law of total probability,

$$ P(E) = \int _{0}^1 \int _{0}^1 P(E \,|\,\Vert x\Vert = s, \Vert y\Vert = t) \hat{p}(s) \hat{p}(t) ds dt, $$

where \(\hat{p}(t) = \frac{S_{d}^{{\text {ball}}}(t)}{V_{d}^{{\text {ball}}}(1)} = dt^{d-1}\) denotes the density associated with \(\Vert z\Vert \) for \(z \sim \mathcal {U}(\mathbb {B}_d)\).

The arguments above therefore prove the following theorem, from which Theorem 1 arises as a simplified special case when \(k = 1\).

Theorem 3

(Separability in polynomial feature space). Let \(k > 0\), let d be a fixed positive integer, and let \(\phi \) denote the feature map associated with the polynomial kernel (15) with degree k in dimension d. Then, for \(\theta \in \mathbb {R}\),

$$\begin{aligned}&P(x, y \sim \mathcal {U}(\mathbb {B}_d) : (\phi (x) - \phi (y), \phi (y) - c) \ge \theta ) \\ {}&\qquad = d^2 \int _{0}^1 \int _{0}^1 T_{d}^{{\text {cap}}}(Q(s, t)) s^{d-1} t^{d-1} ds dt. \end{aligned}$$

Fig. 5. The intrinsic dimension of the image of \(\mathcal {U}(\mathbb {B}_d)\) under a polynomial feature map, for different polynomial degrees and data space dimensions d.

Figure 5 shows how the intrinsic dimension of the unit ball in various dimensions is affected by applying a polynomial feature mapping. Since the degree k polynomial feature map \(\phi : \mathbb {R}^d \rightarrow \mathbb {R}^N\), where \(N = \left( {\begin{array}{c}d + k\\ k\end{array}}\right) \), increases the apparent dimension of the space as k increases, the rule of thumb encapsulated by the blessing of dimensionality would lead us to expect that high order polynomial kernels should make the data more separable. However, this is not what we observe. Instead, the intrinsic dimension reveals that there is an ‘optimal’ polynomial degree, for which the data is most separable, and increasing the polynomial degree beyond this point can actually have the detrimental effect of making the data less separable.
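For readers wishing to reproduce computations of this type, the double integral of Theorem 3 can be evaluated with standard quadrature and special-function routines. The sketch below (our own code, assuming SciPy; `hyp2f1` and `betainc` provide \({}_2F_1\) and \(I_x(a, b)\)) computes the \(\theta = 0\) probability and converts it to an intrinsic dimension as in Definition 1, for a few polynomial degrees.

```python
import numpy as np
from scipy.special import betainc, hyp2f1
from scipy.integrate import dblquad

def separability_probability(d, k, b, theta=0.0):
    """Numerically evaluate the double integral of Theorem 3."""
    def q(s):
        return b ** (2 * k) * hyp2f1((1 - k) / 2, -k / 2, d / 2 + 1, s * s / b**4)

    def T_cap(alpha):                     # surface fraction of a spherical cap
        if alpha > 1:
            return 0.0
        if alpha >= 0:
            return 0.5 * betainc((d - 1) / 2, 0.5, 1 - alpha * alpha)
        if alpha > -1:
            return 1.0 - T_cap(-alpha)
        return 1.0

    def Q(s, t):
        return ((theta + (b**2 + t * t) ** k + q(s) - q(t)) ** (1 / k) - b**2) / (s * t)

    integrand = lambda s, t: d * d * T_cap(Q(s, t)) * s ** (d - 1) * t ** (d - 1)
    return dblquad(integrand, 0, 1, 0, 1)[0]

# Intrinsic dimension of the image of U(B_d) under the degree-k feature map,
# read off from the theta = 0 probability as in Definition 1.
for k in (1, 2, 3, 5):
    p = separability_probability(d=5, k=k, b=1.1)
    print(k, -np.log2(p) - 1)
```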

5 Conclusion

We have introduced a new notion of the intrinsic dimension of a data distribution, based on the pairwise separability properties of data points sampled from this distribution. Alongside this, we have also introduced a notion of the relative intrinsic dimension of a data distribution relative to another distribution. Theorem 2 shows how these notions of intrinsic dimension occupy a fundamental position in the theory of learning, as they directly provide upper and lower bounds on the probability of successfully learning in a generalisable fashion.

Many open questions remain, however, such as how to accurately determine the intrinsic dimension of a data distribution using just sampled data from that distribution, and how best to utilise these insights to improve neural network learning. This work also opens the door to generalising the concept beyond simple linear functionals of the data distribution to notions of intrinsic dimensionality based around other, more interesting models. The idea also generalises beyond examining individual points sampled from distributions to studying the collective behaviour of groups, or ‘granules’, of sampled data.