1 Introduction

Both the efficiency and efficacy of fundamental operations in areas such as search and retrieval, data mining, and machine learning commonly depend on the interplay between measures of data similarity and the choice of features by which objects are represented. In settings where the number of features (the so-called representational dimension) is high, similarity values tend to concentrate strongly about their respective means, a phenomenon widely referred to as ‘the curse of dimensionality’. Consequently, as the dimensionality increases, the discriminative ability of similarity measures diminishes to a point where methods that depend on them lose their effectiveness (Weber et al. 1998; Beyer et al. 1999; Pestov 2000).

The representational dimension alone cannot explain the curse of dimensionality. This can be seen from the fact that the number of degrees of freedom within a subspace or manifold is independent of the dimension of the space in which it is embedded. This number is often described as the ‘intrinsic dimensionality’ of the manifold or subspace.

In an attempt to improve the discriminability of similarity measures, and the scalability of methods that depend on them, much attention has been given in the areas of machine learning, databases, and data mining to the development of dimensional reduction techniques. Linear techniques for dimensionality reduction include Principal Component Analysis (PCA) and its variants (Jolliffe 1986; Bouveyron et al. 2011). Non-linear dimensionality reduction methods—also known as manifold learning techniques—include Isometric Mapping (Tenenbaum et al. 2000), Multi-Dimensional Scaling (Tenenbaum et al. 2000; Venna and Kaski 2006), Locally Linear Embedding and its variants (Roweis and Saul 2000), Hessian Eigenmapping Spectral Embedding (Donoho and Grimes 2003), Local Tangent Space Alignment (Zhang and Zha 2004), and Non-Linear Component Analysis (Schölkopf et al. 1998). Most dimensional reduction techniques require that a target dimension be provided by the user, although some attempt to determine an appropriate dimension automatically. Ideally, the supplied dimension should depend on the intrinsic dimensionality (ID) of the data. This has served as a prime motivation for the development of models of ID, as well as accurate estimators.

Over the past few decades, many practical models of the intrinsic dimensionality of datasets have been proposed. Examples include the previously mentioned Principal Component Analysis and its variants (Jolliffe 1986; Bouveyron et al. 2011), as well as several manifold learning techniques (Schölkopf et al. 1998; Roweis and Saul 2000; Venna and Kaski 2006; Karhunen and Joutsensalo 1994). Topological approaches to ID estimate the basis dimension of the tangent space of the data manifold from local samples (Fukunaga and Olsen 1971; Bruske and Sommer 1998; Pettis et al. 1979; Verveer and Duin 1995). Fractal measures such as the Correlation Dimension (CD) estimate ID from the space-filling capacity of the data (Faloutsos and Kamel 1994; Camastra and Vinciarelli 2002; Gupta et al. 2003). Graph-based methods use the k-nearest neighbors graph along with density in order to measure ID (Costa and Hero 2004). Parametric modeling and estimation of distribution often allow for estimators of intrinsic dimension to be derived (Larrañaga and Lozano 2002; Levina and Bickel 2004).

The aforementioned intrinsic dimensionality measures can be described as ‘global’, in that they consider the dimensionality of a given set as a whole, without any individual object being given a special role. In contrast, ‘local’ ID measures are defined in this paper as those that involve only the k-nearest neighbor distances of a specific location in the space. Several local intrinsic dimensionality models have been proposed recently, such as the expansion dimension (ED) (Karger and Ruhl 2002), the generalized expansion dimension (GED) (Houle et al. 2012a), the minimum neighbor distance (MiND) (Rozza et al. 2012), and local continuous intrinsic dimension (which we will refer to here as LID) (Houle 2013). These models quantify ID in terms of the rate at which the number of encountered objects grows as the considered range of distances expands from a reference location.

In general, machine learning techniques that rely too strongly on local information can be accused of overfitting the data. This has motivated the development of global techniques for manifold learning such as Local Tangent Space Alignment (Zhang and Zha 2004), which first identifies manifolds restricted to neighborhoods of selected points, and then optimizes the alignment of these local structures in order to produce a more complex description of the data. The alignment process often involves an explicit penalty for overfitting. Local learning can thus compensate for overfitting by accounting for it in the final optimization process for the alignment of the local manifolds.

Local approaches can be very useful when data is composed of heterogeneous manifolds. In addition to applications in manifold learning, measures of local ID have been used in the context of similarity search, where they serve to assess the complexity of a search query (Karger and Ruhl 2002), or to control the early termination of search (Houle et al. 2012b, 2014). They have also found applications in outlier detection, in the analysis of a projection-based heuristic (Vries et al. 2012), and in the estimation of local density (von Brünken et al. 2015). The efficiency and effectiveness of the algorithmic applications of intrinsic dimensional estimation (such as Houle et al. 2012b, 2014) depend greatly on the quality of the estimators employed.

Distances from a query point can be seen as realizations of a continuous positive random variable. In this case, the smallest distances encountered would be ‘extreme events’ associated with the lower tail of the underlying distance distribution. In Extreme Value Theory (EVT), a discipline of statistics concerned with the study of tails of continuous probability distributions, the random variable associated with nearest neighbor distances can be assumed to follow a power-law distribution, where the exponent can be viewed as a form of dimension (Coles et al. 2001). Specifically, continuous lower-bounded random variables are known to asymptotically converge to the Weibull distribution as the sample size grows, regardless of the original distance measure and its distribution. In an equivalent formulation of EVT due to Karamata, the cumulative distribution function of a tail distribution can be represented in terms of a regularly-varying (RV) function whose dominant factor is a polynomial in the distance (Coles et al. 2001; Houle 2015); the degree (or ’index’) of this polynomial factor determines the shape parameter of the associated Weibull distribution, or equivalently the exponent of the associated power law. The index has been interpreted as a form of intrinsic dimension (Coles et al. 2001). Maximum likelihood estimation of the index leads to the well-known Hill estimator for power-law distributions (Hill 1975).

While EVT provides an asymptotic description of tail distributions, in the case of continuous distance distributions, the distribution can be exactly characterized in terms of LID (Houle 2015). The LID model introduces a function that assesses the discriminative power of the distribution at any given distance value (Houle 2013, 2015). A distance measure is described as ‘discriminative’ when an expansion in the distance results in a relatively small increase in the number of observations. This function is shown to fully characterize the cumulative distribution function without the explicit involvement of the probability density (Houle 2015). The limit of this function yields the shape parameter of the Weibull distribution (or equivalently, the Karamata representation index, or power-law exponent) associated with the lower tail. It is the estimation of this limit that is the main focus of this paper.

In addition to the more traditional applications stated earlier, LID has the potential for wide application in many machine learning and data mining contexts, as it makes no assumptions on the nature of the data distribution other than continuity. Moreover, the interpretation of continuous ID in terms of the indiscriminability of the distance measure naturally lends itself to the design of outlier detection techniques (von Brünken et al. 2015), and in the understanding of density-related phenomena such as the hubness of data (Radovanović et al. 2010a, b; Houle 2015).

The original contributions of this paper can be summarized as:

  • A framework for the estimation of local continuous intrinsic dimension (LID) using well-established techniques: maximum likelihood estimation (MLE), the method of moments (MoM), and the method of probability-weighted moments (PWM). In particular, we verify that applying MLE to LID leads to the well-known Hill estimator (Hill 1975).

  • A new family of estimators based on the extreme-value-theoretic notion of regularly varying functions. Several existing dimensionality models (ED, GED, and MiND) are shown to be special cases of this family.

  • An analysis of the variance and convergence of the estimators we propose, together with the corresponding confidence intervals.

  • An experimental study using artificial data and synthetic distance distributions, in which we compare our estimators with state-of-the-art global and local estimators. We also show that the empirical variance and convergence rates of the MLE (Hill) and MoM estimators are superior to those of the other local estimators studied.

  • Experiments showing that local estimators are more robust than global ones in the presence of noise in nonlinear manifolds. Our experiments show that our approaches are very competitive in this regard with other methods, both local and global.

  • An experimental study showing the effectiveness of LID estimation when using approximate nearest neighbors.

  • Profiles of several real-world datasets in terms of LID, illustrating the degree of variability of complexity from region to region within a dataset. The profiles demonstrate that a single ‘global’ ID value is in general not sufficient to fully characterize the complexity of real-world data.

A preliminary version of this work was published in Amsaleg et al. (2015).

The remainder of the paper is structured as follows. The next section provides a brief introduction to the framework of continuous ID. Subsequently, in Sect. 3, we explain the relationship between continuous ID and central results from the statistical discipline of extreme value theory. Using this theory, in Sect. 4 we propose and analyze several estimators of continuous ID, using maximum likelihood estimation (MLE, which yields the Hill estimator), the method of moments (MoM), probability-weighted moments (PWM), and regularly varying functions (RV). In Sect. 5 we describe our experimental framework, and in Sect. 6 we discuss the practical performance of our proposed estimators. We conclude the paper in Sect. 7 with a discussion of potential future applications.

2 Continuous intrinsic dimension

In this section, we survey local intrinsic dimensionality (LID), an extension of a well-studied model of intrinsic dimensionality to continuous distributions of distances proposed in Houle (2013). LID aims to quantify the local ID of a feature space exclusively in terms of the distribution of inter-point distances. Formally, let \(({\mathbb {R}}^{m},\mathrm {dist})\) be a domain equipped with a non-negative distance function \(\mathrm {dist}\). Let us consider the distribution of distances within the domain with respect to some fixed point of reference. We model this distribution in terms of a random variable \(\mathbf {X}\) with support \([0,\infty )\). \(\mathbf {X}\) is said to have probability density \(f_{\mathbf {X}}\), where \(f_{\mathbf {X}}\) is a non-negative Lebesgue-integrable function, if and only if

$$\begin{aligned} \mathsf {Pr}\left[ a \le \mathbf {X} \le b \right] = \int _{a}^{b} f_{\mathbf {X}}(x) \, \mathrm {d}x, \end{aligned}$$

for any \(a,b \in [0,\infty )\) such that \(a \le b\). The corresponding cumulative distribution function \(F_{\mathbf {X}}\) is canonically defined as

$$\begin{aligned} F_{\mathbf {X}}(x) = \mathsf {Pr}\left[ \mathbf {X} \le x \right] = \int _{0}^{x} f_{\mathbf {X}}(u) \, \mathrm {d}u. \end{aligned}$$

Accordingly, whenever \(\mathbf {X}\) is absolutely continuous at x, \(F_{\mathbf {X}}\) is differentiable at x and its first-order derivative is \(f_{\mathbf {X}}(x)\). For such settings, the local intrinsic dimension is defined as follows.

Definition 1

(Houle 2013) Given an absolutely continuous random distance variable \(\mathbf {X}\), for any distance threshold x such that \(F_{\mathbf {X}}(x) > 0\), the local continuous intrinsic dimension of \(\mathbf {X}\) at distance x is given by

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}}(x)&\triangleq \lim \nolimits _{\epsilon \rightarrow 0^{+}} \frac{\ln F_{\mathbf {X}} \left( (1 + \epsilon )x \right) - \ln F_{\mathbf {X}}(x)}{\ln (1 + \epsilon )}, \end{aligned}$$

wherever the limit exists.

With respect to the generalized expansion dimension (Houle et al. 2012a), a precursor of LID, the above definition of \(\mathrm {ID}_{\mathbf {X}}(x)\) is the outcome of a dimensional test of neighborhoods of radii x and \((1 + \epsilon )x\) in which the neighborhood cardinalities are replaced by the expected number of neighbors. LID also turns out to be equivalent to a formulation of the (lack of) discriminative power of a distance measure, as both formulations have the same closed form:

Theorem 1

(Houle 2013) Let \(\mathbf {X}\) be an absolutely continuous random distance variable. If \(F_{\mathbf {X}}\) is both positive and differentiable at x, then

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}}(x) = \frac{x f_{\mathbf {X}}(x)}{F_{\mathbf {X}}(x)}. \end{aligned}$$
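As a simple illustration of this closed form, consider distances that are uniformly distributed within a ball of radius w, so that \(F_{\mathbf {X}}(x) = (x/w)^{d}\) on [0, w] for some \(d > 0\). Theorem 1 then gives

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}}(x) = \frac{x f_{\mathbf {X}}(x)}{F_{\mathbf {X}}(x)} = \frac{x \cdot d \, x^{d-1} / w^{d}}{(x/w)^{d}} = d, \end{aligned}$$

so that the local ID recovers the dimension d at every distance \(x \in (0, w]\).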

Local ID has potential for wide application thanks to its very general treatment of distances as a continuous random variable. Direct estimation of \(\mathrm {ID}_{\mathbf {X}}\), however, requires knowledge of the distribution of \(\mathbf {X}\). Extreme value theory, which we survey in the following section, allows the estimation of the limit of \(\mathrm {ID}_{\mathbf {X}}(x)\) as x tends to 0 without any explicit assumptions on the data distribution other than continuity.

3 Extreme value theory

Extreme value theory is concerned with the modeling of what can be regarded as the extreme behavior of stochastic processes. It has seen applications in areas such as civil engineering (Harris 2001), operations research (Tryon and Cruse 2000; McNulty et al. 2000; Dahan and Mendelson 2001), risk assessment (Lavenda and Cipollone 2000), material sciences (Grimshaw 1993), bioinformatics (Roberts 2000), geophysics (Lavenda and Cipollone 2000), and multimedia (Furon and Jégou 2013).

We begin the introduction of extreme value theory with the following definition.

Definition 2

Let \(\mu , \xi \in {\mathbb {R}}\) and \(\sigma > 0\). The family of generalized extreme value distributions \({\mathcal {F}}_{\scriptscriptstyle \mathrm {GEV}}\) covers distributions whose cumulative distribution functions have the form

$$\begin{aligned} {\mathcal {F}}_{\scriptscriptstyle \mathrm {GEV}}= \left\{ \begin{array}{l l} \exp \left( - \left[ 1 + \xi \left( \frac{x - \mu }{\sigma }\right) \right] ^{- \frac{1}{\xi }} \right) &{} \text {if }\xi \ne 0 \\ \exp \left( - \exp \left( {-\frac{x-\mu }{\sigma }}\right) \right) &{} \text {if }\xi =0 \end{array} \right\} . \end{aligned}$$

A distribution \(G \in {\mathcal {F}}_{\scriptscriptstyle \mathrm {GEV}}\) has support \(\mathop {\mathrm {supp}}(G) = [\mu - \frac{\sigma }{\xi }, \infty )\) whenever \(\xi > 0\) and \(\mathop {\mathrm {supp}}(G) = (-\infty , \mu - \frac{\sigma }{\xi }]\) when \(\xi < 0\). If \(\xi = 0\), the support covers the complete real line.

Its best known theorem, attributed in part to Fisher and Tippett (1928) and to Gnedenko (1943), states that the maximum of n independent identically-distributed random variables (after proper renormalization) converges in distribution to a generalized extreme value distribution as n goes to infinity.

Theorem 2

(Fisher–Tippett–Gnedenko) Let \((\mathbf {X}_i)_{i \in {\mathbb {N}}}\) be a sequence of independent identically-distributed random variables and let \(\mathbf {M}_n = \max _{1 \le i \le n} \mathbf {X}_i\). If there exist a sequence of positive constants \((a_i)_{i \in {\mathbb {N}}}\) and a sequence of constants \((b_i)_{i \in {\mathbb {N}}}\) such that

$$\begin{aligned} \lim _{n \rightarrow \infty } \mathsf {Pr}\left[ \frac{\mathbf {M}_n - b_n}{a_n} \le x \right] = F(x), \end{aligned}$$

for every continuity point \(x \in {\mathbb {R}}\) of F, where F is a non-degenerate distribution function, then \(F \in {\mathcal {F}}_{\scriptscriptstyle \mathrm {GEV}}\).

Extreme value theory mainly draws its power from two major results: the Fisher–Tippett–Gnedenko Theorem stated above (Fisher and Tippett 1928; Gnedenko 1943), and the Pickands–Balkema–de Haan Theorem presented below (Balkema and De Haan 1974; Pickands 1975). The Fisher–Tippett–Gnedenko Theorem is useful when observing several samples, each containing n occurrences, in order to estimate the distribution of the sample maxima. In our setting, however, we have only one sample, and we are not interested in its extremum alone, but rather in a number of its extreme values. For this reason, we turn to two alternative approaches to the modeling of extremal behavior which are better suited to our application: threshold excesses, and regularly-varying functions.

3.1 Threshold excesses

Consider the following two definitions.

Definition 3

Let \(\xi \in {\mathbb {R}}\) and \(\sigma > 0\). The family of generalized Pareto distributions \({\mathcal {F}}_{\scriptscriptstyle \mathrm {GPD}}\) is defined by its cumulative distribution function

$$\begin{aligned} {\mathcal {F}}_{\scriptscriptstyle \mathrm {GPD}}= \left\{ \begin{array}{l l} 1 - \left( 1 + \frac{\xi x}{\sigma }\right) ^{- \frac{1}{\xi }} &{} \text {if }\xi \ne 0 \\ 1- e^{-\frac{x}{\sigma }} &{} \text {if }\xi =0 \end{array} \right\} . \end{aligned}$$

Every distribution \(G \in {\mathcal {F}}_{\scriptscriptstyle \mathrm {GPD}}\) has support \(\mathop {\mathrm {supp}}(G) = [0, \infty )\) when \(\xi \ge 0\), and \(\mathop {\mathrm {supp}}(G) = [0, -\frac{\sigma }{\xi }]\) when \(\xi < 0\).

Definition 4

Let \(\mathbf {X}\) be a random variable whose distribution \(F_{\mathbf {X}}\) has the upper endpoint \(x^{+} \in {\mathbb {R}} \cup \{ \infty \}\). Given \(w < x^{+}\), the conditional excess distribution \(F_{\mathbf {X ,w }}\) of \(\mathbf {X}\) is the distribution of \(\mathbf {X} - w\) conditioned on the event \(\mathbf {X} > w\):

$$\begin{aligned} F_{\mathbf {X ,w }}(x) = \frac{F_{\mathbf {X}}(w + x) - F_{\mathbf {X}}(w)}{1 - F_{\mathbf {X}}(w)}. \end{aligned}$$

We are now in a position to introduce a powerful theorem due to Balkema and De Haan (1974) and to Pickands (1975), which can be regarded as the counterpart to the central limit theorem for extremal statistics.

Theorem 3

(Pickands–Balkema–de Haan) Let \((\mathbf {X}_{i})_{i \in {\mathbb {N}}}\) be a sequence of independent random variables with identical distribution function \(F_{\mathbf {X}}\) satisfying the conditions of the Fisher-Tippett-Gnedenko Theorem. As \(w\rightarrow x^{+}\), \(F_{\mathbf {X ,w }}(x)\) converges to a distribution in \({\mathcal {F}}_{\scriptscriptstyle \mathrm {GPD}}\).

In the following we demonstrate a direct relation between local ID and extreme value theory, which arises as an implication of Theorem 3. Note that any choice of distance threshold w corresponds to a neighborhood of radius w based at the reference point, or equivalently, to the tail of the distribution of distances on [0, w). As discussed in Coles et al. (2001), Theorem 3 also applies to lower tails: one can reason about minima using the transformation \(\mathbf {Y} = {-\mathbf {X}}\). The distribution of the excess \(\mathbf {Y} - (-w)\) (conditioned on \(\mathbf {Y} > {-w}\)) then tends to a distribution in \({\mathcal {F}}_{\scriptscriptstyle \mathrm {GPD}}\), as w tends to the lower endpoint of \(F_{\mathbf {X}}\) located at zero (Nett 2014). Accordingly, as w tends to zero, the distribution in the tail [0, w) can be restated as follows (Coles et al. 2001).

Lemma 1

Let \(\mathbf {X}\) be an absolutely continuous random distance variable with support \([0,\infty )\) and cumulative distribution function \(F_{\mathbf {X}}\) such that \(F_{\mathbf {X}}(x)>0\) if \(x>0\). Let \(c\in (0,1)\) be an arbitrary constant. Let \(w > 0\) be a distance threshold, and consider x restricted to the range \([cw, w)\). As w tends to zero, the distribution of \(\mathbf {X}\) restricted to the tail \([cw, w)\) satisfies, for some fixed \(\xi < 0\),

$$\begin{aligned} \frac{ (x/w)^{- \frac{1}{\xi }} }{F_{\mathbf {X ,w }}(x)} \rightarrow 1. \end{aligned}$$

Note that the distribution of the excess distance \(w - \mathbf {X}\) is bounded from above by w, which, according to Coles et al. (2001), enforces that \(\xi < 0\).

Proof

Consider the distribution of threshold excess \(w - \mathbf {X}\) with \(\mathbf {X}\) restricted to \([cw, w)\). According to Theorem 3, \(w - \mathbf {X}\) asymptotically follows a generalized Pareto distribution:

$$\begin{aligned} \mathsf {Pr}\left[ w - \mathbf {X} \le y \right] \rightarrow 1 - \left( 1 + \frac{\xi y}{\sigma } \right) ^{- \frac{1}{\xi }}, \end{aligned}$$

with \(\sigma > 0\) and \(\xi < 0\), so that

$$\begin{aligned} \mathsf {Pr}\left[ \mathbf {X} \le w - y \right] \rightarrow \left( 1 + \frac{\xi y}{\sigma } \right) ^{- \frac{1}{\xi }}. \end{aligned}$$

Since a distance x corresponds to a threshold excess of \(y = w - x\),

$$\begin{aligned} F_{\mathbf {X ,w }}(x) = \mathsf {Pr}\left[ \mathbf {X} \le x \right] \rightarrow \left( 1 + \frac{\xi (w - x)}{\sigma } \right) ^{{- \frac{1}{\xi }}}. \end{aligned}$$

We see that \(F_{\mathbf {X ,w }}(0) = 0\) holds if and only if

$$\begin{aligned} \left( 1 + \frac{\xi w}{\sigma } \right) ^{{- \frac{1}{\xi }}} = 0, \end{aligned}$$

implying that \(\sigma = {-\xi } w\). With this additional constraint, the distribution of distances in the tail \([cw, w)\) simplifies to

$$\begin{aligned} \frac{ (x/w)^{- \frac{1}{\xi }} }{F_{\mathbf {X ,w }}(x)} \rightarrow 1. \end{aligned}$$

\(\square \)

To summarize, whenever Theorem 3 applies to a distance variable \(\mathbf {X}\), the cumulative distribution of distances within a radius-w neighborhood is asymptotically determined by a single parameter \(\xi < 0\). We can prove the following statement concerning LID.

Theorem 4

Let \(\mathbf {X}\) be an absolutely continuous random distance variable with support \([0,\infty )\), satisfying the conditions of Theorem 3, and \(w > 0\) be a distance threshold. Then, as w tends to zero,

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}}(w) \rightarrow {- \frac{1}{\xi }} =: \mathrm {ID}_{\mathbf {X}}. \end{aligned}$$

Proof

(Sketch only. For a more detailed and rigorous treatment, see Houle 2015.) Lemma 1 states that under the conditions of Theorem 4, the cumulative excess distribution \(F_{\mathbf {X ,w }}\) follows

$$\begin{aligned} \frac{ (x/w)^{- \frac{1}{\xi }} }{F_{\mathbf {X ,w }}(x)} \rightarrow 1 \end{aligned}$$

as the threshold w approaches zero. The probability density \(f_{\mathbf {X ,w }}\) in the tail of the distribution is obtained by taking the derivative with respect to x:

$$\begin{aligned} f_{\mathbf {X ,w }}(x) \approx \frac{\partial }{\partial x} \left( \frac{x}{w} \right) ^{{- \frac{1}{\xi }}} = {- \frac{1}{\xi w}} \left( \frac{x}{w} \right) ^{-\frac{1}{\xi } - 1}. \end{aligned}$$

This formulation relies on the smoothness properties of slowly varying functions (cf. Sect. 3.2 and Bingham et al. 1989).

Applying Theorem 1 gives

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}}(x) \approx \frac{x \cdot f_{\mathbf {X ,w }}(x)}{F_{\mathbf {X ,w }}(x)} \rightarrow {- \frac{1}{\xi }}. \end{aligned}$$

\(\square \)

Note that together Lemma 1 and Theorem 4 allow us to restate the asymptotic cumulative distribution of distances in the tail \([cw, w)\) as

$$\begin{aligned} \frac{ (x/w)^{\mathrm {ID}_{\mathbf {X}}} }{F_{\mathbf {X ,w }}(x)} \rightarrow 1. \end{aligned}$$
(1)

3.2 Regularly-varying functions

The Fisher-Tippett-Gnedenko Theorem and the Pickands-Balkema-de Haan Theorem have been shown to be equivalent to a third characterization of the tail behavior, in terms of regularly-varying (RV) functions. The asymptotic cumulative distribution of \(\mathbf {X}\) in the tail [0, w) can be expressed as \(F_{\mathbf {X}}(x) = x^{\kappa } \ell _{\mathbf {X}}(1/x)\), where \(\ell _{\mathbf {X}}\) is differentiable and slowly varying; that is, for all \(c>0, \ell _{\mathbf {X}}\) satisfies

$$\begin{aligned} \lim _{t\rightarrow \infty } \frac{\ell _{\mathbf {X}}(ct)}{\ell _{\mathbf {X}}(t)} = 1. \end{aligned}$$

\(F_{\mathbf {X}}\) restricted to [0, w) is itself said to be regularly varying with index \(\kappa \). In particular, a cumulative distribution \(F\in {\mathcal {F}}_{\scriptscriptstyle \mathrm {GEV}}\) has \(\xi <0\) if and only if F is RV and has a finite endpoint. Note that the slowly-varying component \(\ell _{\mathbf {X}}(1/x)\) of \(F_{\mathbf {X}}\) is not necessarily constant as x tends to zero; for example, \(\ell (t) = \ln t\) is slowly varying yet unbounded. For a detailed account of RV functions, we refer the reader to Bingham et al. (1989).

The following corollary is a straightforward consequence of Theorem 1 in Sect. 2.

Corollary 1

Let \(\mathbf {X}\) be a random distance variable restricted to [0, w) with distribution \(F_{\mathbf {X}}(x) = x^{\kappa } \ell _{\mathbf {X}}(1/x)\). As w tends to zero, the index \(\kappa \) converges to \(\mathrm {ID}_{\mathbf {X}}\).

Proof

The probability density function associated with \(F_{\mathbf {X}}\) is

$$\begin{aligned} f_{\mathbf {X}}(x) = \kappa x^{\kappa -1} \ell _{\mathbf {X}}(1/x) - x^{\kappa -2} \ell _{\mathbf {X}}'(1/x). \end{aligned}$$

From Theorem 1, we have

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}}(x) = \frac{x f_{\mathbf {X}}(x)}{F_{\mathbf {X}}(x)} = \kappa - \frac{1}{x} \frac{\ell _{\mathbf {X}}'(1/x)}{\ell _{\mathbf {X}}(1/x)}. \end{aligned}$$

The slowly varying property of \(\ell _{\mathbf {X}}\) ensures that

$$\begin{aligned} \lim _{x\rightarrow 0} \frac{1}{x} \frac{\ell _{\mathbf {X}}'(1/x)}{\ell _{\mathbf {X}}(1/x)}=0. \end{aligned}$$

Therefore, the index \(\kappa \) converges to \(\mathrm {ID}_{\mathbf {X}}\) as w tends to zero. \(\square \)

The following section is concerned with deriving estimators of the intrinsic dimension based on the asymptotic distribution of distances within neighborhoods stated in Eq. 1.

4 Estimation

This section is concerned with practical methods for the estimation of the local intrinsic dimension of a random distance variable \(\mathbf {X}\). In particular, we adapt known GPD parameter estimators such as the maximum-likelihood estimator (in Sect. 4.1) and moment-based estimators (in Sects. 4.2, 4.3), and propose a new family of estimators based on regularly varying functions (in Sect. 4.4).

For the remainder of this discussion, we assume that we are given a distance threshold \(w > 0\) and a sequence \(x_1, \ldots , x_{k}\) of independent observations of a random distance variable \(\mathbf {X}\) with support [0, w). Without loss of generality, we assume that the observations are given in ascending order—that is, \(x_1 \le x_2 \le \cdots \le x_{k}\).

4.1 Maximum likelihood estimation

Maximization of the likelihood function is one of the most widely used parameter estimation techniques in statistics. The Maximum Likelihood Estimator (MLE) has no optimality guarantees for finite samples, but has the advantage of being asymptotically consistent, optimal, and efficient (in that it achieves the Cramér-Rao bound).

Definition 5

Given a random variable X with parameter \(\theta \), the likelihood of \(\theta \) as a function of observations \(x_1, x_2, \ldots , x_{k}\) is defined as

$$\begin{aligned} L(\theta |\, x_1, \ldots , x_{k}) = \prod _{i=1}^{k} f(x_i |\, \theta )\,. \end{aligned}$$

Note that \(\theta \) can be multivariate. In the case of our study, we are interested in a single parameter of the distribution, namely the shape parameter \(\xi \) of the distribution of X.

Maximizing the likelihood function is mathematically equivalent to maximizing its logarithm. It is often more convenient to work with the ‘log-likelihood’ function defined as follows:

Definition 6

Given a random variable X with parameter \(\theta \), the log-likelihood of \(\theta \) as a function of observations \(x_1, x_2, \ldots , x_{k}\) is defined as

$$\begin{aligned} {\mathcal {L}}(\theta |\, x_1, \ldots , x_{k}) = \ln L(\theta |\, x_1, \ldots , x_{k}). \end{aligned}$$

Definition 7

Given a random variable X with parameter \(\theta \), and a set of observations \(x_1, x_2, \ldots , x_{k}\), the Maximum Likelihood Estimator (MLE) of \(\theta \) is the value for which \(L(\theta |\, x_1, \ldots , x_{k})\) is maximized:

$$\begin{aligned} {\hat{\theta }}= & {} \mathop {\mathrm{argmax}}\limits _{\theta } L(\theta |\, x_1, \ldots , x_{k}) \, = \, \mathop {\mathrm{argmax}}\limits _{\theta } {\mathcal {L}}(\theta |\, x_1, \ldots , x_{k}). \end{aligned}$$

For convenience, when the sample \(x_1, x_2, \ldots , x_{k}\) is understood, we will denote the likelihood and log-likelihood of \(\theta \) by \(L(\theta )\) and \({\mathcal {L}}(\theta )\), respectively.

By differentiating the asymptotic expression of the distance distribution given in Eq. 1, we obtain the associated probability density function \(f_{\mathbf {X ,w }}\):

$$\begin{aligned} f_{\mathbf {X ,w }}(x) = \frac{F_{\mathbf {X ,w }}(w)\,\mathrm {ID}_{\mathbf {X}}}{w} \left( \frac{x}{w}\right) ^{\mathrm {ID}_{\mathbf {X}}-1}. \end{aligned}$$

Then, for a given sample of neighborhood distances \(x_1, x_2, \ldots , x_{k}\), we see that the log-likelihood of \(\mathrm {ID}_{\mathbf {X}}\) is given by

$$\begin{aligned} {\mathcal {L}}(\mathrm {ID}_{\mathbf {X}})= & {} \ln \left[ \prod _{i=1}^{k}f_{\mathbf {X ,w }}(x_i)\right] \\= & {} \ln \left[ \prod _{i=1}^{k} \frac{F_{\mathbf {X ,w }}(w)\,\mathrm {ID}_{\mathbf {X}}}{w} \left( \frac{x_i}{w}\right) ^{\mathrm {ID}_{\mathbf {X}}-1} \right] \\= & {} {k} \ln \frac{F_{\mathbf {X ,w }}(w)}{w} + {k} \ln \mathrm {ID}_{\mathbf {X}} + (\mathrm {ID}_{\mathbf {X}} - 1) \sum _{i=1}^{k} \ln \frac{x_i}{w}. \end{aligned}$$

The first- and second-order derivatives of the log-likelihood function are respectively

$$\begin{aligned} {\mathcal {L}}'(\mathrm {ID}_{\mathbf {X}}) \, = \, \frac{k}{\mathrm {ID}_{\mathbf {X}}} + \sum \nolimits _{i=1}^{k} \ln \frac{x_i}{w} \quad \text {and} \quad {\mathcal {L}}''(\mathrm {ID}_{\mathbf {X}}) \, = \, - \frac{k}{\mathrm {ID}_{\mathbf {X}}^{2}} \,. \end{aligned}$$

Accordingly, the maximum-likelihood estimate \({\widehat{\mathrm {ID}}}_{\mathbf {X}}\) is

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\mathbf {X}} = {-\left( \frac{1}{k} \sum \nolimits _{i=1}^{k} \ln \frac{x_i}{w} \right) ^{-1}}, \end{aligned}$$

which follows the form of the well-known Hill estimator for the scaling exponent of a power-law tail distribution (Hill 1975).
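As a minimal computational sketch (the function name and interface below are our own, and NumPy is assumed), the estimate can be obtained directly from a sample of neighborhood distances, taking the threshold w to be the largest observed distance unless specified otherwise:

import numpy as np

def lid_mle(dists, w=None):
    # Hill / MLE estimate of local intrinsic dimension from distances in (0, w].
    x = np.asarray(dists, dtype=float)
    w = x.max() if w is None else float(w)
    x = x[(x > 0.0) & (x <= w)]
    # ID_hat = -( (1/k) * sum_i ln(x_i / w) )^(-1)
    return -1.0 / np.mean(np.log(x / w))

# example: distances distributed as U^(1/d) have LID equal to d
rng = np.random.default_rng(0)
sample = rng.uniform(size=100) ** (1.0 / 8.0)
print(lid_mle(sample))   # close to 8, up to sampling variation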

The variance is asymptotically given by the inverse of the Fisher information, defined as

$$\begin{aligned} I = \mathsf {E}\left[ {- \frac{\partial ^{2} {\mathcal {L}}(\mathrm {ID}_{\mathbf {X}})}{ \partial \, \mathrm {ID}_{\mathbf {X}}^{2}}} \right] = \frac{k}{\mathrm {ID}_{\mathbf {X}}^{2}}, \end{aligned}$$

where \(\mathsf {E}[\cdot ]\) denotes the expectation. Therefore, if the number of samples k is sufficiently large, we have \({\widehat{\mathrm {ID}}}_{\mathbf {X}} \sim {\mathcal {N}}(\mathrm {ID}_{\mathbf {X}},\mathrm {ID}_{\mathbf {X}}^{2} / \, {k})\). Accordingly, with probability \(1 - \beta \), a sample of k distances in [0, w) provides an estimate \({\widehat{\mathrm {ID}}}_{\mathbf {X}}\) lying within

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}} \pm \frac{\mathrm {ID}_{\mathbf {X}}}{\sqrt{k}} {\varPhi }^{-1} \left( 1 - \frac{\beta }{2} \right) . \end{aligned}$$

In other words, the \(1 - \beta \) confidence interval is

$$\begin{aligned} \left[ \frac{{\widehat{\mathrm {ID}}}_{\mathbf {X}}}{1 + {k}^{-1 / 2} {\varPhi }^{-1}( 1 - \beta / 2 )}, \frac{{\widehat{\mathrm {ID}}}_{\mathbf {X}}}{1 - {k}^{-1 / 2} {\varPhi }^{-1}( 1 - \beta / 2 )} \right] . \end{aligned}$$
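This interval is straightforward to compute; a small sketch (names ours, using only the Python standard library):

from statistics import NormalDist

def mle_confidence_interval(id_hat, k, beta=0.05):
    # (1 - beta) confidence interval around the MLE, following the formula above
    z = NormalDist().inv_cdf(1.0 - beta / 2.0)
    half = z / k ** 0.5
    return id_hat / (1.0 + half), id_hat / (1.0 - half)

print(mle_confidence_interval(id_hat=8.2, k=100, beta=0.05))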

4.2 Method of moments

For any choice of \({m} \in {\mathbb {N}}\), the m-th order non-central moment \(\mu _{m}\) of the random distance \(\mathbf {X}\) is

$$\begin{aligned} \mu _{m}&= \mathsf {E}\left[ \mathbf {X}^{m} \right] = \int _{x = 0}^{w} x^{m} f_{\mathbf {X}}(x) \, \mathrm {d}x = w^{m} \frac{\mathrm {ID}_{\mathbf {X}}}{\mathrm {ID}_{\mathbf {X}} + m}. \end{aligned}$$

Solving for the intrinsic dimension gives

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}} = -m \frac{\mu _{m}}{\mu _{m} - w^{m}} = g \left( \frac{\mu _{m}}{w^{m}} \right) , \end{aligned}$$

with \(g(x) = {m}\frac{x}{1-x}\). When estimating the order-m moment by its empirical counterpart \({\hat{\mu }}_{m} = \frac{1}{k} \sum _{i=1}^{k} x_i^{m}\), we see that \(\mathsf {E}[{\hat{\mu }}_{m}] = \mu _{m}\) and \( \mathsf {E}[{\hat{\mu }}_{m}^2] = (k \mu _{2m} + k(k-1) \mu _{m}^2){k}^{-2} \), so that

$$\begin{aligned} \mathsf {Var}[{\hat{\mu }}_{m}] = \frac{\mu _{2m} - \mu _{m}^2}{k} = \frac{w^{2m} \mathrm {ID}_{\mathbf {X}} {m}^2}{k (\mathrm {ID}_{\mathbf {X}} + 2m) (\mathrm {ID}_{\mathbf {X}} + {m})^2}. \end{aligned}$$

Assuming the convergence of the empirical moments, the distribution of \(\frac{{\hat{\mu }}_{m}}{w^{m}}\) is therefore asymptotically normal, with

$$\begin{aligned} \frac{{\hat{\mu }}_{m}}{w^{m}} \sim {\mathcal {N}}\left( \frac{\mathrm {ID}_{\mathbf {X}}}{\mathrm {ID}_{\mathbf {X}} + m}; \frac{\mathrm {ID}_{\mathbf {X}} {m}^2}{k (\mathrm {ID}_{\mathbf {X}} + 2m) (\mathrm {ID}_{\mathbf {X}} + m)^2} \right) . \end{aligned}$$

According to Rao (2009, Th. 6a2.9), if \(x \sim {\mathcal {N}}(\mu ; \sigma ^2 {k}^{-1})\) asymptotically, then \(g(x) \sim {\mathcal {N}}(g(\mu ); \sigma ^2{k}^{-1}g'(\mu )^2)\), where \(g'\) is the first-order derivative of g. Therefore, asymptotically

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\mathbf {X}} \sim {\mathcal {N}}\left( \mathrm {ID}_{\mathbf {X}}; \frac{\mathrm {ID}_{\mathbf {X}}^2}{k} \left( 1 + \frac{({m}/\mathrm {ID}_{\mathbf {X}})^2}{1 + 2{m}/\mathrm {ID}_{\mathbf {X}}} \right) \right) . \end{aligned}$$

This variance is monotonically increasing in \({m} / \mathrm {ID}_{\mathbf {X}}\), which indicates that we should use moments of small order m. When \({m} / \mathrm {ID}_{\mathbf {X}}\) tends to zero, the variance converges to \( {\mathrm {ID}_{\mathbf {X}}^2} / {k}\), the variance of the maximum-likelihood estimator (see Sect. 4.1). Note that an upper bound on \(\mathrm {ID}_{\mathbf {X}}\) implies that the variance is bounded; in this case we can derive confidence intervals similar to those of Sect. 4.1.
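A corresponding sketch of the method-of-moments estimator, using the low-order moment m = 1 suggested by the analysis above (the function name is ours):

import numpy as np

def lid_mom(dists, w=None, m=1):
    # MoM estimate: ID_hat = g(mu_hat_m / w^m) with g(x) = m * x / (1 - x)
    x = np.asarray(dists, dtype=float)
    w = x.max() if w is None else float(w)
    r = np.mean((x / w) ** m)     # empirical m-th moment, normalized by w^m
    return m * r / (1.0 - r)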

4.3 Probability-weighted moments

General probability-weighted moments are defined as

$$\begin{aligned} {\nu }_{{m},{p},{q}} = \mathsf {E}\left[ F_{\mathbf {X}}(\mathbf {X})^{m}(1-F_{\mathbf {X}}(\mathbf {X}))^{p} \mathbf {X}^{q} \right] . \end{aligned}$$

We restrict here our attention to a subfamily: for any choice of \({m} \in {\mathbb {N}}, \nu _{m}\) is defined as

$$\begin{aligned} \nu _{m}&\triangleq \mathsf {E}\left[ F_{\mathbf {X}}(\mathbf {X})^{m} \mathbf {X} \right] = \int _{x = 0}^{w} F_{\mathbf {X}}(x)^{m} x f_{\mathbf {X}}(x) \, \mathrm {d}x \, = \frac{\mathrm {ID}_{\mathbf {X}} \, w}{\mathrm {ID}_{\mathbf {X}} \, {m} + \mathrm {ID}_{\mathbf {X}} + 1}; \end{aligned}$$

solving for the intrinsic dimension yields

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}} = \frac{\nu _{m}}{w - \nu _{m} ({m} + 1)} = h \left( \frac{\nu _{m}}{w} \right) , \quad \text {where} \quad h(x) = \frac{x}{1 - ({m}+1)x}\,. \end{aligned}$$

According to Hosking and Wallis (1987) and Landwehr et al. (1979), a commonly-used estimator of the m-th probability-weighted moment of this form is

$$\begin{aligned} {\hat{\nu }}_{m} = \frac{1}{k} \sum _{i=1}^{k} \left( \frac{i - 0.35}{k} \right) ^{m} x_i. \end{aligned}$$

Analogously to the previous section, we can show that this estimator has variance

$$\begin{aligned} \mathsf {Var}[{\hat{\nu }}_{m}] = \frac{\mathrm {ID}_{\mathbf {X}} \, w^2}{{k} \, (\mathrm {ID}_{\mathbf {X}} \, {m} + \mathrm {ID}_{\mathbf {X}} + 1)^2 \, (2 \, \mathrm {ID}_{\mathbf {X}} \, {m} + \mathrm {ID}_{\mathbf {X}} + 2)}. \end{aligned}$$

Similarly, we find that asymptotically

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\mathbf {X}} \sim {\mathcal {N}} \left( \mathrm {ID}_{\mathbf {X}}; \frac{\mathrm {ID}_{\mathbf {X}}^2}{k} \left( 1 + \frac{(\mathrm {ID}_{\mathbf {X}} {m} + 1)^2}{\mathrm {ID}_{\mathbf {X}} (2 \mathrm {ID}_{\mathbf {X}} {m} + \mathrm {ID}_{\mathbf {X}} + 2)} \right) \right) . \end{aligned}$$

For \({m} = 0\), the variance is equivalent to that of the moment-based estimator with \({m} = 1\) (see Sect. 4.2). Since the variance increases monotonically with m for any fixed \(\mathrm {ID}_{\mathbf {X}}\), the use of lower-order probability-weighted moments is advisable.
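Analogously, a sketch of the probability-weighted-moment estimator based on the plotting-position estimate of \(\nu _{m}\) given above (function name ours):

import numpy as np

def lid_pwm(dists, w=None, m=1):
    # PWM estimate: nu_hat_m = (1/k) * sum_i ((i - 0.35)/k)^m * x_(i), then apply h(nu_hat_m / w)
    x = np.sort(np.asarray(dists, dtype=float))
    k = len(x)
    w = x[-1] if w is None else float(w)
    ranks = np.arange(1, k + 1)
    nu = np.mean(((ranks - 0.35) / k) ** m * x)
    r = nu / w
    return r / (1.0 - (m + 1) * r)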

4.4 Estimation using regularly varying functions

In this section we introduce an ad hoc estimator for the intrinsic dimensionality based on the characterization of distribution tails as regularly varying functions (as discussed in Sect. 3). Consider the empirical distribution function \({\hat{F}}_{\mathbf {X}}\), defined as

$$\begin{aligned} {\hat{F}}_{\mathbf {X}}(x) = \frac{1}{k} \sum \nolimits _{j=1}^{k} \llbracket x_j < x \rrbracket , \end{aligned}$$

where \(\llbracket \varphi \rrbracket \) refers to the Iverson bracket, which evaluates to 1 if \(\varphi \) is true, and 0 otherwise. We propose the following estimator for the index \(\kappa \) of \(F_{\mathbf {X}}\).

Definition 8

Let \(\mathbf {X}\) be an absolutely continuous random distance variable restricted to [0, w). The local intrinsic dimension \(\mathrm {ID}_{\mathbf {X}}\) can be estimated as

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\mathbf {X}} = {\hat{\kappa }} = \frac{\sum _{j=1}^{J} \alpha _j \ln \left[ {\hat{F}}_{\mathbf {X}}((1 + \tau _j \delta _{k})x_{k}) / {\hat{F}}_{\mathbf {X}}(x_{k}) \right] }{\sum _{j=1}^{J} \alpha _j \ln ( 1 + \tau _j \delta _{k} )}, \end{aligned}$$

under the assumption that \(x_{k}, \delta _{k} \rightarrow 0\) as \(n \rightarrow \infty \), where \((\alpha _j)_{1 \le j \le J}\) and \((\tau _j)_{1 \le j \le J}\) are sequences.

We will refer to this family of estimators as RV, for ‘regularly varying’. Note that since RV estimators involve only the products \(\tau _j \delta _{k}\) for \(1 \le j \le J\), we may assume without loss of generality that \(\tau _1 + \cdots + \tau _J = 1\). The estimators are based on the observation that, for all \(1 \le j \le J\),

$$\begin{aligned}&\ln \left[ F_{\mathbf {X}}((1 + \tau _j \delta _{k})x_{k}) / F_{\mathbf {X}}(x_{k}) \right] \\&\quad = \kappa \ln ( 1 + \tau _j \delta _{k} ) + \ln \left[ \ell _{\mathbf {X}}\left( 1/[(1 + \tau _j \delta _{k}) x_{k}]\right) / \ell _{\mathbf {X}}(1/x_{k}) \right] \\&\quad \simeq \kappa \ln (1 + \tau _j \delta _{k}). \end{aligned}$$

The RV family covers several of the known local estimators of intrinsic dimensionality. For the parameter choices \(J = 1\) and \( \epsilon = \tau \delta _{k}\), the RV estimator reduces to the GED formulation proposed in Houle et al. (2012a):

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\mathbf {X}} = \frac{\ln \left[ {\hat{F}}_{\mathbf {X}}((1 + \epsilon )x_{k}) / {\hat{F}}_{\mathbf {X}}(x_{k}) \right] }{\ln ( 1 + \epsilon )}. \end{aligned}$$

By setting \( \epsilon = 1\), Karger & Ruhl’s expansion dimension is obtained, while by setting \(x_{k}\) to be the distance to the k-th nearest neighbor and \(\epsilon \) such that \((1 + \epsilon )x_{k}\) equals the distance to the nearest neighbor, we obtain a special case of the MiND family (\(\mathrm {MiND}_{ml1}\)) (Rozza et al. 2012).

Alternatively, by setting \(J = {k}, \alpha _i = 1\) for all \(i \in [1..{k}] \), and choosing the vector \(\tau \) such that \(1+\tau _i \delta _{k} = \frac{x_i}{x_{k}} \), the RV estimator becomes

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\mathbf {X}} = \frac{\sum _{j=1}^{k} \ln \left[ j / {k} \right] }{\sum _{j=1}^{k} \ln \left[ x_j / x_{k} \right] } \approx \frac{\ln {\sqrt{2\pi {k}}} - {k}}{\sum _{j=1}^{k} \ln \left[ x_j / x_{k} \right] }. \end{aligned}$$

As \({k} \rightarrow \infty \), this converges to the MLE (Hill) estimator presented in Sect. 4.1, with \(w=x_{k}\).

We now turn our attention to an analysis of the variance of RV estimators. First, we introduce an auxiliary function which drives the speed of convergence of the estimator proposed in Definition 8. For \(x \in {\mathbb {R}}\), let \(\varepsilon _{\mathbf {X}}(x)\) be defined as

$$\begin{aligned} \varepsilon _{\mathbf {X}}(x) \triangleq \frac{x \ell _{\mathbf {X}}'(x)}{\ell _{\mathbf {X}}(x)}. \end{aligned}$$

In Alves et al. (2003b, 2003a), the auxiliary function is assumed to be regularly varying, and the estimation of the corresponding regular variation index is addressed. Within this article, so as to prove the following results, we limit ourselves to the assumption that \(\varepsilon _{\mathbf {X}}\) is ultimately non-increasing.

Theorem 5

Let \(\mathbf {X}\) be a random distance variable over [0, w) with distribution function \(F_{\mathbf {X}}(x) = x^{\kappa }\ell _{\mathbf {X}}(1/x)\), and let \(\tau _{\max } \triangleq \max _{1 \le j \le J} \tau _j\). Furthermore, let \(\delta _{k}, x_{k} \rightarrow 0\) so that \({k} \, F_{\mathbf {X}}(x_{k})\delta _{k} \rightarrow \infty \) and \(\sqrt{{k} F_{\mathbf {X}}(x_{k}) \delta _{k}} \varepsilon _{\mathbf {X}}(1 / [(1 + \tau _{\max } \delta _{k})x_{k}]) \rightarrow 0\) as k approaches infinity. If the auxiliary function \(\varepsilon _{\mathbf {X}}\) is ultimately non-increasing, then \(\sqrt{{k} F_{\mathbf {X}}(x_{k}) \delta _{k}} \cdot [\mathrm {ID}_{\mathbf {X}} - {\widehat{\mathrm {ID}}}_{\mathbf {X}}]\) converges to a centered Gaussian with variance

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}} V_{\alpha ,\tau } = \mathrm {ID}_{\mathbf {X}} \frac{\alpha ^{\top } S \alpha }{\left( \alpha ^{\top }\tau \right) ^2}, \end{aligned}$$

where \(S_{a,b} = (|\tau _a| \wedge |\tau _b|) \llbracket \tau _a \tau _b > 0 \rrbracket \) for \((a,b) \in \{ 1, \dots , J \}^2\). (\(A \wedge B\) denotes the minimum of A and B.)

Note that the requirement \({k} F_{\mathbf {X}}(x_{k}) \delta _{k} \rightarrow \infty \) can be interpreted as a necessary and sufficient condition for the almost sure presence of at least one distance sample in the interval \([x_{k}, (1 + \tau _j \delta _{k}) x_{k})\). In addition, the condition

$$\begin{aligned} \sqrt{{k} F_{\mathbf {X}}(x_{k}) \delta _{k}} \, \varepsilon _{\mathbf {X}}(1/[(1+\tau _{\max } \delta _{k}) x_{k}]) \rightarrow 0 \end{aligned}$$

enforces that the approximation bias \(\varepsilon _{\mathbf {X}}( 1 / [(1+ \tau _{\max } \delta _{k})x_{k}])\) is negligible compared to the standard deviation of the estimate, \(1 / \sqrt{{k} F_{\mathbf {X}}(x_{k}) \delta _{k}}\). We continue the analysis by proposing choices of \(\alpha \) that minimize the variance in Theorem 5.

Lemma 2

The weight vector \(\alpha = (\alpha _1, \dots , \alpha _J)^{\top }\) minimizing \(V_{\alpha ,\tau }\) is proportional to \(\alpha _0 = S^{-1}\tau = (1, 0, \dots , 0)^{\top }\), and the associated optimal variance is given by \(V_0(\tau ) = \left( \tau ^{\top } S^{-1} \tau \right) ^{-1}\).

Proof

The maximum of the Rayleigh functional \(\alpha ^{\top } \tau \tau ^{\top } \alpha \left( \alpha ^{\top } S \alpha \right) ^{-1}\) is known to be attained when \(\alpha \) is proportional to the eigenvector associated with the largest eigenvalue of \(S^{-1}\tau \tau ^{\top }\). Since \(S^{-1}\tau \tau ^{\top }\) is a rank-one matrix, the eigenvector corresponding to the unique non-zero eigenvalue is \(S^{-1}\tau \). Without any loss of generality, we permute the entries of the vector \(\tau \) such that \(\tau _a < \tau _b\) for all \(a < b\). Asymptotically, we have \(0< \tau _1< \cdots < \tau _J\). Noting that the first column of the matrix S is \((\tau _1, \tau _2, \ldots , \tau _J)^{\top }\), we can infer that the vector \((1, 0, \dots , 0)^{\top }\) is a solution of the equation \(S \alpha _0 = \tau \). Since S is invertible, the solution \(\alpha _0\) must be unique. We therefore conclude that \(\alpha _0 = (1, 0, \dots , 0)^{\top }\). \(\square \)

For the case \(J = 1\), we see that \(\tau = (1)^{\top }\) and \(V_0(1) = 1\). This indicates that the GED minimizes the variance of estimation. However, different choices can be made regarding the weight vector \(\tau \) and regarding the criterion to use in order to optimize the choice of \(\alpha \). Minimizing variance is one choice explored in this paper, but other criteria can be used. In general, however, the following confidence interval holds for RV estimators:

Lemma 3

Let \(\beta \in (0,1)\), and assume that the assumptions of Theorem 5 hold with \(\alpha = S^{-1} \tau \). Let \(u_{\beta } = {\varPhi }^{-1}((1 + \beta ) / 2)\), where \({\varPhi }\) is the cumulative distribution function of the standard Gaussian distribution. Then

$$\begin{aligned} \mathrm {ID}_{\mathbf {X}} \pm u_{\beta } \left( {k} \delta _{k} V_0(\tau ) {\widehat{\mathrm {ID}}}_{\mathbf {X}} {\hat{F}}_{\mathbf {X}}(x_{k}) \right) ^{-1/2} \end{aligned}$$

are the boundaries of the asymptotic confidence interval of level \(\beta \) for \({\widehat{\mathrm {ID}}}_{\mathbf {X}}\).

Proof

Lemma 3 is a direct consequence of the asymptotic distribution established in Theorem 5 and the convergence of \({\hat{F}}_{\mathbf {X}}(x_{k})\) to \(F_{\mathbf {X}}(x_{k})\) as \({k} \rightarrow \infty \). \(\square \)

5 Experimental framework

As part of the evaluation of our estimators of local intrinsic dimension, we investigate their performance (as well as that of competing estimators) on a series of data distributions, both real and artificially generated. While trials involving real application data are primarily of practical interest, the study of artificial data allows us to systematically assess the ability of the individual methods to identify data dimensionality.

5.1 Methods

The methods used in this study include MLE, MoM, PWM, and RV. For all estimators, the neighborhood size is set to \(k=100\). The RV estimators are evaluated for the choices \(J = 1\) and \(J = 2\), as follows:

$$\begin{aligned} {\widehat{\mathrm {ID}}}_{\textsc {RV}} = \left\{ \begin{array}{ll} \frac{\ln {k} - \ln ( {k} / 2 )}{\ln x_{k} - \ln x_{\lfloor {k} / 2 \rfloor }}, &{} \text {if } J = 1 \\ \frac{\ln ( {k} / j ) - (p - 1) \ln ( i / j )}{\ln x_{k} / x_j + (p - 1) \ln x_i / x_j}, &{} \text {if } J = 2, \end{array} \right. \end{aligned}$$

where \(p = (x_i - 2 x_j + x_{k}) / (x_{k} - x_j), i = \lfloor {k} / 2 \rfloor \), and \(j = \lfloor 3{k} / 4 \rfloor \). Note that the estimator RV for \(J = 1\) is a form of generalized expansion dimension (GED) (Houle et al. 2012a). For every dataset, we report the average of ID estimates across all the points in the dataset. All estimators in our study can be computed in time linear in the number of sample points.
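For concreteness, the \(J = 1\) variant above can be computed as follows (a sketch with our own function name; the input distances are assumed to be the k smallest, given in ascending order):

import numpy as np

def lid_rv_j1(dists):
    # RV estimator with J = 1: (ln k - ln(k/2)) / (ln x_k - ln x_{floor(k/2)})
    x = np.sort(np.asarray(dists, dtype=float))
    k = len(x)
    num = np.log(k) - np.log(k / 2.0)            # equals ln 2
    den = np.log(x[-1]) - np.log(x[k // 2 - 1])  # x_{floor(k/2)} as a 1-based order statistic
    return num / den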

Our experimental framework includes several state-of-the-art estimators of intrinsic dimensionality, both local and global. The global estimators consist of a projection method (PCA), fractal methods (CD; Camastra and Vinciarelli 2002; Hein and Audibert 2005; Takens 1985), and graph-based methods (\(\text {kNNG}_{1}, \text {kNNG}_{2}\); Costa and Hero 2004). The local distance-based estimators are \(\text {MiND}_{ml1}\) and \(\text {MiND}_{mli}\) (Rozza et al. 2012). Table 1 summarizes the parameter choices for every method, except for the fractal methods, which do not involve any parameters.

The MiND variants make more restrictive assumptions than our methods: they assume the data to be uniformly distributed on a hypersphere, with a locally isometric smooth map between the hypersphere and the representational space. MiND uses only the two extreme samples (smallest and largest), and requires knowledge of the dimension of the representational space (D). In contrast, our approach assumes only that the nearest neighbor distances are in the lower tail of the distance distribution, where EVT estimation can be performed.

5.2 Artificial distance distributions

In the following we propose a set of experiments concerning artificial data, and describe the method employed for the generation of test data.

First, consider a point \(\mathbf {P}\) drawn uniformly at random from within the d-dimensional unit sphere, for some choice of \({d} \in {\mathbb {N}}\). According to the method of normal variates, we define \(\mathbf {P} = \mathbf {Z}^{1 / {d}} \mathbf {Y} \Vert \mathbf {Y} \Vert ^{-1}\), where \(\mathbf {Z}\) is uniformly distributed on [0, 1], and \(\mathbf {Y}\) is a random vector in \({\mathbb {R}}^{d}\) whose coefficients follow the standard normal distribution. The distance of \(\mathbf {P}\), with respect to our choice of reference point at location \(0 \in {\mathbb {R}}^{d}\), is distributed as follows:

$$\begin{aligned} \mathbf {X} = \frac{\Vert \mathbf {Z}^{1 / {d}} \mathbf {Y} \Vert }{\Vert \mathbf {Y} \Vert } = \mathbf {Z}^{1 / {d}}. \end{aligned}$$

Note that, by measuring LID purely based on distance values with respect to a reference point, the model does not require that the data have an underlying spatial representation. As such, non-integer values of \(d \in {\mathbb {R}}\) can be selected for the generation of distances, if desired.

For choices of \(d \in \{ 1, 2, 4, 8, 16, 32, 64, 128 \}\), we draw 100 independent sequences of sample distance values from the distribution described above, and record the estimates produced by each of our methods for sample sizes n between 10 and \(10^{4}\).
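A sketch of this generation scheme (our own code, following the derivation above): since the distance to the reference point is distributed as \(\mathbf {Z}^{1 / {d}}\), uniform variates can be transformed directly, and any of the estimators of Sect. 4 can then be applied to the resulting samples.

import numpy as np

rng = np.random.default_rng(42)

def sample_distances(d, n):
    # distances from the centre of the d-dimensional unit ball: Z**(1/d) with Z ~ U(0, 1)
    return rng.uniform(size=n) ** (1.0 / d)

for d in [1, 2, 4, 8, 16, 32, 64, 128]:
    x = sample_distances(d, 10_000)
    est = -1.0 / np.mean(np.log(x / x.max()))   # MLE (Hill) estimate from Sect. 4.1
    print(d, round(est, 2))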

Table 1 Parameter choices used in the experiments

5.3 Artificial data

The datasets used in our experiments have been proposed in Rozza et al. (2012). They consist of 15 manifolds of various structures and intrinsic dimensionalities (d), represented in spaces of different dimensions (D). They are summarized in Table 2.

Table 2 Artificial datasets used in the experiments

These datasets were generated in different sizes (\(10^3, 10^4\), and \(10^5\) points) in order to evaluate the effect of the number of points on the quality of the different estimators. For each dataset and for each of the three sizes, we average the estimates over 20 instances.

In order to evaluate the robustness of the estimators, we also prepared versions of these datasets with noise added. For each attribute f, we added normally-distributed noise with mean equal to zero and standard deviation \(\sigma _n = p\cdot \sigma _f\) where \(\sigma _f\) is the standard deviation of the attribute itself, and \(p \in \{0.01, 0.04, 0.16, 0.64\}\). For attributes with \(\sigma _f=0\), the noise was generated with standard deviation \(\sigma _n = p\cdot \sigma _f^*\) where \(\sigma _f^*\) is the minimum of the nonzero standard deviations over all attributes.
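A sketch of this noise model, under the assumption that the data is stored as an array with one row per point and one column per attribute (names and layout are our own):

import numpy as np

def add_noise(data, p, rng):
    # zero-mean Gaussian noise with per-attribute std p * sigma_f;
    # constant attributes fall back to p * (minimum nonzero attribute std)
    data = np.asarray(data, dtype=float)
    sigma = data.std(axis=0)
    fallback = sigma[sigma > 0].min()
    sigma = np.where(sigma > 0, sigma, fallback)
    return data + rng.normal(scale=p * sigma, size=data.shape)

rng = np.random.default_rng(0)
noisy = add_noise(rng.random((1000, 10)), p=0.04, rng=rng)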

5.4 Real data

Not only can a reliable estimation of ID greatly benefit the practical performance of many applications (Karger and Ruhl 2002; Beygelzimer et al. 2006; Houle et al. 2012b), it also serves as a characterization of high-dimensional datasets and the potential problems associated with their use in practice. To this end, we investigate the distribution of LID estimates on the following datasets, each taken from a real-world application scenario.

  • The ALOI (Amsterdam Library of Object Images) data set contains a total of 110,250 color photos of 1000 different objects taken from varying viewpoints under various illumination conditions. Each image is described by a 641-dimensional vector of color and texture features (Boujemaa et al. 2001).

  • The ANN_SIFT1B dataset consists of one billion 128-dimensional SIFT descriptors randomly selected from the dataset ANN_SIFT, consisting of \(2.8 \times 10^{10}\) SIFT descriptors extracted from \(3 \times 10^{7}\) images. These sets have been created for the evaluation of nearest-neighbor search strategies at very large scales (Jégou et al. 2011).

  • BCI5 (Millán 2004) is a brain-computer interface dataset in which the classes correspond to brainwave readings taken while the subject contemplated one of three different actions (movement of the right hand, movement of the left hand, and the subvocalization of words beginning with the same letter).

  • Gisette (Guyon et al. 2004) is a subset of the MNIST handwritten digit image dataset (LeCun et al. 1998), consisting of 50-by-50-pixel images of the highly confusable digits ’4’ and ’9’. 2500 random features were artificially generated and added to the original 2500 features, so as to embed the data into a higher-dimensional feature space. As the dataset was created for the NIPS 2003 feature selection challenge, the precise generation mechanism of the random features was not made public.

  • Isolet (Cole and Fanty 1990) is a set of 7797 human voice recordings in which 150 subjects recite each of the 26 letters of the alphabet twice. Each entry consists of 617 features representing selected utterances of the recording.

  • The MNIST database (LeCun et al. 1998) consists of 70,000 images of handwritten digits. The images have been normalized and discretized to a \(28 \times 28\)-pixel grid. The gray-scale values of the resulting 784 pixels are used to form the feature vectors.

5.5 Approximate nearest neighbors


For many datasets, various approximate nearest neighbor (ANN) methods can generate neighborhood sets much faster than would be possible using an exact indexing method. As a rule, with ANN indexing methods it is possible to influence the trade-off between accuracy and time complexity by means of parameter choices at query time, design choices at construction time, or both. However, the use of approximate neighborhood information can lead to a degradation in the quality of data statistics that rely on it. In particular, the question arises as to how the quality of LID estimates is affected when they are computed from distance samples generated from approximate neighborhoods of diminishing accuracy. In this part of the experimental study, we investigate the relationship between the accuracy of neighborhood sets and the accuracy of LID estimates. Here, accuracy is measured as the proportion of distance samples in the exact neighborhood that also appear in the approximate neighborhood under consideration. Under the assumption that the exact and approximate neighborhoods all have the same size k, this notion of accuracy coincides with those of both recall and precision.

For any given dataset, we can generate approximate k-NN sets with carefully controlled levels of accuracy, through the sparsification of exact neighbor sets of size greater than k. The sparsification is done in two steps. In the first step, we randomly select a proportion of the exact k nearest neighbors at the desired level of accuracy. In the second step, we complete the approximate list with neighbors drawn from outside the exact k-NN list, in such a way that the selection rate matches the accuracy. More precisely, let r be the target level of accuracy, expressed as a proportion between 0 and 1. Initially, the approximate neighborhood distance sample is constructed by randomly selecting \(\lfloor rk\rfloor \) elements (without replacement) from among the first k elements in the exact k-NN set. Next, an additional \(k-\lfloor rk\rfloor \) elements are randomly selected from among those ranked between \(k+1\) and \(K=\lceil k/r\rceil \) in the exact K-NN set, and their distances are added to the sample (a sketch of this procedure is given after the accuracy bounds below). With this choice of K, the accuracy of the approximate k-NN query result is almost identical to that of the K-NN query result:

  • for neighbors ranked between 1 and \(\lfloor rk\rfloor \), the accuracy is \(\lfloor rk\rfloor /k\), where

    $$\begin{aligned} r \cdot \left( 1-\frac{1}{k}\right) < \frac{\lfloor rk\rfloor }{k} \le r, \end{aligned}$$
  • for neighbors ranked between 1 and \(\lceil k/r\rceil \), the accuracy is \(k/\lceil k/r\rceil \), where

    $$\begin{aligned} r \cdot \left( 1-\frac{1}{k+1}\right) < \frac{k}{\lceil k/r\rceil } \le r. \end{aligned}$$

As k increases, these upper and lower bounds converge to r.
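The following sketch implements the sparsification procedure under our reading of the description above (index conventions and names are ours): given the exact K-NN distances sorted in ascending order, with \(K=\lceil k/r\rceil \), it returns an approximate k-NN distance sample with accuracy approximately r.

import math
import numpy as np

def sparsify_knn(exact_dists, k, r, rng):
    # exact_dists: exact neighbor distances in ascending order, of length at least ceil(k/r)
    K = math.ceil(k / r)
    keep = math.floor(r * k)                   # survivors drawn from the true top-k
    inside = rng.choice(exact_dists[:k], size=keep, replace=False)
    extra = k - keep                           # the rest is filled from ranks k+1 .. K
    outside = rng.choice(exact_dists[k:K], size=extra, replace=False) if extra > 0 else []
    return np.sort(np.concatenate([inside, outside]))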

In our experiments, to observe the effect of using ANN on LID values, we use MLE estimation with \(k=100\). The accuracy r is chosen from the range 0.5 to 1.0, since for these values, the maximum size of the exact neighborhoods required for the experimentation is a manageable \(2k = 200\).

5.6 Nearest neighbor descent

The computational and storage costs associated with the construction of an exact k-nearest neighbor graph (similarity graph) are a limitation for many machine learning algorithms. Particularly in high-dimensional settings, the cost of generating all exact k-nearest neighbor lists can be quadratic in the number of data objects, which for large datasets can be prohibitively high. Many approximation methods exist for the construction of approximate nearest neighbor (ANN) graphs, with computation costs much lower than those of exact methods, though at the expense of accuracy.

We conducted an experiment to show that the process of obtaining the neighborhoods necessary for LID estimation can be considerably accelerated using a state-of-the-art ANN method, with little or no effect on LID estimates. From among the many ANN algorithms available, we chose the Nearest Neighbor Descent (NN-Descent) algorithm (Dong et al. 2011) for our experimentation. The NN-Descent algorithm is based on the assumption of transitivity of the similarity measure—in other words, that two neighbors of a given data object are also likely to be neighbors of one another. As shown in the pseudo-code description of Algorithm 1, all points are initially associated with randomly built ‘k-NN lists’, which are then iteratively updated. At every iteration, a pivot element q is selected, and each possible pair (u, v) of q’s neighbors is considered for mutual updates. If the distance \(\mathrm {dist}(u,v)\) is smaller than the distance to the last element in u’s k-NN list, then the list is updated by inserting v in the appropriate location. The same test is applied to the k-NN list of v. In addition, similar tests are applied to the reverse (inverted) k-NN list of q. The algorithm converges when a pivot selection round completes without any updates being made to the k-NN lists. As recommended in the original paper, we modified the convergence condition so as to terminate after a maximum of 7 rounds of the loop in lines 5–10.
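As an illustration of this local-join strategy, the following is a minimal Python sketch of NN-Descent under the assumptions that points are given as coordinate tuples and that Euclidean distance is used; the function name is ours, and the sampling and flagging optimizations of the full algorithm (Dong et al. 2011) are omitted.

```python
import random
from math import dist  # Euclidean distance between coordinate sequences (Python 3.8+)

def nn_descent(points, k, max_rounds=7):
    """Simplified NN-Descent sketch: refine random k-NN lists by neighbor-of-neighbor joins.

    Each k-NN list is a sorted list of (distance, index) pairs.
    """
    n = len(points)
    knn = []
    for i in range(n):
        candidates = random.sample([j for j in range(n) if j != i], k)
        knn.append(sorted((dist(points[i], points[j]), j) for j in candidates))

    for _ in range(max_rounds):
        updated = False
        for q in range(n):                            # pivot selection
            forward = {j for _, j in knn[q]}
            reverse = {i for i in range(n) if any(j == q for _, j in knn[i])}
            neighbors = list(forward | reverse)
            for a in range(len(neighbors)):           # each pair (u, v) of q's neighbors
                for b in range(a + 1, len(neighbors)):
                    u, v = neighbors[a], neighbors[b]
                    d = dist(points[u], points[v])
                    for x, y in ((u, v), (v, u)):
                        # Insert y into x's k-NN list if it beats the current worst entry.
                        if d < knn[x][-1][0] and all(j != y for _, j in knn[x]):
                            knn[x][-1] = (d, y)
                            knn[x].sort()
                            updated = True
        if not updated:                               # a full round with no updates
            break
    return knn
```

In a practical implementation the reverse lists would be maintained incrementally rather than recomputed at every pivot; the version above favors brevity over efficiency. On termination, `knn[i]` holds the approximate k nearest neighbors of point i together with their distances, which can be passed directly to the LID estimators.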

Fig. 1

Comparison of the mean and standard deviation of LID estimates provided by MLE, MoM and RV (for \(J = 1\) and \(J = 2\)) on increasingly large samples drawn from artificially-generated distance distributions. The results cover target dimensionality values between 1 and 128. The values are marked in the corresponding plots. a \(\mathrm {ID}= 1\). b \(\mathrm {ID}= 2\). c \(\mathrm {ID}= 4\). d \(\mathrm {ID}= 8\). e \(\mathrm {ID}= 16\). f \(\mathrm {ID}= 32\). g \(\mathrm {ID}= 64\). h \(\mathrm {ID}= 128\)

6 Experimental results

6.1 Artificial distance distributions

We begin our experimental study with an assessment—in terms of bias, variance, and convergence—of the ability of each estimator to identify the ID of a sample of distance values generated according to different choices of target ID. Note that for these trials, the distributional model asserted in Lemma 1 holds everywhere on the range [0, w) by construction (with \(w = 1\)).

Figure 1 shows the behavior of MLE, MoM, and RV (for choices of \(J = 1\) and \(J = 2\), as stated in Sect. 5.1). The convergence to the target ID value observed in every case empirically confirms the consistency of these estimators. Likewise, PWM is consistent; however, one should beware of PWM’s susceptibility to the effects of numerical instability.
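To make the setup concrete, the sketch below draws distance samples from the ideal model \(F(x) = (x/w)^{\mathrm{ID}}\) by inverse-transform sampling and applies two of the estimators. It assumes the standard Hill-type form of the MLE and the method-of-moments identity \(\mathrm{E}[X] = \mathrm{ID}\cdot w/(\mathrm{ID}+1)\) implied by the model; the function names are ours, and the exact estimator definitions are those given in the earlier sections.

```python
import numpy as np

def sample_distances(target_id, n, w=1.0, rng=None):
    """Draw n distances from the ideal model F(x) = (x / w)**target_id on [0, w)."""
    rng = rng if rng is not None else np.random.default_rng()
    return w * rng.random(n) ** (1.0 / target_id)

def id_mle(dists, w):
    """Hill-type MLE: -1 / mean(log(x_i / w)) over the distances within threshold w."""
    return -1.0 / np.mean(np.log(dists / w))

def id_mom(dists, w):
    """Method of moments: E[X] = ID*w/(ID+1) implies ID = mean / (w - mean)."""
    mu = np.mean(dists)
    return mu / (w - mu)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for target in (1, 2, 4, 8, 16, 32, 64, 128):
        x = sample_distances(target, n=100, rng=rng)
        print(target, round(id_mle(x, 1.0), 2), round(id_mom(x, 1.0), 2))
```

With samples of this size, the printed estimates fluctuate around their respective targets, mirroring the convergence behavior summarized in Fig. 1.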

We also note that the RV estimator with \(J = 1\) (GED)—which asymptotically minimizes variance according to Lemma 2—is not the choice that minimizes variance when the number of samples is limited. Faster initial convergence favors the choice of MLE and MoM for applications where the number of available query-to-neighbor distances is limited, or where time complexity is an issue.

6.2 Artificial data

In Tables 3 and 4, for each of the estimators considered in this study, we present ID estimates for the artificial datasets, averaged over 20 runs each. It should be noted that as PCA and \(\mathrm {MiND}_{mli}\) estimates are restricted to integer values, their bias is lower for examples having integer ground-truth intrinsic dimension, especially when this dimensionality is small. Also, unlike the other estimators tested, MiND estimators require that an upper bound on the ID be supplied (set to D in these experiments). PCA requires a threshold parameter to be supplied, the value of which can greatly influence the estimation.

The experimental results indicate that local estimators tend to over-estimate dimensionality in the case of non-linear manifolds (sets m3, m4, m5, m6, m7, m8, m11 and m13) and to under-estimate it in the case of linear manifolds (sets m1, m2, m9, m10a, m10b, m10c and m12). The experimental results with higher sampling rates confirm the reduction in bias that would be expected with smaller k-nearest-neighbor distances, as the local manifold structure more closely approximates the tangent space.

Table 3 Average ID estimates for 1000-point manifolds using 100 nearest neighbors
Table 4 Average ID estimates for 10,000-point manifolds using 100 nearest neighbors

For highly non-linear manifolds, such as the Swiss Roll (m7) or the Möbius band (m11), global estimators have difficulty in identifying the intrinsic dimension. As one might expect, the local estimators (the proposed LID estimators and MiND) are more accurate in such cases. Although high local curvature is reflected in the distance distribution, and consequently in the local dimensional estimates as well, the effect is much smaller than for global estimators. With a higher sampling rate, k-nearest neighbor distances are diminished, and the curvature becomes locally less significant. The local manifold structure tends toward that of its tangent space, reducing the bias of local estimation. We also note that the bias grows with the intrinsic dimensionality of the manifold: as dimensionality increases, a higher sampling rate is required in order to reduce the bias.

Table 5 Deviation of ID estimates for 10,000-point-manifolds with added noise (\(p=0.01\)) using 100 nearest neighbors

To show the effects of noise on the estimators, we display in Tables 5, 6 and 7, for each method, the deviation of every estimate in the presence of noise as a proportion of the estimate obtained in the absence of noise. On the one hand, we note that global methods, k-NNG in particular, are significantly affected by noise: their estimates diverge very quickly as noise is introduced. On the other hand, the local estimators display more resistance to noise in the case of non-linear manifolds; among the local estimators, our EVT estimators tend to outperform the MiND variants.

We note that the additive noise considered in this experiment does not drastically impact the intrinsic dimensionality in the case of hypercubes (sets m10a, m10b and m10c). This explains why PCA appears resistant to noise for these sets. However, noise in these manifolds may drive points far from their original positions, which may explain the relatively high estimates obtained by the local intrinsic dimensionality estimators on these sets.

Table 6 Deviation of ID estimates for 10,000-point-manifolds with added noise (\(p=0.04\)) using 100 nearest neighbors
Table 7 Deviation of ID estimates for 10,000-point-manifolds with added noise (\(p=0.16\)) using 100 nearest neighbors

The robustness of local estimation is of great importance for many applications, such as search and outlier detection. The resistance to noise appears to be generally higher for manifolds of higher intrinsic dimensionality. It is important that our estimates can be trusted on these complex manifolds, where the concentration effect is more pronounced. For datasets of smaller intrinsic dimensionality, our noise model raises the dimensionality aggressively, a situation that rarely arises in real-world settings.

6.3 Real data

Based on our experiments on synthetic data, we expect our proposed estimators to be largely in agreement with one another. Accordingly, for clarity of presentation, we show results only for the MLE estimator in the experimentation on real data.

For each of the datasets considered in this study, Fig. 2 illustrates the distribution of LID estimates based at reference points drawn from the data. Due to its large size, for the ANN_SIFT1B dataset the reference set was generated by selecting \(10^4\) items uniformly at random; for the other datasets, the entire dataset was used as the reference set. We observe clear differences in the distribution of LID values among the datasets; for example, the center and spread of the LID estimates for ALOI are considerably lower than those obtained for the other datasets, whereas the LID estimates for Gisette are clearly higher. More precisely, we observe mean values of \(\mu _{\textsc {ALOI}} = 4.4\), \(\mu _{\textsc {ANN\_SIFT1B}} = 12.3\), and \(\mu _{\textsc {Gisette}} = 49.4\), with corresponding standard deviations of \(\sigma _{\textsc {ALOI}} = 3.5\), \(\sigma _{\textsc {ANN\_SIFT1B}} = 3.0\), and \(\sigma _{\textsc {Gisette}} = 12.4\). It should be noted that the measured ID within the neighborhoods tested is far smaller than the dimension of the full feature spaces. By plotting the same data as histograms in Fig. 3, we can furthermore see that the individual distributions of LID values differ in kurtosis and skewness as well.
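For reference, per-point LID values of this kind can be computed along the following lines, here sketched with scikit-learn’s exact k-NN search and the MLE estimator; the function name and the convention of using the k-th neighbor distance as the tail threshold w are assumptions of this sketch rather than prescriptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle_per_point(X, k=100):
    """MLE (Hill-type) LID estimate for every point of X from its k-NN distances."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)      # column 0 is each point itself (distance 0)
    dists = dists[:, 1:]             # keep the k true neighbor distances
    w = dists[:, -1:]                # k-th NN distance used as the tail threshold
    return -1.0 / np.mean(np.log(dists / w), axis=1)

# Example: summarize the distribution of LID values for a dataset X.
# lids = lid_mle_per_point(X, k=100)
# print(lids.mean(), lids.std())
```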

Figure 3 shows that the LID estimates for the Gisette dataset are very high compared to those of the other five sets. In particular, they are much higher than the LID values for MNIST, the original dataset from which Gisette was constructed. It is clear from the LID histograms that the addition of artificial noise features in Gisette drastically inflates the LID values in the dataset, revealing that the generation mechanism underlying these noise features is very different from that of real-world data. Although this generation mechanism was not revealed by the creators of Gisette, local intrinsic dimension—as a measure of the subspace-filling capacity of the data—is capable of differentiating between artificial noise and natural noise.

Fig. 2

Plots of the distribution of LID values across each dataset. The LID values were obtained using the MLE estimator on the size-100 neighborhoods of the individual reference points

Fig. 3

Histograms of LID values across each dataset, obtained using the MLE estimator on the size-100 neighborhoods of the individual reference points

For the ANN_SIFT1B dataset, from among the points of interest highlighted in the scatter plot in Fig. 2, A, B and C correspond to the objects for which the three lowest LID values were estimated (\(\mathrm {ID}_{A} \approx 2.8\), \(\mathrm {ID}_{B} \approx 3.1\), and \(\mathrm {ID}_{C} \approx 2.4\)). Likewise, the objects corresponding to D, E and F achieved the three greatest \(\mathrm {ID}\) values, at \(\mathrm {ID}_{D} \approx 31.5\), \(\mathrm {ID}_{E} \approx 30.1\), and \(\mathrm {ID}_{F} \approx 25.7\). The object G was chosen because its associated dimensionality estimate (\(\mathrm {ID}_{G} \approx 12.3\)) is closest to the mean. We subsequently investigated the distribution of distances in the neighborhoods of these points so as to gain a better understanding of why the corresponding dimensionality estimates take such low, high, or average values.

The most striking difference among the individual points of interest is in the distances to their respective k-nearest neighbors. Figure 4a displays, for each point of interest, the distribution of neighbor distances for all values of k between 1 and 1000. Interestingly, the ID measured at the points of interest appears to be associated with other properties of the respective objects. For example, the distributions of neighbor distances for the objects with high corresponding dimensionality (D, E and F) indicate that these points are in some sense outliers. On the other hand, despite their distance distributions being quite dissimilar, the LID values measured at A, B, and C are nearly identical.

Fig. 4

Distribution of \(\mathrm {ID}_{\mathrm {MLE}}\) estimates and distance values across neighborhoods around the points of interest. a Illustration of the distribution of k-nearest neighbor distances for \(k \in [1,1000]\) with respect to 7 points of interest. b Distribution of LID estimates based on k-nearest neighbor sets for \(k \in [10,1000]\) with respect to 7 points of interest

6.4 Approximate nearest neighbors

This set of experiments shows that using approximate neighbors reduces the overall computation time of LID estimation at the cost of an increase in bias. In an approximate k-NN query result, only a certain proportion of the observed distance values (equal to the accuracy of the result) correspond to distance values associated with members of an exact k-NN result. The distances associated with the approximate result can be regarded as having been generated by first sampling the dataset, and then taking the distance values associated with the exact k-NN set with respect to the sample. The bias of the LID estimates for the approximate neighborhood can therefore be regarded as the result of a sparsification of the available distance information.

Table 8 Average ID (MLE) estimates and their standard deviation for 1000-point manifolds using 100 approximate nearest neighbors with controlled recall
Table 9 Average ID (MLE) estimates and their standard deviation for 10,000-point manifolds using 100 approximate nearest neighbors with controlled recall

The results presented in Tables 8 and 9 show that using distances drawn from approximate neighborhoods does not lead to significant changes in estimated LID values, provided that the accuracy of the neighborhoods is reasonably high. In fact, for the datasets studied, the change in estimated LID values did not exceed 18% of the ground truth intrinsic dimension in the worst case, even with a neighborhood accuracy of 50%.

We observe that for each of the datasets, the observed bias decreases as the neighborhood accuracy increases: a higher accuracy always corresponds to a lower bias, although the relationship is not linear. We also observe that the sign of the bias depends on the curvature of the underlying manifolds within which the datasets are distributed. This trend is clear even when only 1000 points were generated within the manifolds (see Table 8). The bias is positive for the non-convex sets (m4, m5, m7, and m13). For these sets of high curvature, distance sparsification has a proportionally greater effect on the smaller distances than it does when the manifolds are linear. When proportionally more of the smaller distance values are lost than of the larger distance values, the LID estimates would be expected to rise.

It is important to note that estimation over neighborhoods of size 100 within a dataset of size 1000 is not in line with the asymptotic assumptions of EVT, since the neighborhood here can hardly be viewed as being derived from an extreme lower tail of the underlying distribution. However, estimation over neighborhoods of size 100 within a dataset of size 10,000 would be expected to lead to more stable results, due to the much smaller ratio of the neighborhood set size to the full dataset size. This is borne out by the experimental results shown in Table 9, where it can be seen that the approximation of neighborhood distance values has very little effect on the quality of ID estimation.

Fig. 5

Execution time and accuracy of NN-Descent compared with exact nearest neighbor computation, for 20 runs on the 10,000-point datasets

Fig. 6

Execution time and accuracy of NN-Descent compared with exact nearest neighbor computation, for 20 runs on the 100,000-point datasets

For the artificial datasets, NN-Descent, as a representative ANN method, achieves extremely high accuracies while providing useful speedups over sequential search (especially for the larger datasets). As seen in Figs. 5 and 6, average accuracies range between 99.9982 and 100%, while the average execution time is between 3 and 8 times faster than exact k-NN computation for sets of 10,000 points, and between 15 and 41 times faster for sets of 100,000 points. Under these conditions, the LID estimates for all artificial datasets included in this experiment remain unchanged. For the datasets of size 1000 or less, the execution cost of NN-Descent is dominated by the overheads associated with the underlying data structures. However, as shown in Fig. 6 for datasets of 100,000 points, the benefit of estimating LID with approximate neighborhoods quickly becomes apparent as the dataset size rises.

For the real-world datasets, NN-Descent also achieves very high accuracies, while providing substantial speedups over exact nearest neighbor computation. Average accuracies in all cases were at least 94.5%, as can be seen from Table 10. On the small datasets, NN-Descent accelerates the computation of nearest neighbors by no more than a factor of 2; for these sets, the time gain is limited by the overheads of maintaining the data structures required for NN-Descent. On the large datasets of this study, approximate nearest neighbors are obtained up to 151 times faster than exact nearest neighbors. Due to the high accuracy of the neighborhoods, LID estimates remain essentially unchanged for all datasets except ANN_SIFT1B, where they deviate by only \(-\,1.82\%\) from their original values. For most machine learning applications, such small changes in LID values would likely have little or no impact on the usefulness of the estimates.

Through these experiments, we can conclude that the use of approximate nearest neighbor computation allows LID estimation to be effectively applied at large scales. LID estimation can therefore be a viable option even for those machine learning and data mining applications where scalability is an important issue.

Table 10 Effect of using NN-Descent

7 Conclusion

Our experimental results on synthetic data show that for all of the estimators of LID that we propose, the estimation stabilizes for sample sizes on the order of 100. However, for Theorem 3 to be applicable, one must set a sufficiently small threshold on the lower tail of the distribution, which may severely limit the number of samples falling within the tail. Although there is a conflict between the accuracy of the estimator and the validity of the model, this conflict is resolved as the size of the dataset scales upward; it is in precisely such situations where the applications of ID have the most impact.

For situations where exact neighborhood information is impractical to compute, our experimental results show that LID estimation is effective even when only approximate neighborhood information is available. Consequently, learning machines that exploit LID values need not suffer from the high computational cost associated with the computation of exact neighborhoods.

Estimates of local ID constitute a measure of the complexity of data. Along with other indicators such as contrast (Shaft and Ramakrishnan 2006), LID could give researchers and practitioners more insight into the nature of their data, and thereby help them improve the efficiency and efficacy of their applications. As a tool for guiding learning processes, the proposed estimators could serve in many ways. For example, data objects encountered during retrieval could be automatically filtered out as noise whenever they are associated with an unusually high ID value; in this way, the quality of query results may be enhanced as well.

The performance of content-based retrieval systems is usually assessed in terms of the precision and recall of queries on a ground-truth dataset. However, in high-dimensional settings it is often the case that some points are much less likely to appear in a query result than others. Unlike LID, conventional measures of complexity or performance do not account for this difficulty. LID therefore has the potential to aid in the design of fair benchmarks that truly reflect the power of retrieval systems, according to a sound, mathematically grounded procedure.