1 Introduction

1.1 Probability Density and Dimensionality

Modelers in computer science disciplines such as artificial intelligence, machine learning, and pattern recognition typically base their analyses on a distributional view of their data sources, either in terms of parametric approaches involving standard distributions, or through nonparametric approaches.

For applications in which the data is continuous in nature, and in the supporting domains of continuous statistical modeling and probability theory, the concept of probability density is ubiquitous. In continuous domains, the absolute probability associated with any single event is typically zero, due to the infinite number of possible outcomes. For this reason, theoreticians and practitioners alike have been concerned with the relative likelihood of generation of one given sample value compared to another, as taken over infinitesimally-small volumes of positive probability measure that contain the values of interest. These density values, when integrated over the domain, account for its full probability measure, which is 1 by definition.

Not all distributions of interest admit a probability density function (pdf). Perhaps the most famous and important example is that of mixture distributions derived from a collection of random variables. Sampling from a mixture distribution involves a two-stage process, in which a distribution is first selected from the mixture according to some probability associated with it, and then a value is generated from the selected distribution. Although each of the constituent distributions of the mixture can be associated with a probability density function, in general the mixture distribution as a whole may not be. As an illustrative example, consider the situation in Fig. 1, in which a sample is drawn, with equal probability, from one of three uniform distributions: over a 1-dimensional unit interval, over a 2-dimensional unit square, or over a 3-dimensional unit cube. Each of the three pdfs of the mixture, when integrated over its own domain (interval, square, or cube), would return a value of 1. However, taking the 2-D pdf of the square domain and integrating it over the 1-D interval domain would produce a result of 0, whereas its integration over the 3-D cube would diverge to \(+\infty \). Accordingly, it is not possible to devise a single probability density function capable of representing this mixture distribution as a whole.

Fig. 1. Application of a 2-D pdf in uniform distributions in 1-D, 2-D, and 3-D domains. A 2-D pdf expresses probability density as a ratio between probability mass and area. Since the 1-D interval has area 0, and since the 3-D cube has infinite area, integration of the 2-D pdf over the interval and the cube (in terms of the infinitesimal product \(\mathrm {d}y\,\mathrm {d}x\)) would produce the values 0 and \(+\infty \), respectively.
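The two-stage sampling process described above is straightforward to simulate. The following minimal sketch (our illustration, not code from the paper) draws from the mixture of Fig. 1 by first selecting a component and then sampling uniformly from it; note that no single fixed-dimensional pdf covers all three branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture():
    # Stage 1: select the 1-D, 2-D, or 3-D component with equal probability.
    d = rng.integers(1, 4)  # d is 1, 2, or 3
    # Stage 2: sample uniformly from the selected interval, square, or cube.
    return rng.uniform(0.0, 1.0, size=d)

samples = [sample_mixture() for _ in range(5)]
print([s.shape for s in samples])  # mixed dimensionalities: (1,), (2,), (3,)
```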

Standard techniques for the direct estimation of probability density (such as multivariate kernel density estimation [35]) can be problematic when the data dimensionality is assumed to be large and fixed. As the example in Fig. 1 shows, standard formulations of probability density as an integrand over the distribution domain explicitly depend on the dimensionalities of the infinitesimal volumes over which the integration is performed. In settings where the number of features is very large, data analysis models that rely on forms of density estimation can exhibit bias in terms of the number of features that are relevant to individual localities within the domain, and the degree of correlation and other interactions among these features. Variations in the number of relevant features from locality to locality within the domain, together with vast differences between the (local) relevant feature dimension and the total (global) feature dimension, can greatly degrade the effectiveness of the distributional model in data analysis applications.

1.2 Volume, Dimensionality, and Similarity

Although direct estimation of probability density can be problematic, data analysis that relies on a comparison between two probability densities can (explicitly or implicitly) benefit from modeling in terms of their ratios. In recent years, density ratio estimation has been well-studied as an important area in its own right  [34], and has been widely adopted throughout AI and machine learning. In addition to being used to compare the relative concentration of data between two locations within a common distribution, it has also served as the foundation of popular measures of the similarity between two data distributions on compatible domains, such as the Kullback-Leibler (KL) divergence  [27] and other f-divergences  [28]. Although direct estimation of ratios has been shown to be inherently biased for small sample sizes  [9, 33], division of one probability density by another has the advantage of producing a unitless quantity independent of local dimensional assumptions. Nevertheless, it should be noted that KL-divergence and other distributional similarity measures typically integrate these ratios of probability density over the entire domain (of one of the two distributions), thereby reintroducing fixed-dimensional infinitesimal volumes into the data model.
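For reference, the KL divergence between two distributions with pdfs p and q integrates the logarithm of their density ratio over the support of p:

$$ \mathrm {KL}(p \,\Vert \, q) \,\, = \,\, \int p(\mathbf {x}) \, \ln \frac{p(\mathbf {x})}{q(\mathbf {x})} \,\mathrm {d}\mathbf {x} \, , $$

so although the ratio itself is unitless, the fixed-dimensional infinitesimal volume \(\mathrm {d}\mathbf {x}\) reappears in the integration.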

The domains of continuous distributions are typically equipped with a similarity (or dissimilarity) measure through which the interrelationships among data are modeled and assessed. In many situations in machine learning and data mining, modelers and practitioners tend to view similarity-based density estimation as an acceptable proxy for volume-based density estimation. However, in general this is not the case: volume tends to increase not linearly with the distance radius, but with that radius raised to the power of the data dimension.
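As a concrete illustration of this disparity (a minimal sketch of our own; the uniform data, radii, and sample sizes are illustrative assumptions), doubling a neighborhood radius multiplies the captured probability mass by a factor of roughly \(2^d\) rather than 2:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 5):
    # 200,000 points drawn uniformly from the d-dimensional cube [-0.5, 0.5]^d.
    X = rng.uniform(-0.5, 0.5, size=(200_000, d))
    dist = np.linalg.norm(X, axis=1)
    r = 0.2
    mass_r = np.mean(dist <= r)         # mass captured by the ball of radius r
    mass_2r = np.mean(dist <= 2 * r)    # mass captured by the ball of radius 2r
    print(d, mass_2r / mass_r, 2 ** d)  # the empirical ratio tracks 2**d
```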

In this paper, we explicitly acknowledge this disparity between radius and volume by investigating the issue of density ratios from the perspective of the recent Local Intrinsic Dimensionality (LID)  [15, 16] model. LID is a distributional form of intrinsic dimensional modeling in which the volume of a ball of radius r is taken to be the probability measure associated with its interior, denoted by F(r). The function F can be regarded as the cumulative distribution function (cdf) of an underlying distribution of distances.

Theoretical properties of LID in multivariate analysis have been studied recently [17]. LID and related expansion-based measures of intrinsic dimensionality [18] have also seen applications in such areas as similarity search [3, 8, 19,20,21,22], dependency analysis [31], feature selection and ranking [4, 13, 23], outlier detection and its analysis [11, 24, 32], and deep learning [29, 30]. Practical estimators of LID have been developed based on interpoint distances within neighborhood samples [1, 2], which trace their roots to the estimation of the generalized Pareto distribution (GPD) shape parameter [14] developed within the statistical discipline of extreme value theory (EVT).
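For concreteness, the following is a minimal sketch of the distance-based MLE (Hill-type) estimation of LID described in [1, 2], using the k-th neighbor distance as the tail threshold; the data, neighborhood size, and query point are our own illustrative assumptions.

```python
import numpy as np

def lid_mle(knn_dists: np.ndarray) -> float:
    """MLE estimate of LID at a query point from its k-NN distances,
    using the k-th (largest) distance as the tail threshold."""
    r = np.sort(knn_dists)
    # ID_hat = -1 / mean(log(r_i / r_k)); the i = k term contributes zero.
    return -1.0 / np.mean(np.log(r / r[-1]))

# Usage: estimate LID at the origin of a uniform sample in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20_000, 3))
knn = np.sort(np.linalg.norm(X, axis=1))[:100]  # distances to the 100 NNs
print(lid_mle(knn))  # close to 3 for locally 3-dimensional data
```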

1.3 Overview

The goal of this paper is to introduce a theoretical foundation that can serve for modeling density ratios in terms of LID, without recourse to the traditional formulation of probability density. After first briefly surveying the LID model in the next section, in Sect. 3 we present several results relating the limits of ratios of functions with properties appropriate to density ratio modeling: namely, the functions are assumed to be smooth (continuously differentiable), positive, vanishing at zero, and of the form of a cdf of an induced distance distribution relative to a reference location in a distribution of interest. In each theoretical statement, the local intrinsic dimensionality is shown to influence the density ratio model in a natural way. The paper is then concluded in Sect. 4.

2 Local Intrinsic Dimensionality

In this section, we give a brief overview of the extreme-value-theoretic LID model of intrinsic dimensionality first introduced in  [15]. For more information on the model and its connections to the statistical theory of extreme values, please see [10, 16, 17].

The LID model falls into the expansion family of intrinsic dimensional estimation [18, 26]. Like earlier expansion models, LID draws its motivation from the relationship between volume and radius in an expanding ball around points of interest. Unlike these models, however, LID interprets volume as a function of the same form as a univariate cumulative distribution function (cdf), representing the probability measure F(r) captured by a ball of radius r. Although motivated by applications involving the distribution of distance values, the model formulation (as stated in [16]) generalizes this notion even further, to any smooth real-valued function that is non-zero in the vicinity of some \(r\ne 0\).

Definition 1

([16]). Let F be a real-valued function that is non-zero over some open interval containing \(r\in \mathbb {R}\), \(r\ne 0\). The intrinsic dimensionality of F at r is defined as follows whenever the limit exists:

$$ \mathrm {IntrDim}_{F}(r) \,\, \triangleq \,\, \lim _{\epsilon \rightarrow 0^{+}} \frac{\ln \left( F((1{+}\epsilon )\cdot r) \,/\, F(r) \right) }{\ln \left( (1{+}\epsilon )\cdot r \,/\, r \right) } \,\, = \,\, \lim _{\epsilon \rightarrow 0^{+}} \frac{\ln \left( F((1{+}\epsilon )\cdot r) / F(r) \right) }{\ln (1{+}\epsilon )} . $$

Under the same assumptions on F, when F can be interpreted as the cdf of a distance distribution, the definition of LID can be regarded as equivalent to a notion of indiscriminability. Intuitively, if an underlying distance measure is indiscriminative at a given distance r, then expanding the distance by some small factor should incur a relatively large increase in probability measure as a proportion of the current value, F(r). Accordingly, the indiscriminability of the distance variable is defined as the limit of the ratio of two quantities: the proportional rate of increase of F(r), and the proportional rate of increase in r.

Definition 2

([16]). Let F be a real-valued function that is non-zero over some open interval containing \(r\in \mathbb {R}\), \(r\ne 0\). The indiscriminability of F at r is defined as follows whenever the limit exists:

$$ \mathrm {InDiscr}_{F}(r) \,\, \triangleq \,\, \lim _{\epsilon \rightarrow 0^{+}} \left( \frac{F((1{+}\epsilon )\cdot r) - F(r)}{F(r)} \right) \Big / \left( \frac{(1{+}\epsilon )\cdot r - r}{r} \right) \,\, = \,\, \lim _{\epsilon \rightarrow 0^{+}} \frac{F((1{+}\epsilon )\cdot r) - F(r)}{\epsilon \cdot F(r)} . $$

When F satisfies certain smoothness conditions in the vicinity of r, its intrinsic dimensionality and indiscriminability have been shown to be identical:

Theorem 1

([16]). Let F be a real-valued function that is non-zero over some open interval containing \(r\in \mathbb {R}\), \(r\ne 0\). If F is continuously differentiable at r, then

$$ {\text {ID}}_F(r) \triangleq \frac{r\cdot F'(r)}{F(r)} = \mathrm {IntrDim}_{F}(r) = \mathrm {InDiscr}_{F}(r) . $$

Let \(\mathbf {x}\) be any reference location within a data domain \(\mathcal {D}\) equipped with a distance measure d. To any point \(\mathbf {y}\in \mathcal {D}\) we can associate the distance \(r=d(\mathbf {x},\mathbf {y})\); in this way, any global data distribution over \(\mathcal {D}\) induces a local distance distribution with respect to \(\mathbf {x}\). Motivated by a need to characterize the local intrinsic dimensionality in the vicinity of individual reference points, we are interested in the limit of \({\text {ID}}_F(r)\) as the distance r tends to 0. For convenience, for non-zero distances r we refer to \({\text {ID}}_F(r)\) as the indiscriminability of F at r, and to \({\text {ID}}^{*}_F\triangleq \lim _{r\rightarrow 0}{\text {ID}}_F(r)\) as the local intrinsic dimension (or LID) of F.

In the ideal case where the data in the vicinity of \(\mathbf {x}\) is distributed uniformly within a submanifold in \(\mathcal {D}\), \({\text {ID}}^{*}_F\) would equal the dimension of the submanifold; however, in general these distributions are not ideal, the manifold model of data does not perfectly apply, and \({\text {ID}}^{*}_F\) is not necessarily an integer. Nevertheless, the local intrinsic dimensionality would give an indication of the dimension of the submanifold containing \(\mathbf {x}\) that would best fit the data distribution in the vicinity of \(\mathbf {x}\).
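As a concrete check (our worked example, not one taken from [16]), consider data distributed uniformly within an m-dimensional ball of radius w centered at \(\mathbf {x}\). The induced distance cdf is \(F(r) = (r/w)^m\) for \(0\le r\le w\), and Theorem 1 gives

$$ {\text {ID}}_F(r) \,\, = \,\, \frac{r\cdot F'(r)}{F(r)} \,\, = \,\, \frac{r\cdot m\, r^{m-1}/w^m}{(r/w)^m} \,\, = \,\, m \quad \mathrm {for~all~} 0< r < w , $$

so that \({\text {ID}}^{*}_F = m\), the dimension of the supporting manifold, as expected.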

The indiscriminability function \({\text {ID}}_{F}\) can be seen to fully characterize its associated function F.

Theorem 2

(LID Representation Theorem [16]). Let \(F:\mathbb {R}\rightarrow \mathbb {R}\) be a real-valued function, and assume that \({\text {ID}}^{*}_{F}\) exists. Let x and w be values for which x/w and F(x)/F(w) are both positive. If F is non-zero and continuously differentiable everywhere in the interval \([\min \{x,w\},\max \{x,w\}]\), then

$$\begin{aligned} \frac{ F(x) }{ F(w) }&= \left( \frac{x}{w}\right) ^{{\text {ID}}^{*}_{F}} \cdot {G}_{F}(x,w) , \quad \mathrm {where} \\ {G}_{F}(x,w)&\triangleq \exp \left( \int _{x}^{w} \frac{{\text {ID}}^{*}_{F}-{\text {ID}}_{F}(t)}{t} \,\mathrm {d}t \right) , \end{aligned}$$

whenever the integral exists.

In [16], conditions on x and w are provided under which \({G}_{F}(x,w)\) can be seen to tend to 1 as \(x,w\rightarrow 0\). The function \({G}_{F}(x,w)\) is related to the slowly-varying functions of long-standing interest within the EVT research community [10, 25], as it governs the rate of convergence of F to its ideal (asymptotic) form. Next, we revisit this issue so as to prove statements, useful for the analysis of limits of density ratios, that do not explicitly rely on formulations involving \({G}_{F}\).

3 LID-Aware Density Ratios

In this section, we present statements concerning the limits of ratios of two functions, with properties that allow them to be applied to situations in which density ratios are to be modeled. Instead of considering the ratio of pdfs over an infinitesimal volume whose dimensions depend on knowledge of a full-dimensional coordinate system (as in the usual sense of density ratio estimation), we consider as the underlying volumes spheres with infinitesimal radii. Local density ratios can then be described as the ratio of two cdfs of distance distributions, each representing the probability measure captured by a sphere, whose radii (the function arguments) both tend to zero in some fashion.

Although this work is motivated by an interest in modeling limits of density ratios, the theoretical statements will be presented in a more generally applicable way. Here, we assume only that the functions are smooth (continuously differentiable), positive, and vanish at zero—all properties that are assumed, either explicitly or implicitly, in traditional pdf-based density ratio estimation. For each of the statements, we show how the existence of ratio limits relates to the underlying local intrinsic dimensionalities of the two functions involved, in a natural way.

3.1 Limits of Ratios of Different Functions

Intuitively speaking, the first result to be presented shows that for a density ratio limit to be greater than zero, the LID values of the numerator function and denominator function must be the same—any difference in the LID values would result in the limit either vanishing or diverging.

Lemma 1

Let \(\alpha ,\beta :\mathbb {R}^{\ge 0}\rightarrow \mathbb {R}^{\ge 0}\) be functions such that \(\alpha (0)=\beta (0)=0\), and for some value of \(r>0\), their restrictions to the interval (0, r) are continuously differentiable and positive. Let us also assume that \({\text {ID}}^{*}_{\alpha }\) and \({\text {ID}}^{*}_{\beta }\) both exist and are positive. Further, let \(\varDelta = {\text {ID}}^{*}_{\beta }-{\text {ID}}^{*}_{\alpha }\), and let \(\beta _0\) be the function obtained by decomposing \(\beta \) into \(\beta (u)=u^{\varDelta }\cdot \beta _0(u)\). Consider the limits of ratios

$$ \lambda \,\, \triangleq \,\, \lim _{u\rightarrow 0} \frac{\beta (u)}{\alpha (u)} \quad \mathrm {and} \quad \lambda _0 \,\, \triangleq \,\, \lim _{u\rightarrow 0} \frac{\beta _0(u)}{\alpha (u)} . $$

If \(\lambda \) exists and is positive, then \({\text {ID}}^{*}_{\alpha }={\text {ID}}^{*}_{\beta }\). Alternatively, if \(\lambda _0\) exists and is positive, then the following statements hold:

  1. \(\lambda =0\) if and only if \({\text {ID}}^{*}_{\alpha }<{\text {ID}}^{*}_{\beta }\) (in which case \(\varDelta >0\));

  2. \(\lambda >0\) if and only if \({\text {ID}}^{*}_{\alpha }={\text {ID}}^{*}_{\beta }\) (in which case \(\varDelta =0\) and \(\lambda = \lambda _0\));

  3. \(\lambda \) diverges to \(+\infty \) if and only if \({\text {ID}}^{*}_{\alpha }>{\text {ID}}^{*}_{\beta }\) (in which case \(\varDelta <0\)).

Proof

Since \(\alpha (0)=0\), the limit \(\lambda _0\) can exist only if \(\beta _0(0)=0\) as well. Note also that \(\beta _0\) must be continuously differentiable and positive over the full range (0, r). L'Hôpital's rule can therefore be applied together with Theorem 1, yielding

$$ \lambda _0 \,\, = \,\, \lim _{u\rightarrow 0} \frac{\beta _0(u)}{\alpha (u)} \,\, = \,\, \lim _{u\rightarrow 0} \frac{\beta '_0(u)}{\alpha '(u)} \,\, = \,\, \lim _{u\rightarrow 0} \frac{{\text {ID}}_{\beta _0}(u)\cdot \beta _0(u)}{{\text {ID}}_{\alpha }(u)\cdot \alpha (u)} . $$

The right-hand limit can be separated to produce

$$ \lambda _0 \,\, = \,\, \lim _{u\rightarrow 0} \frac{\beta _0(u)}{\alpha (u)} \,\, = \,\, \frac{{\text {ID}}^{*}_{\beta _0}}{{\text {ID}}^{*}_{\alpha }} \lim _{u\rightarrow 0} \frac{\beta _0(u)}{\alpha (u)} \,\, = \,\, \frac{{\text {ID}}^{*}_{\beta _0}}{{\text {ID}}^{*}_{\alpha }} \cdot \lambda _0 . $$

These equalities imply that whenever \(\lambda _0>0\), we have that \({\text {ID}}^{*}_{\alpha }={\text {ID}}^{*}_{\beta _0}\), and also that \({\text {ID}}^{*}_{\beta } = {\text {ID}}^{*}_{\beta _0} + \varDelta \). Similar arguments also show that \({\text {ID}}^{*}_{\alpha }={\text {ID}}^{*}_{\beta }\) whenever \(\lambda \) exists and is positive.

The existence of limit \(\lambda _0\) has implications for the existence of \(\lambda \), which can be expressed as

$$ \lambda \,\, = \,\, \lim _{u\rightarrow 0} \frac{\beta (u)}{\alpha (u)} \,\, = \,\, \lim _{u\rightarrow 0} \frac{u^{\varDelta }\cdot \beta _0(u)}{\alpha (u)} \,\, = \,\, \lambda _0 \lim _{u\rightarrow 0} u^{\varDelta } . $$

Whenever \(\lambda _0\) exists and is positive, we have that \(\lambda \) behaves as the limit of \(u^{\varDelta }\) as \(u\rightarrow 0\), in that \(\lambda =0\) if and only if \(\varDelta >0\), and \(\lambda \) diverges to \(+\infty \) if and only if \(\varDelta < 0\). Otherwise, \(\lambda >0\) if and only if \(\varDelta =0\), in which case \(\lambda =\lambda _0\).    \(\square \)

Lemma 1 tells us that in an asymptotic sense, density ratios are only meaningful when the local intrinsic dimensionalities of the two functions are equal. This accords well with what is already known of standard density ratios as defined using probability densities. However, the LID model has a distinct advantage in that it is implicitly defined for any locality that admits a ‘local’ distance distribution, and that when modeling a ‘global’ distribution, no arbitrary universally-fixed local intrinsic dimensionality need be imposed.
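A toy numeric illustration of Lemma 1 (our own example, not from the paper): taking \(\alpha (u)=u^2\) and \(\beta (u)=u^3\), so that \({\text {ID}}^{*}_{\alpha }=2\), \({\text {ID}}^{*}_{\beta }=3\), \(\varDelta =1\), and \(\beta _0(u)=u^2\), the ratio \(\lambda \) vanishes while \(\lambda _0=1\), exactly as case 1 of the lemma predicts.

```python
import numpy as np

u = np.logspace(-1, -6, 6)    # u tending toward 0
alpha, beta = u ** 2, u ** 3  # LID values 2 and 3, so Delta = 1
beta0 = beta / u ** 1         # beta0(u) = u**2
print(beta / alpha)           # lambda: tends to 0 (case 1 of Lemma 1)
print(beta0 / alpha)          # lambda_0: identically 1 here
```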

3.2 Parameterized Limits Involving a Single Function

Next, we turn our attention to the effect of applying a common function to the numerator and denominator of an existing ratio. Equivalently, this situation can be regarded as the effect on an existing ratio of two parameterized values of the same function, as the common parameter tends to zero. Here, we state and prove results showing that the LID Representation Theorem applies with simplified conditions, in terms of the existence of a limit involving the parameterization itself.

Lemma 2

Let \(F:\mathbb {R}^{\ge 0}\rightarrow [0,1]\) be a non-decreasing function, and assume that \({\text {ID}}^{*}_{F}\) exists. Let \(\alpha ,\beta :\mathbb {R}^{\ge 0}\rightarrow \mathbb {R}^{\ge 0}\) be functions such that \(\alpha (0)=\beta (0)=0\), and for some value of \(r>0\), their restrictions to the interval [0, r) are continuously differentiable and strictly monotonically increasing. For any constant \(c\ne 0\),

$$ \lim _{u\rightarrow 0} \left( \frac{\beta (u)}{\alpha (u)} \right) ^c \cdot {G}_{F}\left( \alpha (u),\beta (u)\right) \,\, = \,\, \lim _{u\rightarrow 0} \left( \frac{\beta (u)}{\alpha (u)} \right) ^c \,\, = \,\, \lambda ^c $$
(1)

whenever the limit \(\lambda =\lim _{u\rightarrow 0} \frac{\beta (u)}{\alpha (u)}\) exists and is positive. If instead \(c>0\) and \(\lambda \) diverges to \(+\infty \), or if \(c<0\) and \(\lambda =0\), then the limits in Eq. 1 both diverge to \(+\infty \). Otherwise, if \(c>0\) and \(\lambda =0\), or if \(c<0\) and \(\lambda \) diverges to \(+\infty \), then the limits in Eq. 1 are both zero.

Proof

Since \({\text {ID}}^{*}_{F}\) is assumed to exist, for any real value \(\epsilon \in (0,r)\) there must exist a value \(0<\delta<\epsilon <r\) such that \(t < \delta \) implies that \(|{\text {ID}}_{F}(t)-{\text {ID}}^{*}_{F}| < \epsilon \). Therefore, when \(0< \alpha (u) < \delta \) and \(0< \beta (u) < \delta \), bounding the integrand in the definition of \({G}_{F}\) (Theorem 2) yields

$$ -\epsilon \left| \ln \frac{\beta (u)}{\alpha (u)} \right| \,\, \le \,\, \ln {G}_{F}\left( \alpha (u),\beta (u)\right) \, = \, \int _{\alpha (u)}^{\beta (u)} \frac{{\text {ID}}^{*}_{F}-{\text {ID}}_{F}(t)}{t} \,\mathrm {d}t \,\, \le \,\, \epsilon \left| \ln \frac{\beta (u)}{\alpha (u)} \right| . $$

Exponentiating, and multiplying through by \(\left( \beta (u)/\alpha (u)\right) ^c\), we obtain

$$ \left( \frac{\beta (u)}{\alpha (u)} \right) ^{c-\epsilon _0} \,\, \le \,\, \left( \frac{\beta (u)}{\alpha (u)} \right) ^{c} \cdot {G}_{F}\left( \alpha (u),\beta (u)\right) \,\, \le \,\, \left( \frac{\beta (u)}{\alpha (u)} \right) ^{c+\epsilon _0} \, , $$

where \(\epsilon _0 = \epsilon \) if \(\beta (u) \ge \alpha (u)\), and \(\epsilon _0 = {-\epsilon }\) otherwise.

Let us assume that the limit \(\lambda =\lim _{u\rightarrow 0} \beta (u) / \alpha (u)\) exists. If \(\lambda >0\), then \(\lim _{u\rightarrow 0} \left( \beta (u) / \alpha (u) \right) ^{c}\) also exists, and equals \(\lambda ^c\). In this case, as \(\epsilon \) and \(\delta \) tend to 0, the monotonicity of \(\alpha (u)\) and \(\beta (u)\) implies that u is driven to 0 as well. Thus, we have that

$$ \lim _{u\rightarrow 0} \left( \frac{\beta (u)}{\alpha (u)} \right) ^{c} \cdot {G}_{F}\left( \alpha (u),\beta (u)\right) \, = \, \lambda ^{c} \, = \, \lim _{u\rightarrow 0} \left( \frac{\beta (u)}{\alpha (u)} \right) ^{c} . $$

However, if \(\lambda =0\) or diverges to \(+\infty \), then similar arguments show that the limits in Eq. 1 both diverge to \(+\infty \) (if \(c>0\) and \(\lambda \) diverges, or \(c<0\) and \(\lambda =0\)) or both converge to 0 (if \(c>0\) and \(\lambda =0\), or \(c<0\) and \(\lambda \) diverges).    \(\square \)

Lemma 2 states conditions for which the LID representation function \(G_F\) can be ignored. Using the lemma leads to the following simplified restatement of the LID Representation Theorem itself.

Theorem 3

Let \(F:\mathbb {R}^{\ge 0}\rightarrow [0,1]\) be a non-decreasing function, and assume that \({\text {ID}}^{*}_{F}\) exists and is positive. Let \(\alpha ,\beta :\mathbb {R}^{\ge 0}\rightarrow \mathbb {R}^{\ge 0}\) be functions such that \(\alpha (0)=\beta (0)=0\), and for some value of \(r>0\), their restrictions to the interval [0, r) are continuously differentiable and strictly monotonically increasing. Then

$$\begin{aligned} \lim _{u\rightarrow 0} \frac{F(\beta (u))}{F(\alpha (u))} \; = \; \lim _{u\rightarrow 0} \left( \frac{\beta (u)}{\alpha (u)} \right) ^{{\text {ID}}^{*}_{F}} \;\, = \;\, \lambda ^{{\text {ID}}^{*}_{F}} \end{aligned}$$
(2)

whenever the limit \(\lambda =\lim _{u\rightarrow 0} \frac{\beta (u)}{\alpha (u)}\) exists. If instead \(\lambda \) diverges to \(+\infty \), then the limits in Eq. 2 both diverge to \(+\infty \).

Proof

Applying Theorem 2 to the first limit in Eq. 2 yields

$$\begin{aligned} \lim _{u\rightarrow 0} \frac{F(\beta (u))}{F(\alpha (u))} \, = \, \lim _{u\rightarrow 0} \left( \frac{\beta (u)}{\alpha (u)} \right) ^{{\text {ID}}^{*}_{F}} \cdot {G}_{F}\left( \alpha (u),\beta (u)\right) . \end{aligned}$$

The result then follows from Lemma 2, with \(c={\text {ID}}^{*}_{F}\).    \(\square \)
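A numeric sanity check of Theorem 3 (our own example, with illustrative choices of F, \(\alpha \), and \(\beta \)): take \(F(r) = r^3(1+r)\), so that \({\text {ID}}^{*}_{F}=3\) while \({G}_{F}\ne 1\) away from zero, together with \(\beta (u)=2u\) and \(\alpha (u)=u\). The ratio \(F(\beta (u))/F(\alpha (u))\) then approaches \(\lambda ^{{\text {ID}}^{*}_{F}} = 2^3 = 8\).

```python
def F(r: float) -> float:
    # ID*_F = 3: the factor (1 + r) perturbs F away from the ideal r**3 form.
    return r ** 3 * (1.0 + r)

for u in (1e-1, 1e-3, 1e-6):
    print(F(2 * u) / F(u))  # 8.727..., 8.007..., 8.000008...: tends to 2**3
```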

3.3 Limits of Ratios of Inverse Functions

We next show the effect of taking the limit of ratios of the inverses of functions, in terms of the original functions as well as their local intrinsic dimensionalities. Here, we find that the limit of the ratio of the functions equals that of the ratio of their inverses, raised to the power of the local intrinsic dimension of either function.

When the original function limits are interpreted as the probability measure captured within neighborhoods whose radii tend to zero, the limits of their inverse functions can be regarded as the radii associated with neighborhoods whose captured probability measure tends to zero (as the neighborhoods shrink). Assessing the limits of ratios involving inverse functions thus gives modelers greater flexibility in designing estimators for density ratios.

Theorem 4

Let \(\alpha ,\beta :\mathbb {R}^{\ge 0}\rightarrow \mathbb {R}^{\ge 0}\) be functions such that \(\alpha (0)=\beta (0)=0\), and for some value of \(r>0\), their restrictions to the interval [0, r) are continuously differentiable and strictly monotonically increasing. Let us also assume that \({\text {ID}}^{*}_{\alpha }\) and \({\text {ID}}^{*}_{\beta }\) both exist and are positive. If the limit ratio \(\lambda \triangleq \lim _{u\rightarrow 0} \frac{\beta (u)}{\alpha (u)}\) exists, then

$$ \lambda \,\, = \,\, \lim _{p\rightarrow 0} \left( \frac{\alpha ^{-1}(p)}{\beta ^{-1}(p)} \right) ^{{\text {ID}}^{*}_{\alpha }} \,\, = \,\, \lim _{p\rightarrow 0} \left( \frac{\alpha ^{-1}(p)}{\beta ^{-1}(p)} \right) ^{{\text {ID}}^{*}_{\beta }} . $$
(3)

In addition, the limits in Eq. 3 diverge to \(+\infty \) whenever \(\lambda \) diverges to \(+\infty \).

Proof

Since \(\alpha (0)=\beta (0)=0\), the strict monotonicity of \(\alpha \) and \(\beta \) over the range [0, r) implies that their inverse functions \(\alpha ^{-1}(p)\) and \(\beta ^{-1}(p)\) both exist for all \(p\in [0,s)\), where \(s\triangleq \min \{\alpha (r),\beta (r)\}\). Moreover, the inverse function theorem implies that \(\alpha ^{-1}\) and \(\beta ^{-1}\) are themselves continuously differentiable over [0, s).

Consequently, using Theorem 3, the limit ratio \(\lambda \) can be expanded as follows:

$$ \lambda \,\, = \,\, \lim _{u\rightarrow 0} \frac{\beta (u)}{\alpha (u)} \,\, = \,\, \lim _{p\rightarrow 0} \frac{\alpha \left( \alpha ^{-1}(p)\right) }{\alpha \left( \beta ^{-1}(p)\right) } \,\, = \,\, \lim _{p\rightarrow 0} \left( \frac{\alpha ^{-1}(p)}{\beta ^{-1}(p)} \right) ^{{\text {ID}}^{*}_{\alpha }} \,\, = \,\, \lim _{p\rightarrow 0} \left( \frac{\alpha ^{-1}(p)}{\beta ^{-1}(p)} \right) ^{{\text {ID}}^{*}_{\beta }} , $$

after the substitution of \(p = \beta (u)\). Hence, Theorem 3 guarantees that all these limits exist and equal \(\lambda \) if the limit \(\lambda \) exists; otherwise, if \(\lambda \) diverges to \(+\infty \), then the limits all diverge. Note that the last equality holds due to Lemma 1, which states that \({\text {ID}}^{*}_{\alpha }={\text {ID}}^{*}_{\beta }\) for the case when \(\lambda >0\).    \(\square \)

It is worth noting here that, as shown by Lemma 1, these limits are only positive when the LID values of the two original functions are identical; otherwise, the limits either both vanish or both diverge.
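A numeric check of Theorem 4 (our own example): for \(\alpha (u)=u^2\) and \(\beta (u)=4u^2\), we have \(\lambda =4\) and \({\text {ID}}^{*}_{\alpha }={\text {ID}}^{*}_{\beta }=2\); the inverse-function ratio is \(\alpha ^{-1}(p)/\beta ^{-1}(p)=2\) for every p, and indeed \(2^2=\lambda \).

```python
import numpy as np

alpha_inv = lambda p: np.sqrt(p)       # inverse of alpha(u) = u**2
beta_inv = lambda p: np.sqrt(p) / 2.0  # inverse of beta(u) = 4 * u**2
for p in (1e-2, 1e-4, 1e-8):
    print((alpha_inv(p) / beta_inv(p)) ** 2)  # 4.0 = lambda, for every p
```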

4 Conclusion

In this paper, we presented theoretical statements that can serve as a foundation for modeling density ratios in terms of local intrinsic dimensionality. These formulations give greater flexibility when modeling data under the assumption of local variation in intrinsic dimensionality, in that no explicit dependence on a fixed-dimensional data representation is required. In particular, the result of Theorem 4 on the existence and nature of the ratio of inverse functions allows modelers to move flexibly and interchangeably between a distance-based model of local probability density on the one hand, and a probability-based model of neighborhood radius ratios on the other—all taking into proper account the effect of local intrinsic dimensionality.

As mentioned earlier, distributional modeling of density ratios is already a well-established strategy within the machine learning community  [34]. The density ratio formulation considered in this paper can (in principle) be substituted for conventional pdf-based density ratios in models involving smooth (continuously differentiable) distributions—the ratio of pdfs thereby being replaced by the limits of ratios of cdfs of distance distributions. In data mining and other settings where the data is not explicitly modeled in terms of smooth distributions, the connection to LID-aware density ratios can still be made, albeit less directly. A data set consisting of n points can be regarded as a sample drawn from some unknown global distribution that, from any given point of interest \(\mathbf {x}\), induces a distribution of distances. Letting F denote the cdf of the distribution of distances to \(\mathbf {x}\), the expected number of data points lying within distance r of \(\mathbf {x}\) is simply \(n\cdot F(r)\). If r is the radius of the k-nearest neighborhood of \(\mathbf {x}\), then k/n can serve as a (crude) estimate of F(r). In this sense, within the underlying continuous model, formulations involving k, n and r can be viewed as sample-based estimators of formulations involving F and \({\text {ID}}_{F}\). Continuous limit-based and LID-aware forms such as those considered in this paper can consequently be applied to yield explanations of asymptotic behavior, as the sample size n tends to infinity. Moreover, an LID-aware asymptotic analysis could conceivably suggest alternative heuristics with improved performance characteristics.
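To make the preceding discussion concrete, the following is a hypothetical sample-based sketch (our own construction, suggested by Theorem 4 rather than prescribed by this paper): the k-NN radii at two query points both capture probability measure approximately k/n, so their ratio, raised to an MLE estimate of the (assumed common) LID, estimates the local density ratio. All function names and parameter choices are illustrative.

```python
import numpy as np

def knn_dists(X: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Sorted distances from query point q to its k nearest neighbors in X."""
    return np.sort(np.linalg.norm(X - q, axis=1))[:k]

def lid_mle(r: np.ndarray) -> float:
    return -1.0 / np.mean(np.log(r / r[-1]))

def lid_aware_density_ratio(X, q1, q2, k=50):
    """Estimate the ratio of probability measures captured near q2 vs. near q1,
    via Theorem 4 with p = k/n (a hypothetical estimator, not from the paper)."""
    r1, r2 = knn_dists(X, q1, k), knn_dists(X, q2, k)
    # Lemma 1: the limit is positive only when the two LID values agree,
    # so we average the two local estimates.
    lid = 0.5 * (lid_mle(r1) + lid_mle(r2))
    return (r1[-1] / r2[-1]) ** lid

# Usage: density at a tail point of a 2-D Gaussian relative to its mode;
# the true pdf ratio is exp(-4), approximately 0.018.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 2))
print(lid_aware_density_ratio(X, np.zeros(2), np.array([2.0, 2.0])))
```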

In the data mining setting of outlier detection [7], density ratio modeling has already assumed a place of central importance. The classic (and still state-of-the-art) LOF family of outlier detection methods [5] assesses the outlierness of a test data point through a ratio of densities, one at the test point, and the other with respect to an aggregation of points in the vicinity of the test point. LOF formulations essentially estimate density using neighborhood-based criteria, in terms of the radii within which a predetermined number of points are captured. However, for those settings where the dimensional characteristics are assumed to vary from locality to locality, Theorem 4 implies that any use of the ratio of distances to estimate density ratios should also take the local intrinsic dimensionality into account. The theoretical results of this paper underscore the need for the development of LID-aware local outlier detection techniques; work in this direction is already underway.

LID-aware density ratio modeling may also have useful applications in density-based clustering  [6] and other non-parametric unsupervised learning settings that exploit similarity information. One such possibility involves the well-known DBSCAN family of clustering methods  [12], which relies on absolute density thresholding to determine clusters, where density is (typically) estimated in terms of the numbers of points enclosed within neighborhoods of fixed radius. Extension of the DBSCAN strategy to account for LID-aware density would potentially allow for the formation of clusters whose densities are not necessarily high in an absolute sense, but instead high relative to the density within some background distribution. This has the advantage that locally-dense configurations of data points within sparse regions may still be discoverable as clusters, even in the presence of other clusters in regions that are much more dense.

In the larger sense, with its unification of the notions of distance, probability, and local dimensionality, LID-aware modeling presents an opportunity to reconcile established heuristics in the area of data mining with the distributional framework that serves as the foundation of machine learning and other areas in AI. By breaking the dependence of models on a global dimension parameter, it also offers better support in subspace-based modeling and other contexts in which sparse-featured solutions are sought. The theoretical results presented in this paper on LID-aware density ratios will hopefully encourage and support further research in this direction.