
1 Introduction

Given a dataset in a metric space, selecting a representative subset of that dataset is a common operation in data analysis and data indexing. It is well known that obtaining a decent approximation of the cluster centers prior to running a clustering algorithm such as \(k\)-means improves not only the speed of convergence but also the quality of the final result [2].

Pivot-based exact and approximate indexing techniques rely on the prior selection of a pivot set, which is used in two main mechanisms. Defining pivots as landmarks in the metric space makes it possible to precompute and store the distances from all data to this set, and to use this information, along with the triangle inequality, to build an exclusion criterion [5].

Pivots may also be used as landmarks to represent the data in permutation-based indexing strategies. A query then locates data in its neighborhood by activating pivots and selecting the data with a similar activation pattern. In both cases, the idea is to restrict the number of data for which an exact distance computation is performed [1, 3, 10].

In the parallel field of large-scale data visualization (outside the scope of this paper), the smart sub-sampling of a dataset into a reduced representative subset ensures a smooth and accurate display.

In this paper, we first study approaches for data sampling and the constraints, statistical or geometrical, that they can enforce. We then propose the Hubness Half Space Partitioning (HubHSP), which builds on the Half Space Partitioning (HSP) [4] to construct a data selector that effectively combines such geometrical and statistical constraints.

We demonstrate empirically the validity and stability of our proposal in various experimental conditions.

2 Dataset Sampling Strategies

Given an N-sized dataset \(\mathcal {X} = \{x_i\}_{i\in \mathopen {[\![}N\mathclose {]\!]}}\) of \(\varOmega \subseteq \mathbb {R}^D\), classical data sampling strategies are generally based on either statistical or geometric constraints.

2.1 Density-Based Sampling

One natural way to approach dataset re-sampling is from a statistical perspective. Here, the dataset \(\mathcal {X} \) is supposed to be an N-sized i.i.d. sample of a probability density function (pdf) \(f_\mathcal {X} \). In other words, \(\{x_i\}_{i\in \mathopen {[\![}N\mathclose {]\!]}}\) is one realization of a set of N independent random variables \(\{X_i\}_{i\in \mathopen {[\![}N\mathclose {]\!]}}\) identically distributed according to this pdf (\(X_i \sim f_\mathcal {X} \), \(i\in \mathopen {[\![}N\mathclose {]\!]}\)).

Re-sampling the dataset \(\mathcal {X} \) into a subset \(\mathcal {Y} =\{y_j\}_{j\in \mathopen {[\![}n\mathclose {]\!]}}\) with \(n\le N\) therefore amounts to making a selection \(\mathcal {Y} \subseteq \mathcal {X} \) of n data from \(\mathcal {X} \). In this case, a subset of indices \(i_j\in \mathopen {[\![}N\mathclose {]\!]}\) is chosen so that \(y_j = x_{i_j}~\forall j\in \mathopen {[\![}n\mathclose {]\!]}\). As shown below, a uniform sampling of indices from within \(\mathopen {[\![}N\mathclose {]\!]}\) guarantees that \(\mathcal {Y} \) is also a sample of the pdf \(f_\mathcal {X} \) (i.e. \(f_\mathcal {Y} =f_\mathcal {X} \)).

Representation Properties. Maintaining the probability density function of a sample has specific implications. Statistically, a high value of the pdf at a location \(x\in \varOmega \) makes the likelihood of a sample at this location \({{\,\textrm{P}\,}}(X_i=x)\) accordingly high.

Conversely, a crude empirical estimate \(\hat{f}_\mathcal {X} (x)\) of the pdf at location x is given by the density of samples from \(\mathcal {X} \) around x. Classically, the density is defined as the number of objects of interest per unit of volume. Hence, we can define

$$\hat{f}_\mathcal {X} (x) =\frac{|\mathcal {X} \cap \mathcal{B}(x,\rho )|}{{{\,\textrm{vol}\,}}(\mathcal{B}(x,\rho ))}~~\text {for some small}~\rho >0$$

where we consider the ball \(\mathcal{B}(x,\rho )=\{y\in \varOmega ~|~d(x,y)\le \rho \}\) as a unit volume. In practice, we only have access to the data from \(\mathcal {X} \). Hence the estimate is only non-zero when the ball \(\mathcal{B}(x,\rho )\) contains data samples. As a result, we are led to using the k nearest neighbors of x from \(\mathcal {X} \) (\(\mathcal{V}^k_\mathcal {X} (x)\)) to estimate the density:

$$\begin{aligned} \hat{f}_\mathcal {X} (x) =\frac{k}{{{\,\textrm{vol}\,}}(\mathcal{V}^k_\mathcal {X} (x))}~~\text {for some}~k>0 \end{aligned}$$
(1)

Note that, following the above, the volume \({{\,\textrm{vol}\,}}(\mathcal{V}^k_\mathcal {X} (x))\) can be taken as the volume of the enclosing ball (\({{\,\textrm{vol}\,}}(\mathcal{V}^k_\mathcal {X} (x)) = {{\,\textrm{vol}\,}}(\mathcal{B}(x,\rho ))\) with \(\rho \) the distance to the \(k^{\text {th}}\) neighbor).
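As an illustration only, the following Python sketch computes the crude kNN density estimate of Eq. (1) for a Euclidean metric; the function name and the use of the closed-form volume of a D-dimensional ball are our own choices, not prescribed by the text.

```python
import numpy as np
from scipy.special import gamma

def knn_density(X, x, k):
    """Crude kNN density estimate f_hat(x) = k / vol(V^k_X(x)) of Eq. (1),
    where the volume is that of the ball enclosing the k nearest neighbors."""
    d = np.linalg.norm(X - x, axis=1)          # Euclidean distances to all data
    rho = np.sort(d)[k - 1]                    # distance to the k-th neighbor
    D = X.shape[1]
    vol = np.pi ** (D / 2) / gamma(D / 2 + 1) * rho ** D   # volume of B(x, rho)
    return k / vol                             # exclude x from X if x is a data point
```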

This view justifies that \(\hat{f}_\mathcal {X} (x) =\hat{f}_\mathcal {Y} (x)\) as follows [11]:

Let \({{\,\textrm{P}\,}}(x_j\in \mathcal{V}^k_\mathcal {X} (x_i))=p_{j|i}\)

then \({{\,\textrm{P}\,}}(x_j\in \mathcal{V}^k_\mathcal {Y} (x_i))={{\,\textrm{P}\,}}(x_j\in \mathcal{V}^k_\mathcal {X} (x_i), x_j\in \mathcal {Y}){\mathop {=}\limits ^{\perp \!\!\! \perp }}p_{j|i}{{\,\textrm{P}\,}}(x_j\in \mathcal {Y})\).

If we sample uniformly n indices \(j\in \mathopen {[\![}N\mathclose {]\!]}\) then \({{\,\textrm{P}\,}}(x_j\in \mathcal {Y})=\frac{n}{N}\). As a result, \({{\,\textrm{P}\,}}(x_j\in \mathcal{V}^k_\mathcal {Y} (x_i))\propto p_{j|i}\) and the normalization ensures that \(\hat{f}_\mathcal {X} \) and \(\hat{f}_\mathcal {Y} \) are estimates of the same original density \(f_\mathcal {X} \).   \(\square \)

This also pinpoints the fact that, since \(\mathcal {Y} \subseteq \mathcal {X} \) preserves the original density \(f_\mathcal {X} \), \(\mathcal {X} \) can be uniformly partitioned into equivalence classes whose representative centers are the points \(x_j \in \mathcal {Y} \) and whose respective radii depend on the local density.

From (1), for a fixed k, \(\hat{f}_\mathcal {X} \) varies according to the value of \({{\,\textrm{vol}\,}}(\mathcal{V}^k_\mathcal {X} (x))\): the larger the volume required to hold the k nearest neighbors, the lower the density. Hence, based on kNN, the radii of the Dirichlet domains in \(\mathcal {X} \) centered at \(\mathcal {Y} \) adapt to the local density. In that respect, density-based sampling corresponds to nearest neighbor queries with fixed k (i.e. kNN queries).

The direct implication of the above properties is that, if an indexing technique uses the above-defined \(\mathcal {Y} \) as its representative (pivot) set, then the inverted lists \(\mathcal {L} _j\) associated with the pivots \(x_j\) and defined by

$$\mathcal {L} _j=\{x_i\in \mathcal {X} ~|~d(x_i,x_j)\le d(x_i,x_k)~~\forall x_k\in \mathcal {Y} \}$$

are of expected constant size (\(\mathbb {E}\,|\mathcal {L} _j|\simeq N/n\)). Such a strategy is therefore profitable for indexing: short inverted lists are desirable for performance, and a uniform partition of \(\mathcal {X} \) into inverted lists guarantees this minimum.
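For concreteness, the following sketch builds the inverted lists \(\mathcal {L} _j\) (Dirichlet domains) by assigning every datum to its nearest pivot; it is a brute-force Euclidean version for illustration, with names of our own choosing.

```python
import numpy as np

def inverted_lists(X, pivot_idx):
    """Assign every x_i to its nearest pivot and return the lists L_j.
    Brute force: the (N x n) distance matrix may be large for big datasets."""
    P = X[pivot_idx]                                        # (n, D) pivots
    d2 = ((X[:, None, :] - P[None, :, :]) ** 2).sum(-1)     # squared distances
    nearest = d2.argmin(axis=1)                             # nearest pivot per datum
    return {j: np.flatnonzero(nearest == j) for j in range(len(pivot_idx))}
```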

However, preserving the density in the representative sample, and therefore creating a non-uniform geometrical partition of the data space, is adverse when it comes to (geometrically) locating a query with respect to the dataset. At that time, the relevance of a pivot \(x_j\in \mathcal {Y} \) is related to its covering radius (e.g. \({{\,\textrm{vol}\,}}(\mathcal{B}(x_j,\rho ))\)).

Further, given a fixed representation budget of pivots, the highest value of the lower bound on the distance from any query to any pivot is obtained with a geometrically uniform partition of the space. Emphasizing geometry (rather than density) therefore supports a more robust exclusion mechanism. For the same reason, it is also known that permutation-based indexing schemes that locate data by pivot activation benefit from a uniform partition of the data space by pivots [1].

2.2 Geometry-Based Sampling

We therefore investigate the construction of a set of representatives \(\mathcal {Y} \) based on geometric constraints. The dataset \(\mathcal {X} \) is typically embedded into a domain \(\varOmega \subset \mathbb {R}^D\) that can be sampled using a D-dimensional regular lattice. Should any element from \(\mathcal {X} \) fall into a simplex of the lattice, the center of that simplex (or the closest data from \(\mathcal {X} \)) may be taken as a representative. Basic examples of such a sampling include the regular quantization of the coordinates of the original domain, possibly after applying an analysis such as PCA to discover (and potentially decimate) uncorrelated coordinates.

Representation Properties. Such a sampling strategy offers the advantage that the representative set \(\mathcal {Y} \) lies close to a regular lattice and this regular structure may be exploited by the indexing.

To ensure geometric representation properties for \(\mathcal {X} \), the criterion can be expressed as “\(\mathcal {Y} \) covers uniformly the convex hull of \(\mathcal {X} \)”, where the covering can be quantified by the k-center criterion:

$$\mathcal {Y} = \mathop {\textrm{argmin}}_{\begin{array}{c} \mathcal {S} \subset \mathcal {X} \\ |\mathcal {S} |=k \end{array}}\max _{x\in \mathcal {X}}d(x,\mathcal {S})$$

where \(d(x,\mathcal {S})=\min _{x'\in \mathcal {S}}d(x,x')\). This criterion ensures that data in \(\mathcal {X} \) is never far from a sample in \(\mathcal {Y} \); it is equivalent to minimizing the diameter of the Dirichlet domains built in \(\mathcal {X} \) from a set \(\mathcal {Y} \) of size k. In that respect, geometric sampling corresponds to nearest neighbor queries with a fixed range \(\varepsilon \) (i.e. range queries uncovering the \(\varepsilon \)NN). In that case, pivots are associated with a fixed covering radius and the inverted lists have lengths that adapt to the local density.
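Since optimizing the k-center criterion exactly is intractable in general, a classical greedy 2-approximation is the farthest first traversal, also used as a baseline in Sect. 4. The sketch below is a straightforward Euclidean version; the random starting point and the function name are our own choices.

```python
import numpy as np

def farthest_first_traversal(X, n, seed=0):
    """Greedy k-center approximation: repeatedly add the point of X that is
    farthest from the current representative set."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]              # arbitrary starting point
    d = np.linalg.norm(X - X[idx[0]], axis=1)      # distance to the current set
    for _ in range(n - 1):
        j = int(d.argmax())                        # farthest remaining point
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(X - X[j], axis=1))
    return np.array(idx)
```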

3 Homogeneous Space Partitioning

3.1 Half Space Partitioning

In [7, 9], we demonstrated that the local degree of the neighborhood graph built using the Half Space Partitioning (HSP) strategy [4] is an accurate proxy for the measurement of local intrinsic dimensionality. This is an important property for designing a geometrically efficient sampling strategy.

Fig. 1. HSP construction and discarding strategy. The (red) center data \(x_i\) chooses its closest (green) neighbor \(x_j\) as HSP neighbor and discards all data closer to \(x_j\) than to itself (shaded half-space). \(x_k\) will be selected as the next closest neighbor (as symbolized by the dashed circle) and the next half-space (below the blue dashed line) discarded, until no neighbor of \(x_i\) remains (Color figure online)

Algorithm 1 recalls the construction of the HSP, illustrated for the 2D case in Fig. 1. The HSP strategy partitions the hypersphere around every \(x_i\) into cones (see the green dashed lines). In the HSP graph, each data point is connected (red edges) to its HSP neighbors; the mutual arrangement of these neighbors and its relationship with the kissing number correlate the degree of each node with the local dimensionality of the data [9]. Note that there is no upper bound on the distance from \(x_i\) to the next selected HSP neighbor.

Algorithm 1. Construction of the HSP graph.
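To make the selection rule explicit, here is a naive Python sketch of the HSP neighbor selection for a single point, following the description above and Fig. 1 (closest remaining point, then half-space discarding); it is an unoptimized illustration, not the paper's listing.

```python
import numpy as np

def hsp_neighbors(X, i):
    """HSP neighbors of x_i: repeatedly take the closest remaining point x_j,
    then discard every point that is closer to x_j than to x_i."""
    x = X[i]
    cand = np.delete(np.arange(len(X)), i)          # all other points
    neighbors = []
    while cand.size:
        d_xi = np.linalg.norm(X[cand] - x, axis=1)
        j = int(cand[d_xi.argmin()])                # closest remaining point
        neighbors.append(j)
        d_xj = np.linalg.norm(X[cand] - X[j], axis=1)
        cand = cand[d_xi < d_xj]                    # keep points closer to x_i
    return neighbors
```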

The construction of the HSP graph is highly parallel, since the neighborhood of every point is computed independently of the rest. While this is a clear computational benefit and makes the HSP graph reproducible regardless of the order in which the dataset is given, it also makes the structure of the HSP graph unpredictable, apart from the properties arising from sphere packing.

In particular, no control is applied over the indegree of each node (the number of edges pointing to it). As a result, there is no guarantee of a strong overlap between the HSP neighborhoods of two close points. Further, the specific structure of the HSP graph is sensitive to any data perturbation that would flip the order in which data appear as nearest neighbors of each other. In a setting where we use a point's neighborhood as its representative, we would rather introduce correlation between the neighborhoods of close points so as to:

  • ensure that two close points share representatives (geometric consistency)

  • obtain a compact, stable and sound representative sample of the data (statistical consistency)

  • minimize the overall number of representatives

Here, we propose the “Hubness-HSP” (HubHSP for short) as a graph spanner over \(\mathcal {X} \) supporting the selection of a representative set \(\mathcal {Y} \). We first propose the rationale for its construction and then derive the actual construction algorithm. We finally study and experimentally investigate the properties of the resulting HubHSP spanner for dataset sampling.

We wish to define the HubHSP as a structure that supports the selection of a representative set while maintaining the favorable geometric properties of the HSP: \(x_j\) being selected as a neighbor of \(x_i\) means that \(x_j\) represents the vicinity of \(x_i\), and we wish to concentrate this representation into a given budget of representatives \(\mathcal {Y} \). The base adaptation is therefore to install a control over the indegree of the nodes in the HubHSP. By enforcing nodes with a high indegree, we create “centrality hubs” that can be used to define the representatives \(\mathcal {Y} \) from the full set \(\mathcal {X} \).

We therefore define a “hubness factor” \(h_j\) at every node \(x_j\), which corresponds to its indegree during construction. Hence \(\sum _jh_j=N\), and the challenge is to allocate the \(h_j\) values so as to obtain concentrated hubs.

We build the graph following an aggregative compounding principle (see Fig. 2): a new datum is matched with its HubHSP neighbors (line 9 in Algorithm 2) according to the HSP geometry, while keeping the hubness as concentrated as possible by privileging existing hubs. Hence, at an intermediate stage, a datum \(x_i\) is connected to the strongest current hub \(x_j\) within its vicinity, and activates the HSP half-space discarding strategy.

Algorithm 2. Construction of the HubHSP graph.

We comment on the main lines of Algorithm 2:

  • Line 9: the current data \(x_i\) inspects a given vicinity \(\mathcal{V}(x_i)\) (e.g. its 100-NN neighborhood) and finds the data \(x_j\) of current maximal hubness \(h_j=\max _{x_k\in \mathcal{V}(x_i)}h_k\).

  • Lines 10–11: \(x_j\) is added as a neighbor of \(x_i\) by creating an edge \((x_i,x_j)\), thereby increasing the hubness (indegree) \(h_j\) of \(x_j\).

  • Line 12: the natural distance-based selection of the HSP guarantees geometrical consistency [4]. Since it is no longer used, consistency is restored by projecting the selected neighbors onto the sphere \(C_i\) centered at \(x_i\) and passing through the closest neighbor of \(x_i\) (blue circle in Fig. 2). In practice, this is done by a proper normalization of the vector \([x_i,x_j]\) into \([x_i,\tilde{x}_j]\) (see Annex); a sketch of this step is given below.
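For illustration, the following sketch implements lines 9–12 as described above for one point \(x_i\), assuming that \(h\) is a NumPy array of current hubness values and that \(C_i\) passes through the closest point of \(\mathcal{V}(x_i)\); the queue Q that chains the construction (mentioned in the adaptations listed below) is omitted, and all names are our own.

```python
import numpy as np

def hubhsp_neighbors(X, i, vicinity, h):
    """HubHSP neighbors of x_i (sketch of lines 9-12): pick the non-discarded
    point of maximal hubness in V(x_i), create the edge (incrementing h),
    project it onto the sphere C_i, and discard the corresponding half-space."""
    x, cand = X[i], np.asarray(vicinity)
    neighbors = []
    if cand.size == 0:
        return neighbors
    r = np.linalg.norm(X[cand] - x, axis=1).min()    # radius of C_i
    while cand.size:
        j = int(cand[h[cand].argmax()])              # strongest current hub
        neighbors.append(j)
        h[j] += 1                                    # edge (x_i, x_j): indegree grows
        v = X[j] - x
        x_tilde = x + r * v / np.linalg.norm(v)      # projection of x_j onto C_i
        d_xi = np.linalg.norm(X[cand] - x, axis=1)
        d_xt = np.linalg.norm(X[cand] - x_tilde, axis=1)
        cand = cand[d_xi < d_xt]                     # keep data closer to x_i
    return neighbors
```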

Fig. 2. HubHSP construction and discarding strategy. The current (red) center data \(x_i\) chooses from its vicinity \(\mathcal{V}(x_i)\) (red dashed circle) its (green) neighbor \(x_j\) of highest hubness (size of the data point) as HubHSP neighbor. It projects this data onto \(\tilde{x}_j\) on the largest empty circle (blue circle) and discards all data closer to \(\tilde{x}_j\) than to itself (shaded half-space). \(x_k\) will then be selected as the next non-discarded neighbor of highest hubness and the next half-space (to the left of the blue dashed line, the bisector of \([x_i,\tilde{x}_k]\)) discarded, until no neighbor of \(x_i\) remains non-discarded (Color figure online)

The main practical adaptations from the HSP construction strategy are:

  1. data is selected by decreasing hubness rather than by increasing distance;

  2. because of 1. above, the selection of neighbors for \(x_i\) (line 4 in Algorithm 1) has to happen within the pre-defined vicinity \(\mathcal{V}(x_i)\);

  3. because of 1. above, to maintain geometric consistency, points are projected onto a sphere of minimal radius around \(x_i\) before selection;

  4. since we now create a chain during the construction of the HubHSP (using Q), a starting point has to be defined.

The first and main benefit of this adaptation is the creation of a hubness index \(h_j\) per datum (node in the HubHSP graph): \(h_j\) is the indegree of node \(x_j\) in the HubHSP graph, i.e. it counts how many data \(x_i\) have \(x_j\) as a HubHSP neighbor. A node with a high hubness is therefore an interesting candidate for the representative subset. This provides a sound and natural strategy for the selection of \(\mathcal {Y} \): simply select nodes in decreasing order of their indegree (see the sketch below).
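The selection itself then reduces to a sort on the indegrees, as in this minimal sketch (function name of our own choosing):

```python
import numpy as np

def select_representatives(h, n):
    """Return the indices of the n nodes of largest hubness (indegree)."""
    return np.argsort(-np.asarray(h))[:n]
```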

As a result, the HubHSP graph combines two properties. By inheritance from the HSP process, the outdegree of every node reflects the local geometry (intrinsic dimensionality) of the data [9]. Through the hubness, the indegree of each node is now correlated with the statistical properties of the data.

Since in practice we need to define (limit) the vicinity \(\mathcal{V}(x_i)\) from where the HubHSP neighbors are selected (line 9 in Algorithm 2), the construction of this set impacts the resulting properties of the HubHSP graph.

  • if \(\mathcal{V}(x_i) = \mathcal{V}^k_\mathcal {X} (x_i)\), the kNN neighborhood of \(x_i\) in \(\mathcal {X} \), the span of this set is driven by the local density, as discussed above. Hence, the kNN-based HubHSP graph reflects the local density of data via arc lengths, on top of reflecting its geometry via outdegree.

  • if \(\mathcal{V}(x_i) = \mathcal{V}^{\varepsilon }_\mathcal {X} (x_i)\), the \(\varepsilon \)NN neighborhood of \(x_i\) in \(\mathcal {X} \), the span of this set is immune to the local density and it is the indegree of every neighbor that reflects this density.

Hence, the HubHSP graph adds to the HSP graph the encoding of the local density either via arc lengths (kNN) or indegree (\(\varepsilon \)NN).
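As an illustration of the two options, the following sketch builds both kinds of vicinities with scikit-learn; the parameter values mirror those of Sect. 4 but are dataset-dependent assumptions, and the function name is ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def vicinities(X, k=1000, eps=20.0):
    """Fixed-cardinality (kNN) and fixed-radius (epsilon-NN) vicinities V(x_i)."""
    nn = NearestNeighbors().fit(X)
    _, knn = nn.kneighbors(X, n_neighbors=k + 1)              # +1 to drop x_i itself
    knn = [row[1:] for row in knn]
    enn = nn.radius_neighbors(X, radius=eps, return_distance=False)
    enn = [np.setdiff1d(row, [i]) for i, row in enumerate(enn)]
    return knn, enn
```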

3.2 Complexity

The base complexity of the HubHSP construction algorithm is \(O(N^2D)\). It mimics that of the computation of any neighborhood graph, as it is dominated by the selection of candidate neighbors (line 9 in Algorithm 2). Such a complexity may classically be reduced by a pre-indexing of these neighborhoods. In Sect. 4, we present results against baselines whose base complexities are of the same order.

3.3 Generic Metric Spaces

Our discussion and illustrations have been concerned with a metric space \((\varOmega , d)\) where \(\varOmega \subset \mathbb {R}^D\) and d(., .) is the Euclidean distance function. All definitions provided here rely only on the existence of a proper distance function and therefore generalize to other metric spaces. The precise study of the properties obtained when constructing the HubHSP in such metric spaces is out of the scope of this paper and is left for an extension.

4 Experiments

We now experiment under various conditions and compare to relevant baselines.

4.1 Dataset

To highlight the properties of our proposal, we use data with various properties in terms of density and dimension D. As a base reference, we generate two artificial datasets with a uniform distribution, \(\mathcal{U}^{100K\times 2}\) and \(\mathcal{U}^{100K\times 10}\), containing 100’000 data of dimension \(D=2\) and \(D=10\), respectively. Note that in the latter case, a dataset of dimension 10 with 100’000 data is rather sparse.

To depart from the uniform distribution, we generate two datasets \(\mathcal{N}^{100K\times 2}\) and \(\mathcal{N}^{100K\times 10}\) with the same parameters but from a centered normal distribution. While uniformity makes the density of the data the same at every point in space, the normal distribution induces an exponential variation of the density across the space.

As a more realistic dataset, we use the first 500’000 data of the ANN SIFT (base set) benchmark [8]. In this case \(D=128\), inducing a very sparse set. We also use a dataset of Flow Cytometry data containing \(N=470'995\) \(D=18\)-dimensional data. By nature, this data aggregates in dense, localized clusters (see Fig. 3 for a 2D glance). Its distribution is therefore far from uniform, with large unpopulated parts of the space.

In all cases, we set the size n of the sample to 1% of the original size N. We fixed \(k=1000\) and \(\varepsilon =20\) to create the base neighborhoods (\(\mathcal{V}^k_\mathcal {X} (x_i)\) and \(\mathcal{V}^{\varepsilon }_\mathcal {X} (x_i)\) respectively) over which the HubHSP graph is built.

4.2 Baselines

Random. As discussed above, a uniform sampling of the data indices preserves the statistical properties (density) of the data in the sample.

Farthest First Traversal (FFT). In contrast, this geometrical strategy aims at spreading the representative set across the dataset by approximating the k-center problem [6]. With this strategy, the representative samples are expected to lie close to a regular grid.

Note that due to the concentration of distance phenomenon, this strategy loses its rationale in high dimensions.

\(k\)-means++ [2] adds a random component to the above FFT strategy: the farthest point becomes the most likely choice rather than a strict one, with probabilities depending on the density of the data. \(k\)-means++ is therefore interesting since it offers theoretical bounds on the representation and mixes geometrical and statistical constraints, as we aim to do here.

4.3 Measures and Results

We use the following measures to assess the characteristics of our proposed sampling. Results are reported in Table 1.

The empty sphere measure (top section) quantifies the uniformity of the sampling by measuring the diameter of the largest empty sphere lying between samples. In practice, it is the maximum distance between two neighboring samples.

Table 1. Evaluation measures across datasets and techniques. Top section: empty sphere. Middle section: lengths of the inverted lists (standard deviation). Bottom section: maximum distance. Values between parentheses are standard deviations

Since we wish for an equipartition of the space by the samples, the smaller this value, the better the quality of the sample. We report the mean and also measure the uniformity of this allocation by reporting the standard deviation (between parentheses).

We see that in the most basic conditions (\(\mathcal{U}^{100K\times 2}\)) all sampling strategies perform similarly. When the dimension increases (e.g. \(\mathcal{U}^{100K\times 10}\)), the data becomes sparser and geometrical techniques (such as FFT) fail. Our proposal is able to consistently reduce the value of the measure while keeping the variance at a comparable level.

The length of the inverted lists (middle section) is an indicator of the uniformity of the allocation of representatives to the data. In practice, since we use Dirichlet domains to define the lists, the average list length is simply the ratio between the sizes of the dataset and of the sample (\(\mathbb {E}\,|\mathcal {L} _j|= N/n\)), so only the standard deviation is reported. The smaller this value, the more uniform the partition.

We clearly see the same trend of lower variance in the length of inverted lists and therefore more stability in the allocation of representative data.

The maximum distance (bottom section) between a datum and its representative is, in contrast, computed from the data side. It is a geometric indicator of how well every datum is represented by the sample. Ideally, every datum should find a representative in its vicinity, so again, the smaller this value, the better. We report the mean and also measure the uniformity of this allocation by reporting the standard deviation (between parentheses).
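Under one plausible reading of these three measures, the following brute-force sketch computes them for a Euclidean metric; the exact definitions used for Table 1 may differ, so this is an illustrative assumption with names of our own choosing.

```python
import numpy as np

def evaluation_measures(X, Y_idx):
    """Empty-sphere proxy (distance between neighboring samples), standard
    deviation of the inverted-list lengths, and distance from each datum to
    its nearest representative (mean, std, max)."""
    Y = X[Y_idx]
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # (N, n) distances
    d_rep = d.min(axis=1)                                        # datum -> nearest sample
    lengths = np.bincount(d.argmin(axis=1), minlength=len(Y_idx))
    dYY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    np.fill_diagonal(dYY, np.inf)
    nn_sample = dYY.min(axis=1)                                  # sample -> nearest sample
    return {
        "empty_sphere": (nn_sample.mean(), nn_sample.std()),
        "inverted_list_std": lengths.std(),       # mean is N/n by construction
        "max_distance": (d_rep.mean(), d_rep.std(), d_rep.max()),
    }
```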

Fig. 3. Sampling strategies by the baselines and the HubHSP over a 2D slice of the FlowCyto dataset (\(\textrm{FlowCyto}^{471k\times 2}\)) as a low-dimensional non-uniform example. In each scatter plot, the dataset is shown in green and selected representatives are shown in red. [Top left] random uniform, [Top right] FFT, [Lower left] \(k\)-means++, [Lower right] HubHSP (ours) (Color figure online)

This measure shows that the HubHSP hubness allocates representatives closer to each datum than the other strategies do. This is explained by the ability of the HubHSP to better exploit the statistical and geometrical properties of the data when allocating a fixed budget of n representatives. This is most visible in the most adverse setting of high-dimensional non-uniform data (which corresponds to real datasets).

Figure 3 provides a visual intuition of the allocation of representatives in low-dimensional non-uniform data. The resulting samples (red points) are shown over the data (green points) for all baselines and for the HubHSP. An ideal sampling should show regularity (to avoid redundancy) and respect the data density.

Whereas random sampling (top left) is inefficient because it allocates redundant representative samples, the FFT (top right) is inefficient because it is blind to the local density. \(k\)-means++ (lower left) proposes an adequate mix of statistical and geometrical sampling, but the HubHSP (lower right) clearly adds a form of regularity that removes the local density artifacts due to random sampling, which explains its effectiveness in terms of the geometrical partitioning (Dirichlet domains) of the data.

Finally, Fig. 4 shows a histogram of the corresponding hubness values \(h_j\). A very large majority of these values are zero, which demonstrates the ability of the HubHSP to concentrate the indegree into a minority of large values (since \(\sum _jh_j=N\)). This indicates that only a small percentage of the data in \(\mathcal {X} \) then compete to enter \(\mathcal {Y} \).

Fig. 4. Hubness values for the 2D slice of the FlowCyto dataset (\(\textrm{FlowCyto}^{471k\times 2}\)) shown in Fig. 3 [Lower right]. Only about 2.2% of the values are non-zero.

5 Conclusion

Subsampling a finite dataset may be considered from either a statistical or a geometrical perspective, and classical strategies focus on one or the other. Based on the capability of the HSP graph to correlate with the local intrinsic dimensionality, we proposed the HubHSP to generate a sound data selection criterion combining geometrical and statistical properties.

We demonstrated that the HubHSP graph construction algorithm, a modification of the HSP graph construction, indicates a sound and stable selection of representative data. We compared it with classical selection algorithms and showed that the HubHSP is able to create a more robust and effective sampling by better exploiting geometrical constraints on top of statistical sampling.

More generally, this work relates to the ability of graph spanners to mirror and combine the geometrical and statistical properties of non-uniform point clouds in high dimensions. In [9], diffusion over neighborhood graphs was used to exhibit that structure, exploiting the link between connectivity (resp. degree) and centrality. There is much to explore in this interplay between data analysis methods and data modeling techniques to particularize subsets of a dataset.