Scalable density-based clustering with quality guarantees using random projections

Schneider, Johannes; Vlachos, Michail

doi:10.1007/s10618-017-0498-x

Scalable density-based clustering with quality guarantees using random projections

Published: 02 March 2017

Volume 31, pages 972–1005, (2017)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Scalable density-based clustering with quality guarantees using random projections

Download PDF

Johannes Schneider¹ &
Michail Vlachos²

767 Accesses
14 Citations
1 Altmetric
Explore all metrics

Abstract

Clustering offers significant insights in data analysis. Density-based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped clusters. Here, we present scalable density-based clustering algorithms using random projections. Our clustering methodology achieves a speedup of two orders of magnitude compared with equivalent state-of-art density-based techniques, while offering analytical guarantees on the clustering quality in Euclidean space. Moreover, it does not introduce difficult to set parameters. We provide a comprehensive analysis of our algorithms and comparison with existing density-based algorithms.

A New Density Clustering Method Using Mutual Nearest Neighbor

RADDACL2: a recursive approach to discovering density clusters

Article 28 November 2015

Density-Based Clustering Based on Hierarchical Density Estimates

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Clustering is an important operation for knowledge extraction. Its objective is to assign objects to groups such that objects within a group are more similar than objects across different groups. Subsequent inspection of the groups can provide important insights, with applications to pattern discovery (Whang et al. 2012), data summarization/compression (Koyutürk et al. 2005) and data classification (Chitta and Murty 2010). In the field of clustering, computationally light techniques, such as k-Means, are typically of heuristic nature, may require non-trivial parameters, such as the number of clusters, and often rely on stringent assumptions, such as the cluster shape. Density-based clustering algorithms have emerged as both high-quality and efficient clustering techniques with solid theoretical foundations on density estimation (Hinneburg and Gabriel 2007). They can discover clusters with irregular shapes and only require parameters that are relatively easy to set (e.g., minimum number of points per cluster). They can also help to assess important dataset characteristics, such as the intrinsic density of data, which can be visualized via reachability plots.

In this work, we extend the state of the art in density-based clustering techniques by presenting algorithms that significantly improve runtime, while providing analytical guarantees on the preservation of cluster quality. Furthermore, we highlight weaknesses of current clustering algorithms with respect to parameter dependency. Our performance gains and our quality guarantees are achieved through the use of random projections. A key theoretical result of random projections is that, in expectation, Euclidean distances are preserved. We exploit this in a pre-processing phase to partition objects into sets that should be examined together. The resulting sets are used to compute a new type of density estimate through sampling.

Our algorithm requires the setting of only a single parameter, namely, the minimum number of points in a cluster, which is customarily required as input in density-based techniques. In general, we make the following contributions:

We show how to use random projections to improve the performance of existing density-based algorithms, such as OPTICS and its performance-optimized version DeLi-Clu without the need to set any parameters. We introduce a new density estimate based on computing average distances. We also provide guarantees on the preservation of cluster quality and runtime.
The algorithm is evaluated extensively and yields performance gains of two orders of magnitude with provable degree of distortion on the clustering result compared with prevalent density-based approaches, such as OPTICS.

2 Background and related work

The majority of density-based clustering algorithms follow the ideas presented in DBSCAN (Ester et al. 1996), OPTICS (Ankerst et al. 1999) and DENCLUE (Hinneburg and Keim 1998). Our methodology is more similar in spirit to OPTICS, but relaxes several notions, such as the construction of neighborhood. The end result is a scalable density-based algorithm even without parallelization.

DBSCAN was the first influential approach for density-based clustering in the data-mining literature. Among its shortcomings are flat (not hierarchical) clustering, large complexity, and the need for several parameters (cluster radius, minimum number of objects). OPTICS overcame several of these weaknesses by introducing a variable density and requiring the setting of only one parameter (density threshold). OPTICS does not explicitly produce a data clustering but only a cluster ordering, which is visualized through reachability plots. Such a plot corresponds to a linear list of all objects examined, augmented by additional information, i.e., the reachability distance, that represents the intrinsic hierarchical cluster structure. Valleys in the reachability plot can be considered as indications of clusters. OPTICS has a complexity that is on the order of O($N\cdot |$neighbors|), which can be as high as O($N^2$) in the worst case, or O($N\log {N}$) in the presence of an index (discounting for the cost of building and maintaining the actual index). A similar complexity analysis applies for DBSCAN.

DENCLUE capitalizes on kernel density estimation techniques. Performance optimizations have been implemented in DENCLUE 2.0 (Hinneburg and Gabriel 2007) but its asymptotic complexity is still quadratic.

Other approaches to expedite the runtime of density-based clustering techniques involve implementations in Hadoop or using Graphical Processing Units (GPUs). For example, Cludoop (Yu et al. 2015) is a Hadoop-based density-based algorithm that reports an up to fourfold improvement in runtime. Böhm et al. (2009) presented CUDA-DClust, which improves the performance of DBSCAN using GPUs. They report an improvement in runtime of up to 15 times. G-DBSCAN (Andrade et al. 2013) and CudaSCAN (Loh and Yu 2015) are recent GPU-driven implementations of DBSCAN, and report an improvement in runtime of 100 and 160 times, respectively. Our approach uses random projections to speed up the execution, while at the same time having provable cluster quality guarantees. It exhibits an equivalent or more speedup than the above parallelization approaches, without the need for distributed execution, thus lending itself to a simpler implementation. Random projections can easily be parallelized. Thus, parallelization is likely to considerably improve performance of our algorithms.

Random projection based methodologies have also been used to speed up density-based clustering. For example, the algorithm in Urruty et al. (2007) leverages the observation that for high-dimensional data and a small number of clusters, it is possible to identify clusters based on the density of the projected points on a randomly chosen line. We do not capitalize on this observation. We attempt to determine neighborhood information using recursively applied random projections. The protocol in Urruty et al. (2007) is of “heuristic nature”, as attested by the original authors, so it does not provide any quality guarantees. It runs in time O$(n^2\cdot N + N\cdot \log N)$, where N points are projected onto n random lines. It requires the specification of several parameters, whereas our approach is parameter-light.

Randomly projected $k{-}d$-trees were introduced in Dasgupta and Freund (2008). A $k{-}d$-tree is a spatial data structure that splits data points into cells. The algorithm in Dasgupta and Freund (2008) uses random projections to partition points recursively into two sets. Our algorithm Partition shares this methodology, but uses a simpler splitting rule. We just select a projected point uniformly at random, whereas the splitting rule in Dasgupta and Freund (2008) requires finding the median of all projected points and using a carefully crafted jitter. Furthermore, we perform multiple partitionings. The purpose of Dasgupta and Freund (2008) is to serve as an indexing structure. Retrieving the k-nearest neighbors in a $k{-}d$ tree can be elaborate and suffers heavily from the curse of dimensionality. To find the nearest neighbor of a point, it may require to look at several branches (cells) in the tree. The cardinality of the branches searched grows with the dimensions. Even worse, computing k-nearest neighbors only with respect to OPTICS would mean that all information between cluster distances would be lost. More precisely, for any point of a cluster, the k-nearest neighbors would always be from the same cluster. Therefore, only knowing the k-nearest neighbors is not sufficient for OPTICS. To the best of our knowledge, there is no indexing structure that supports finding distances between two clusters (as we do). Nevertheless, an indexing structure such as Dasgupta and Freund (2008) or the one used in Achtert et al. (2006) can prove valuable, but might come with significant overhead compared with our approach. We get neighborhood information directly using small sets. Furthermore, we are also able to obtain distances between clusters. Therefore, our indexing structure might prove valuable for other applications.

Random projections have also been applied to hierarchical clustering (Schneider and Vlachos 2014), i.e., single and average linkage clustering (more precisely, Ward’s method). To compute a single linkage clustering it suffices to maintain the nearest neighbor that is not yet in the same cluster for each point. To do so, Schneider and Vlachos (2014) run the same partitioning algorithm as the one used here. In contrast to this work, it computes all pairwise distances for each final set of the partitioning. Because this requires quadratic time in the set size, it is essential to keep the maximum possible set size, $\textit{minPts}$, as small as possible. In contrast to this work, where $\textit{minPts}$ is related to the density parameter used in OPTICS, there is no relation to any clustering parameter. In fact, $\textit{minPts}$ might be increased during the execution of the algorithm. If a shortest edge with endpoints A,B is “unstable”, meaning that the two points A and B do not co-occur in many final sets, then $\textit{minPts}$ is increased. In this work there is no notion of neighborhood stability.

Multiple random projections onto one-dimensional spaces have also been used for SVM (Schneider et al. 2014). Note that for SVMs a hyperplane can be defined by a vector. The naive approach tries to guess the optimal hyperplane for an SVM using random projections. The more sophisticated approach uses local search to change the hyperplane, coordinate by coordinate.

Projection-indexed nearest neighbors have been proposed in Vries et al. (2012) for outlier detection. First, they identify potential k-nearest-neighbor candidates in a reduced dimensional space (spanned by several random projections). Then, they compute distances to this nearest-neighbor candidates in the original space to select the k-nearest-neighbors. In contrast, we perform (multiple) recursive partitionings of points using random projections to identify potential nearest neighbors.

Locality Sensitive Hashing (LSH) (Datar et al. 2004) also employs random projections. LSH does not perform a recursive partitioning of the dataset as we do, but splits the entire data set into bins of fixed width. It conducts multiple of these partitionings. Furthermore, in contrast to our technique, it requires several parameters, such as width of a bin, number of hash tables, and number of projections per hash value. These parameters typically require knowledge of the dataset for proper tuning.

3 Our approach

In density-based clustering, a key step is to discover the neighborhood of each object to estimate the local density. Traditional density-based clustering algorithms, such as OPTICS, may exhibit limited scalability partially because of the expensive computation of neighborhood. Other approaches, such as DeLi-Clu (Achtert et al. 2006), use indexing techniques to speed up the neighborhood discovery process. As discussed in the related work, spatial indexing approaches are not tailored towards the scenario of OPTICS and require fast k-nearest neighbor retrieval but also distance computation between (distant) points of different clusters.

Our approach capitalizes on a random-projection methodology to create a partitioning of the space from which the neighborhood is produced. We explain this step in Sect. 4. The intuition is that if an object resides in the neighborhood of another object across multiple projections, then it belongs to that object’s neighborhood. The majority of random-projection methodologies project high-dimensional data into a lower dimensionality d that does not depend on the original dimensionality, but is logarithmic to the dataset size (Johnson and Lindenstrauss 1984). In contrast, we run computations directly on multiple one-dimensional projections. This allows us to work on a very reduced space, in which operations, such as neighborhood construction, can be executed very efficiently. Our neighborhood construction is fast because it only requires linear time in the number of points for each projection. Note that a naive scheme looking at pairs of neighboring points would require quadratic time.

After the candidate neighboring points of each object are computed, see Sect. 5, the local density is estimated. This is described in Sect. 6. We prove that the local density computed using our approach is an O(1)-approximation of the core density calculated by the algorithm used in OPTICS given weak restrictions on the neighborhood size (depending on the distance). This essentially allows us to compute reachability plots equivalent to those of OPTICS, but at a substantially lower cost. Finally, in Sect. 8, we show empirically that our approach is significantly faster than existing density-based clustering algorithms.

This work represents an extension of Schneider and Vlachos (2013). We augment our previous work by formally stating the proofs for the theorems presented and including additional comparisons with existing density-based clustering techniques. We also make the source-code for our approach available in the public domain.

Naturally, our approach and the related proofs are focused on Euclidean distances, because random projections conserve the Euclidean distance (Table 1).

3.1 Preliminaries

We are given a set of N points $\mathcal {P}$ in the d-dimensional Euclidean space, i.e., for a point $P \in \mathcal {P}$ it holds $P \in {\mathbb {R}}^d$. We use the term whp, i.e., with high probability, to denote probability $1-1/N^c$ for an arbitrarily large constant c. The constant c (generally) also occurs as a factor hidden in the big O-notation. We often use the following Chernoff bound:

Table 1 Notation and constants used in the paper

Full size table

Theorem 1

The probability that the number X of occurred independent events $X_i \in \{0,1\}$, i.e., $X:=\sum _i X_i$, is not in $[(1-c_0){\mathbb {E}}[X],(1+c_1){\mathbb {E}}[X]]$ with $c_0\in ]0,1]$ and $c_1>0$ can be bounded by

$$\begin{aligned} p(X \le (1-c_0){\mathbb {E}}[X] \vee X \ge (1+c_1){\mathbb {E}}[X]) < 2e^{-{\mathbb {E}}[X]\cdot \min (c_0,c_1)^2/3} \end{aligned}$$

If an event occurs whp for a point (or edge) it occurs for all whp. This can be proved using Boole’s inequality (or, alternatively, consider Schneider and Wattenhofer 2011).

Theorem 2

For $n^{c_2}$ (dependent) events $E_i$ with $i \in [0,n^{c_2}-1]$ and constant $c_2$ such that each event $E_i$ occurs with probability $p(E_i)\ge 1- 1/n^{c_3}$ for $c_3 > c_2+2$, the probability that all events occur is at least $1-1/n^{c_3-c_2-2}$.

4 Pre-process: data partitioning

Our density-based clustering algorithm consists of two phases: the first partitions the data so that close points are placed in the same partition. The second uses these partitions to compute distances or densities only within pairs of the same partition. This enables much faster execution.

The partitioning phase splits the dataset into smaller sets (Partition algorithm). We perform multiple of these partitions by using different random projections (MultiPartition algorithm). Intuitively, if the projections $P\cdot L$ and $Q\cdot L$ of two points P, Q onto line L are of similar value then the points should be close. Thus, they are likely kept together whenever the points are divided. The process is illustrated in Fig. 1. Since projections are costly to compute, we use the same random lines $\mathcal {L}$ and the computed projections of points onto these lines across multiple partitionings.

For a single partition, we start with the entire point set. We split it recursively into two parts until the size of the point set is at most $\textit{minSize}+1$, where $\textit{minSize}$ is a parameter of the algorithm. To split the points, the projected values of points onto a random line are used, ie. a projected value of one of the projected points is chosen uniformly at random. All points with a projected value smaller than that of the point chosen constitute one part and the remainder the other part. In principle, one could also split based on distance, i.e., pick a point randomly on the projection line that lies between the projected point of minimum and maximum value. However, this might create sets that only contain points of one cluster. This yields infinite distances between clusters, because no distance will be computed for points stemming from different clusters. For example, if there are three very dense clusters on one line, then using a distance-based splitting criterion will give the following: The first random projection will likely yield one set containing all points of one cluster and one set containing all points of the other two clusters. It is probable that these two clusters being in one set are split into two separate sets in the second projection. From then on, all further partitionings are within the same cluster. Thus, all clusters are assumed to have infinite distances from each other, although all clusters are on the same line and might have rather different distances to each other. Using our splitting criterion for this scenario yields that (most likely) some pair of points from different clusters will be considered.

More formally, the MultiPartition algorithm chooses a sequence $\mathcal {L}:=(L_0,L_1,\ldots )$ of $c_L \log N$ random lines. It projects the points on each random line $L_i$ in the sequence $\mathcal {L}$ giving a set $L_{i,\mathcal {P}}$ of projected values for each line $L_i$. The sequence of all these sets of projected values is denoted by $\mathfrak {P}:=(L_{0,\mathcal {P}},L_{1,\mathcal {P}},\ldots )$. First, the points ${\mathcal {S}}$ are split into two disjoint sets ${\mathcal {S}}^0_0 \subseteq {P}$ and ${\mathcal {S}}^0_1$ using the value $r_s:=L_0\cdot A$ of a randomly chosen point $A \in \mathcal {S}$. The set ${\mathcal {S}}^0_0$ contains all points ${P} \in {\mathcal {S}}$ with smaller projected value than the number $r_s$ chosen, i.e., ${Q}\cdot L_0\le r_s$, and the other points ${\mathcal {P}}\setminus {\mathcal {S}}^0_0$ end up in ${\mathcal {S}}^0_1$. Afterwards, recurse on sets ${\mathcal {S}}^0_0$ and ${\mathcal {S}}^0_1$, that is, for line $L_1$ we first consider set ${\mathcal {S}}^0_0$ and split it into sets ${\mathcal {S}}^1_0$ and ${\mathcal {S}}^1_1$. Then, a similar process is used on ${\mathcal {S}}^0_1$ to obtain sets ${\mathcal {S}}^1_2$ and ${\mathcal {S}}^1_3$. For line $L_2$, we consider all four sets ${\mathcal {S}}^1_0, {\mathcal {S}}^1_1, {\mathcal {S}}^1_2$ and ${\mathcal {S}}^1_3$. The recursion ends once a set $\mathcal {S}$ contains fewer than $\textit{minSize}+1$ points. We compute the union of all sets of points resulting from any partitioning for any of the projection sets $\mathcal {L} \in \mathfrak {P}$. Techniques equivalent to algorithm Partition have been used in the RP-tree (Dasgupta and Freund 2008).

Theorem 3

For a d-dimensional dataset, algorithm Partition runs in $O(N \log N)$ time whp.

Essentially, the theorem says that we need $O(\log N)$ projections of all points. If we were to split a set of N points into two sets of equal size N / 2 then it is clear that $\log N$ projections would be sufficient, because after that many splits the resulting sets are only of size 1, i.e., $N/2/2/2\ldots = N/2^{\log N} = 1$. Therefore, the proof deals mainly with showing that this also holds when splitting points are chosen randomly.

Proof

The number of random lines required until a point P is in a set of size smaller than $\textit{minSize}+1$ is bounded as follows: In each recursion, the given set $\mathcal {S}$ is split into two sets ${\mathcal {S}}_0,{\mathcal {S}}_1$. By $p(E_{|{\mathcal {S}}|/4})$ we denote the probability of event $E_{|{\mathcal {S}}|/4}:= \min (|{\mathcal {S}}_0|,|{\mathcal {S}}_1|)\ge |{\mathcal {S}}|/4$ that the size of both sets is at least 1 / 4 than that of the total set. As the splitting point is chosen uniformly at random, we have $p(E_{|{\mathcal {S}}|/4})=1/2$. Put differently, the probability that a point P is in a set of size at most 3 / 4 of the overall size $|{\mathcal {S}}|$ is at least 1 / 2 for each random line L. When projecting onto $|\mathcal {L}|=c_L\cdot \log N$ lines, we expect $E_{|{\mathcal {S}}|/4}$ to occur $c_L\cdot \log N/2$ times. Using Theorem 1, the probability that there are fewer than $c_L\cdot \log N/4$ occurrences is

$$\begin{aligned} e^{-c_L\cdot \log N/48}=1/N^{c_L/48} . \end{aligned}$$

For a suitable constant $c_L$, we have

$$\begin{aligned} N\cdot (3/4)^{c_L\cdot \log N/4} <1 . \end{aligned}$$

Therefore, the number of recursions until point P is in a set ${\mathcal {S}}$ of size less than $\textit{minSize}+1$ is at most $c_L\cdot \log N$ whp. Using Theorem 2 this holds for all N points whp. Thus, the time to compute $|\mathcal {L}|=c_L\cdot \log N$ projections is $O(N \log N)$ whp. $\square $

Theorem 4

Algorithm MultiPartition runs in $O((d+\log N)N\log N)$ time whp.

Proof

For each random line $L_j \in \mathcal {L}$, all N points from the d-dimensional space are projected onto the random line $L_j$, which takes time O(dN). We compute $c_L\log N$ projections. Additionally, Algorithm MultiPartition calls Algorithm Partition $c_p (\log N)$ times thus using Theorem 2 concludes the proof. $\square $

5 Neighborhood

Using the data partitioning described above, we compute, for each point, a neighborhood consisting of nearby points and an estimate of density. Each set resulting from the data partitioning consists of nearby points. Thus, potentially, all points in a set are neighbors of each other. However, looking at a set as a clique of points results in an excessive computation and memory overhead, because the distances for all pairs of neighboring points must be computed.

OPTICS (see Sect. 7.1) uses the idea of core points. If a core point has sufficiently large density then the core point and all its closest neighbors NC form a cluster, irrespective of the neighborhood of the points NC near the core point. This motivates the idea to pick only a single point per set, call it center, and add the other points of the set to the neighborhood of the center (and the center to the neighborhood of all points). If the center is dense enough, it and its neighbors are in the same cluster. Another motivation to pick a single point and add all points to its neighborhood is that this gives a connected component with the minimum number of edges (see Fig. 2). More precisely, the single point picked from a set S of points has $|S|-1$ edges. Picking $|S|-1$ edges randomly reduces the probability that the graph is connected, e.g., picking edges randomly results in the creation of triangles of nearby nodes.

To reduce run-time, one may consider to evaluate all pairwise distances only for a single random projection and (potentially) perform fewer random projections overall. Although this seems feasible, further reducing the number of projections to asymptotically below $\log n$ (i.e., $o(\log n)$) poses a high risk of obtaining inaccurate results, because a single random projection only preserves distances in expectation and, therefore, a minimum number of projections is necessary to obtain stable and accurate neighborhoods. More precisely, using only a few random projections likely creates neighborhoods that consist of points that are actually far from each other, i.e., that should not be considered neighbors, and points that are not in the same neighborhood although they are close to each other.

To summarize the neighborhood creation process: A sequence $\mathcal {S} \in \mathfrak {S}$ is an ordering of points projected onto a random line (see Fig. 1). For each sequence $\mathcal {S} \in \mathfrak {S}$, we pick a random point, i.e., a center point $P_{Center}$. For this point, we add all other points $\mathcal {S}\setminus P_{Center}$ to its neighborhood $\mathcal {N}(P_{Center})$. The center $P_{Center}$ is added to the neighborhood $\mathcal {N}(P)$ of all points $P \in \mathcal {S}\setminus P_{Center}$. The pseudocode is given in Algorithm 3.

The next theorem elaborates on the size of the neighborhood created. In the current algorithm, we only ensure that the size of the neighborhood is at least ${\varOmega }(\min (\textit{minSize},(\log N)))$. Thus, for a large parameter $\textit{minSize}$, i.e., $\textit{minSize} \gg \log N$, the size of the neighborhood might be smaller than $\textit{minSize}$. In this situation, the neighborhood would be a sample of size roughly $\log N$ of close points. To get a larger neighborhood, it is possible to pick more than one center per set in Algorithm 3.

Theorem 5

For the size $|\mathcal {N}(A)|$ of a neighborhood $\mathcal {N}(A)$ for every point A holds $|\mathcal {N}(A)| \in {\varOmega }(\min (\textit{minSize},(\log N)))$ whp and $|\mathcal {N}(A)| \in O(\log N\cdot \textit{minSize})$.

Proof

The size of the neighborhood of a point A can be bounded by keeping in mind that the entire point set is split $c_p(\log N)$ times into sets of size at most $\textit{minSize}$. For each final set of size at most $\textit{minSize}$ a point may receive $\textit{minSize}-1$ new neighbors. This yields the upper bound. A point A gets at least one neighbor for the first set. From then on, for every final set that is the result of the partitioning process, a new point might either be added to the neighborhood or a point chosen might already be in the neighborhood. Algorithm MultiPartition performs $c_p(\log N)$ calls to algorithm Partition. For each call, we obtain a smallest set $\mathcal {S}_A$ containing A. Define $\mathfrak {S}_A\subset \mathfrak {S}$ to be the union of all sets $A \in \mathcal {S}_A \in \mathfrak {S}$ containing A. Before the last split of a set $\mathcal {S}_A$ resulting in the sets $\mathcal {S}_{1,A}$ and $\mathcal {S}_2$, the set $\mathcal {S}$ must be of size at least $c_m\cdot \textit{minSize}$; the probability that splitting it at a random point results in a set $\mathcal {S}_A$ with $|\mathcal {S}_A|<c_m/2\cdot \textit{minSize}$ is at most 1/2. Thus, using a Chernoff bound 1, at least $c_p/8\log N$ sets $\mathcal {S}_A \in \mathfrak {S}_A$ are of size at least $c_m/2\cdot \textit{minSize}$ whp. Assume that the current number of distinct neighbors $\mathcal {N}(A)$ for A is smaller than $\min (c_m/4\cdot \textit{minSize},c_p/16 (\log N))$. Then, for each of the $c_p/8(\log N)$ final sets $\mathcal {S}_A$ the probability that a new point is added to $\mathcal {N}(A)$ is at least 1/2. (Note that we cannot guarantee that final sets resulting from the partitioning are different.) Thus, we expect that at least $\min (c_m/4\cdot \textit{minSize},c_p/16(\log N))$ points are added (given $|\mathcal {N}(A)|<c_m/4\cdot \textit{minSize}$). The probability that we deviate by more than 1/2 of the expectation is $1/N^{c_p/96}$ using Theorem 1 for point A and $1/N^{c_p/96-2}$ for all points using Theorem 2. Therefore, for every neighborhood, it is at least $\min (c_m/8\cdot \textit{minSize},c_p/32(\log N))$ whp. $\square $

The time complexity is dominated by the time it takes to compute the partitioning of points (see Theorem 4). Once we have the partitioning, the time is linear in the number of points per set: For example, hash tables requiring O(1) for inserts and finds (to check whether a neighbor is already stored) can be used to implement the neighborhood construction in lines 4 and 6 in Algorithm 3.

Corollary 1

The neighborhood of all points $A \in \mathcal {P}$ can be computed in time $O((d+\log N )N\log N)$.

The neighborhood may not contain some close points, but rather some more distant points as shown in Fig. 3. We discuss the details and guarantees in Sect. 7.3.

6 Density estimate

To compute the density at point A one needs to measure the volume containing a fixed amount of points. This volume is roughly $r^d$, where d is the dimension and r is the radius of a ball that is required to include a fixed number of points $\textit{minPts}$. More precisely, the radius r is the distance to the $\textit{minPts}$-th point. The density is then a function of $1/r^d$.

Density-based clustering algorithms, such as OPTICS, cluster points based on distances r rather than densities $1/r^d$. Reasons for this are:

It is computationally faster.
It avoids a potential division by zero.
For large d, even small changes in r would yield large differences in density. Therefore, a transformation would be required when visualizing densities of points.

A key question is how the number of points $\textit{minPts}$ used to compute the volume relates to the minimum size threshold $\textit{minSize}$ such that a set is split in Algorithm MultiPartition.

To compute a density estimate for a point we should have its $\textit{minPts}$-nearest neighbors. Therefore, for a single partitioning splitting a set if it is at least of size $\textit{minSize} \ge \textit{minPts}+2$ seems a natural lower bound. For a set of size $\textit{minPts}+2$ at least one point is removed by a split leaving $\textit{minPts}+1$ points in a final set. For such a set there could be one or more points such that it contains the $\textit{minPts}$ closest neighbors. When performing multiple partitionings, less points might suffice, since each final set for a partitioning is essentially a random set of generally nearby points. In other words, the union of the sets from different partitionings (each of size less than $\textit{minPts}$) might still contain all $\textit{minPts}$ nearest neighbors. Thus, there is a relationship between the number of partitionings and the minimum size $\textit{minSize}$ a set gets split to get the nearest neighbors. We quantify it in the analysis. In the practical evaluation we used for simplicity $\textit{minSize} = minPts$.

Algorithm 4 states the density and neighborhood computation. For the theoretical analysis of Algorithm 4 (given later), we fix $\textit{minSize}$ to $c_m\cdot \textit{minPts}$ for some constant $c_m$ determined in the analysis.

6.1 Density estimate

The original OPTICS algorithm measures density of a point as the inverse to its $\textit{minPts}$-nearest neighbors. The density is indifferent to the distribution of the $\textit{minPts}-1$ closest points, as well as all points that are further away than the $\textit{minPts}$ point from A. In some cases this might yield unnatural density estimates, as shown in Fig. 4 because of the high sensitivity on the number of fixed points $\textit{minPts}$. For all three cases points A, B have circles of equal radius (and, thus, equal density) in Fig. 4. For the distribution of points on the left, this seems plausible. In the middle distribution, A should be of larger density, because all points except one are very near to A, thus changing $\textit{minPts}$ by one has a large impact. On the right-hand side, A also appears denser because it has many points that are just at marginally larger distance than the $\textit{minPts}$ closest point.

Therefore, it seems a reasonable alternative to consider the distances of several points. We compute an average involving several points, e.g., using the $(1-f)\cdot \textit{minPts}$ to $(1+f)\cdot \textit{minPts}$ closest points for constant $f\in [0,1]$ or a sample of nearby points. This yields a less sensitive density estimate. However, we do not see our estimate as superior in capturing the meaning of density, but see it as another option to estimate density. Before discussing its property let us formally define the density of a point and the average distance Davg(A) of a point A, which depends on the neighbors $\mathcal {N}(A)$.

Definition 1

The set of neighbors $\mathcal {N}_{f}(A)$ is a subset of neighbors $\mathcal {N}(A)$ given by all points with distances between the $(1-f)\textit{minPts}$-closest point $C_0$ and the $(1+f)\textit{minPts}$-closest point $C_1$ in $\mathcal {N}(A)$ for constant $f\in [0,1]$.

$$\begin{aligned} \mathcal {N}_{f}(A):=\{B \in \mathcal {N}(A)| D(A,C_0) \le D(A,B)\le D(A,C_1)\} \end{aligned}$$

Note, for f close to 1, we use A as $C_0$, i.e., the 0-th nearest neighbor of A is A itself.

Definition 2

The average distance Davg(A) of a point A is the average of the distances from A to each point $P \in \mathcal {N}_{f}(A)$:

$$\begin{aligned} Davg(A):= \sum _{B \in \mathcal {N}_{f}(A)} D(A,B)/|\mathcal {N}_{f}(A)| \end{aligned}$$

Definition 3

The density at a point A is the inverse of the average distance, i.e., 1 / Davg(A).

To give some more intuition assume $f=1$, ie. we average over the $2\cdot \textit{minPts}$ closest points. Furthermore, assume a uniform distribution of points in d dimensional space $[0,1]^d$. In one dimension $d=1$ the $\textit{minPts}$ nearest neighbor of a point A is in expectation at distance $\textit{minPts}/N/2$, since two points are separated in expectation by about 1 / N and we have points on either side of A. Averaging over the $2\cdot \textit{minPts}$ nearest neighbors yields double the expectation. For very large d and a point A the distance Davg(A) is decreasing relative to the k-th nearest neighbor, since the number of points, ie. volume, grows polynomial in d with distance from A. For illustration, within a distance r we expect points proportional to $r^d$ within distance 2r we expect points proportional to $2^d\cdot r^d$. However, even this case our average distance Davg(A) is at most a factor two smaller than the distance to the $\textit{minPts}$-nearest neighbor. (We discuss the more complex upper bound for Davg(A) in Theorem 12).

Algorithm 4 states the density and neighborhood computation for $f=1$. Note, in principle it is possible (though as we shall discuss not very likely) that a point A has less than $(1+ f)\textit{minPts}$ (or even $(1- f)\textit{minPts}$) neighbors $\mathcal {N}(A)$, eg. due to an unfortunate splitting of the point set. In this case, we can only compute the distance to the $|\mathcal {N}(A)|$-closest neighbor in $\mathcal {N}(A)$ rather than the $(1\pm f)\textit{minPts}$-nearest one.

7 Density-based clustering using reachability

We apply our ideas to speed up the computation of an ordering of points, i.e., OPTICS (Ankerst et al. 1999).

7.1 OPTICS

Ordering points to identify the clustering structure (OPTICS) (Ankerst et al. 1999) defines a sequence of all points and a distance for each point. This enables an easy visualization to identify clusters. Similarity between two points A, B is measured by computing a reachability distance. This distance is the maximum of the Euclidean distance between A and B and the core distance (or density around a point), i.e., the distance of A to the $\textit{minPts}$-th points, where $\textit{minPts}$ corresponds to the minimum size of a cluster. Thus, any point A is equally close (or equally dense) to B if A is among the $\textit{minPts}$-nearest neighbors of B. If this is not the case then the distance between the two points matters. The algorithm comes with a parameter $\epsilon $ that impacts performance. Parameter $\epsilon $ states the maximum distance for which we look for the $\textit{minPts}$-closest neighbors. Using $\epsilon $ equal to the maximum distance of a point therefore requires the computation of all pair-wise distances without the use of sophisticated data structures. Choosing $\epsilon $ very small may not cluster any points, as the neighborhood of any point is empty, i.e., all points have zero density. A core point is a point that contains at least $\textit{minPts}$ points within distance $\epsilon $.

The algorithm maintains a list of point pairs sorted by their reachability distance. It chooses a point A with minimum reachability distance (if the list is non-empty, otherwise it chooses an arbitrary point and uses “undefined” as reachability distance. It marks the point as processed and updates the list of point-wise distances by computing the reachability distance from A to each neighbor. It updates or inserts pairs of points (with their corresponding distance) consisting of A and the neighbors of A if a pair of points has not already been processed or if the newly computed distance is smaller.

7.2 SOPTICS: speedy OPTICS

Our algorithm for density-based clustering, SOPTICS, introduces a fast version of OPTICS which exploits the pre-processing elaborated previously to discover the neighborhood of each point.^{Footnote 1} The processing of points is the same as for OPTICS, aside from the neighborhood computation as shown in Algorithm 5 (line 4). A key difference is that we do not need a parameter $\epsilon $ as in OPTICS.

Algorithm 5 provides pseudocode for SOPTICS. Essentially, we maintain an updatable heap, in which each entry consists of a reachability distance of a point A and the point A itself. The heap is sorted by reachability distance. For initialization an arbitrary point is put on the heap with undefined reachability distance. Afterwards, repeatedly a point (with shortest reachability distance) is polled from the heap and marked as processed before the reachability distance of all its non-processed neighbors is computed and either inserted into the heap or an existing entry for that point is updated.

We have discussed the neighborhood computation in Sect. 5. Thus, let us now discuss in more detail how we deal with parameter $\epsilon $. OPTICS requires to set a parameter $\epsilon $ that balances performance and accuracy. A small parameter results in the core distance being undefined for many points. Therefore, the clustering result would not be very meaningful. For OPTICS, to give exact results according to the core-distance definition $\epsilon $ must be at least the maximum distance to the $\textit{minPts}$-nearest neighbor. Such an approach represents a good compromise between performance and accuracy. However, this can result in drastic performance penalties in the case of uneven point densities. Let’s see it with an example: Assume that there is a dense area of points of diameter 10, for which $\epsilon =1$ would suffice for optimal accuracy and a significantly less dense area, which requires $\epsilon = 10$. Choosing $\epsilon =10$ means that OPTICS computes all pairwise distances of points within the dense cluster, whereas for $\epsilon = 1$ it might compute only a small fraction of all pairwise distances. Therefore, it would be even better to define $\epsilon $ depending on a point A, i.e., $\epsilon (A)$ is the distance to the $\textit{minPts}$-nearest neighbor. Using our random projection-based approach, we do not define $\epsilon $ directly, but we set $\textit{minPts}$, determining the size of a set of points that is used for the computation of the core distance. Intuitively, for each point, we would like to know its $\textit{minPts}$-closest neighbor. Assuming a set computed by our random projections indeed contains nearest neighbors, we have $\textit{minSize} \approx \textit{minPts}$ (see discussion in Sect. 6). Specifying the number of points per set presents a more intuitive approach than using a fixed distance for all points, because it can be set to a fixed value for all points to yield maximal performance, while maintaining the best possible accuracy.

Theorem 6

Algorithm SOPTICS runs in $O((d+\log N)N\log N)$ time whp. It requires O($N(d+\log N\cdot \textit{minSize})$) memory.

Proof

Computing all neighbors requires $O((d+\log N)N\log N)$ time whp according to Theorem 1. The average number of neighbors is at most $\log N$ per point. The size of the heap is at most N. For each point, we consider each of the most $\log N$ neighbors at most once. Thus, we perform $O(N\log N)$ heap operations and also compute the same number of distances. This takes time $O(N\log N\cdot (d+\log N))$. Storing all d-dimensional points in memory requires $N\cdot d$. Storing the neighborhoods of all points requires O($N\log N\cdot \textit{minSize}$). $\square $

In Fig. 5 we provide a visual illustration of the reachability plots computed on one dataset for OPTICS and SOPTICS. It is apparent that both techniques reveal the same cluster structure. The following definitions are analogous as for OPTICS, but we use the average distance Davg(A) rather than the $\textit{minPts}$-nearest neighbor distance. The original definition of core distance in OPTICS is undefined if there are less than $\textit{minPts}$ neighbors within some radius $\epsilon $. Otherwise the core distance is the distance to the $\textit{minPts}$-th nearest neighbor. In contrast, our core distance is always defined.

Definition 4

The core distance of a point A equals the average distance Davg(A).

Two points are reachable if they are neighbors, i.e., one of the two points must be in the neighborhood of the other.

The definition of the reachability distance for a point A and a reachable point B from A is the same as for OPTICS. However, for a point A, we only compute the reachability distance to all neighbors $B \in \mathcal {N}(A)$.

Definition 5

The reachability distance Dreach(A, B) is the maximum of the core distance of A and the distance of A and B, i.e., $Dreach(A,B):=\max (Davg(A),D(A,B))$.

Note that the reachability distance is non-symmetric, i.e., in general $Dreach(A,B) \ne Dreach(B,A)$.

7.3 Theoretical analysis

Now we state our main theorems regarding the complexity of the techniques presented. Our algorithm strongly relies on the well-known Johnson–Lindenstrauss Lemma, which states that, if two points are projected on a randomly chosen line, the distance of the projected points on the line corresponds to the scaled distance of the non-projected points, in expectation. Higher-dimensional spaces can in general not be embedded in one dimension without distortion, so the above only holds in expectation. The expected scaling factor of the distance between two points in the original (high) dimensional space and in the one dimensional projected space (on the random line) is the same for all points. It is proportional to $1/\sqrt{d}$, i.e., one over the squared root of the original data dimensionality.

We state and prove theorems that show that we retrieve at least some close neighbors for every point, but not necessarily all nearest neighbors. We state a bound for the $\textit{minPts}$-nearest neighbor distance. This allows us to relate the core distances of OPTICS and SOPTICS. It also helps to get a bound for our density estimate Davg(A) for a point in ${\mathbb {R}}^d$, so we relate Davg(A) and the average of the distance to the $\textit{minPts}$-nearest neighbors of a point A.

Specifically, in Theorem 8, we prove that only a small fraction of points have a projected length that is much longer or shorter than its expectation for a randomly chosen line. This enables us to bound the probability that a projection and a splitting-up of points will keep close points together and separate distant points as shown in Theorem 9. Therefore, after a sequence of partitionings, we can pick a point randomly for each set containing A to include in the neighborhood $\mathcal {N}(A)$ such that at least some of the points $\mathcal {N}(A)$ are close to A (Theorem 10) given that there are some more points near A than just $\textit{minPts}$. The latter condition stems from the fact that we split a set by the number of points in the set and not by distance. We discuss it in more detail before the theorem. Theorem 11 explicitly gives a bound on the distance to the $\textit{minPts}$-nearest neighbor. Finally, in Theorem 12, we relate our computed density measure and the one of OPTICS by showing that they differ at most by a constant factor.

We require a “mild” upper bound on the neighborhood size of a point. The reason being that for a given point, points distant from it are very likely removed compared with very near points when splitting a set. But points that are somewhat near are not removed much more likely than very close points. Thus, if there are too many of them, we need many splits to remove them all and the odds that we remove also a lot of nearby points becomes large.^{Footnote 2}

More mathematically, we require an upper bound on the number of points that are within a certain distance of A. For point A the bound depends on the distance to the nearest neighbor, i.e., r. The number of points should not grow more than some exponential function of r. More precisely, with all details being clarified during this section, consider a multiple of the nearest neighbor distance r, ie. $(f_g)^{3/2+c_s}\cdot (f_d\cdot r)$ for an arbitrary integer $f_g$, a value $f_d>1$ and a small constant $c_s>0$. The number of points within that interval for $f_g\ge 1$ is allowed to be at most $2^{\sqrt{f_g}}\cdot |\mathcal {N}(A,f_d\cdot r)|$ points. We require for point A

$$\begin{aligned} \left| \left\{ B \in \mathcal {S}| D(A,B)\le (f_g)^{3/2+c_s}\cdot f_d\cdot r\right\} \right| \le 2^{\sqrt{f_g}}\cdot |\mathcal {N}(A,f_d\cdot r)| \end{aligned}$$

(1)

Let $\mathcal {S}_A$ be a set of points containing point A, i.e., $A \in \mathcal {S}_A$, being projected onto a random line L. We distinguish between three point sets. They correspond to close points, points that are somewhat close and far away points. Ideally, for a projection we would like to have that for any k the k-nearest points on the projection line correspond to the k-nearest points in the d-dimensional space. However, generally this does not hold, but we prove that most close points are closer to A than most far away points. Therefore when choosing a splitting point uniformly at random, the resulting set containing A is likely to contain almost all close points but only a constant fraction of far away points.

We defined the three point sets ‘close, somewhat close and far away’ points non-disjointly for ease of notation throughout the proofs:

i)
Points close to A, i.e., within radius r, i.e, $\mathcal {S}_A \cap \mathcal {N}(A,r)$, where $\mathcal {N}(A,r)$ are the points within radius r from point A.
ii)
Points distant from A, i.e, $\mathcal {S}_A\setminus \mathcal {N}(A,c_dr)$, for some constant $c_d>80$ (defined later).
iii)
Points $\mathcal {N}(A,c_dr)$ that consist of close and somewhat close points.

For these three sets, we prove in Theorems 7 and 8 that only for a few close points will the distance of their projections onto a random line be much larger than the expectation, quantified by random variable $X^{long}_A$, and that for only few distant points will their projections be much smaller than the expected projection, quantified by $X^{short}_A$.

Let event $E^{long}_A(C)$ be the event that for a randomly chosen line L the projected length $(C-A)\cdot L$ of a close point $C \in \mathcal {N}(A,r)$ is more than a factor $\log (c_d)$ of the expected projected length ${\mathbb {E}}[(C-A)\cdot L]$. Let $X^{long}_A$ be the random variable giving the number of all occurred events $E^{long}_A(C)$ for all points $C\in \mathcal {N}(A,r)$.

Let event $E^{short}_A(C)$ be the event that for a randomly chosen line L the projected length $(C-A)\cdot L$ of a distant point $C\in \mathcal {S}_A\setminus \mathcal {N}(A,c_dr)$ is less than a factor $2\log (c_d)/c_d$ of the expected projected length ${\mathbb {E}}[(C-A)\cdot L]$. Let $X^{short}_A$ be the random variable giving the number of all occurred events $E^{short}_A(C)$ for all points $C\in \mathcal {S}_A\setminus \mathcal {N}(A,c_dr)$.

Theorem 7

For every point C holds $p(E^{short}_A(C))\le 3\log (c_d)/c_d$ and $p(E^{long}_A(C)) \le 2/\log (c_d)e^{-\log (c_d)^2/2}$ for some value $c_d$.

Proof

The probability of event $E^{short}_A(C)$ can be bounded using Lemma 5 (Dasgupta and Freund 2008): $p(E^{short}_A(C))\le 3\log (c_d)/c_d$. Using again Lemma 5 from Dasgupta and Freund (2008) for $E^{long}_A(C)$ we have $p(E^{long}_A(C)) \le 2/\log (c_d)e^{-\log (c_d)^2/2}$.

$\square $

Theorem 8 states that for most points there is not too much deviation from the expectation. More precisely, for most close points (as well as for most distant points) the distances of projected close (as well as distant) points is not much longer (shorter) than the expectation.

Theorem 8

For points $\mathcal {S}_A$ projected onto line L chosen randomly from $\mathcal {L}$ define event $E':= (X^{short}_A< |\mathcal {S}_A\setminus \mathcal {N}(A,c_dr)|/\log ({c_d})) \wedge (X^{long}_A < |\mathcal {S}_A \cap \mathcal {N}(A,r)|/(c_d)^{\log (c_d)/3} )$ We have

$$\begin{aligned} p(E') \ge (1-4\log (c_d)^2/{c_d})^2. \end{aligned}$$

Proof

Can be found in the appendix. $\square $

The proof works in the same fashion for $X^{short}_A$ and $X^{long}_A$. We discuss the main ideas using $X^{short}_A$. The proof computes a bound on the expectation of $X^{short}_A$ by using linearity of expectation to express the expectation of $X^{short}_A$ in terms of the expectation of individual events that are upper-bounded using Theorem 7. To bound the probability that $X^{short}_A$ does not exceed the upper bound of the expectation, we use Markov’s inequality. The proof also needs to deal with the fact that we repeatedly choose lines randomly from a small subset of all random lines, ie. $\mathcal {L}$. Thus, there are dependencies. We show that these are rather weak, since we choose sufficiently many random lines allowing us to give high probability bounds using the Chernoff Bound for dependent events (see Theorem 1).

The next theorem shows that a set resulting from the partitioning is likely to contain some nearby points. The proof starts by looking at a single random projection and assumes that there are only relatively few non-distant points left in the set containing A. It shows that it is likely that distant points from A are removed whenever a set is split, whereas it is unlikely that points near A are removed. Therefore, for a sequence of random projections, we can prove that some nearby points will remain and many more distant points are removed. On the technical side, the proof uses elementary probability theory.

Theorem 9

For each point A, for at least $c_p/16 (\log N)$ sets $\mathcal {S}_A$ resulting from a call to algorithm Partition it holds

$$\begin{aligned} |\mathcal {S}_A \cap \mathcal {N}(A,r)|/|\mathcal {N}(A,r)|>2/{c_{d}} \end{aligned}$$

Proof

Can be found in the appendix. $\square $

The next theorem shows that the computed neighborhood for a point A contains at least some points “near” A. It contains a restriction on the parameter $\textit{minPts}$ that is mainly due to neighborhood construction but could be eliminated by using a larger parameter $\textit{minSize}$. The proof bounds the number of sets resulting from the partitioning that are at least of a certain size using a Chernoff bound. Then we compute the probability that for a point A a new nearby point is chosen as a neighbor.

Theorem 10

The neighborhood $\mathcal {N}_A$ computed by Algorithm 3 for a point A contains at least $2\cdot \textit{minPts}$ points for $\textit{minPts} < c_m \log N$ that are within distance $D_{c_m \cdot \textit{minPts}}(A)$ from A.

Proof

Can be found in the appendix. $\square $

Let us bound the approximation of the $\textit{minPts}$-nearest-neighbor distance when using the neighborhood computed by Algorithm 3.

Theorem 11

Let $D_{\textit{minPts}}(A)$ be the distance to the $\textit{minPts}$-nearest neighbor. For the distance ${\tilde{D}}_{\textit{minPts}}(A)$ to the $\textit{minPts}$-nearest neighbor in the neighborhood $\mathcal {N}(A)$ computed by Algorithm 3 holds $D_{\textit{minPts}}(A) \le {\tilde{D}}_{\textit{minPts}}(A) \le D_{ c_m \textit{minPts}}(A)$ for suitable constant $c_m$ whp.

Proof

The lower bound follows from bounding the minimum size of $\mathcal {N}(A,r)$. Using Theorem 10, 2minPts points are contained within distance $D_{c_m \textit{minPts}}(A)$. The smallest value of ${\tilde{D}}_{\textit{minPts}}(A)$ is reached when $\mathcal {N}(A)$ contains all $\textit{minPts}$-closest points to A, which implies ${\tilde{D}}_{\textit{minPts}}(A) = D_{\textit{minPts}}(A)$. For the upper bound due to Theorem 10, $\mathcal {N}_A$ contains at least 2minPts within distance $D_{c_m \cdot \textit{minPts}}(A)$. Thus, the $\textit{minPts}$-nearest point in $\mathcal {N}_A$ is at most at distance $D_{c_m \cdot \textit{minPts}}(A)$. $\square $

Assume that the k-nearest-neighbor distance is not increasing very rapidly, when increasing the number of points considered, i.e., k. More formally, assume that there exists a sufficiently large constant $c>1$, such that $D_{c\cdot \textit{minPts}}(A) \le c \cdot D_{\textit{minPts}}(A)$. Then, we compute a constant approximation of the nearest-neighbor distance. This condition is generally satisfied if clusters are significantly larger than $\textit{minPts}$. Since in d dimensional space in a cluster of roughly constant density the number of points within radius r increases much faster than linearly with distance, ie. by a factor of $2^d$ when doubling the radius. However, our theoretical analysis also says that we might overestimate the distance to the k-th nearest neighbor in case a (dense) cluster has just $\textit{minPts}$ points surrounded by a large very sparsely populated volume of points.

Next, we relate the core distance of OPTICS (see Sect. 7.1), i.e., the distance to the $\textit{minPts}$-nearest neighbors of a point A, and of SOPTICS, i.e., Davg(A).

Theorem 12

For every point $A \in {\mathbb {R}}^d$, $D_{\textit{minPts}}(A)/2\le Davg(A) \le D_{c_m \textit{minPts}}(A)$ holds for constant $c_m$ and $f=1$ whp.

Proof

To compute Davg(A) with $f=1$, we consider the $(1+f)\cdot \textit{minPts}= 2 minPts $ closest points to A from $\mathcal {N}(A)$. Using Theorem 10, 2minPts points are contained in $\mathcal {N}(A)$ with distance at most $D_{c_m \textit{minPts}}(A)$. This yields $Davg(A)\le D_{c_m \textit{minPts}}(A)$. Thus, the upper bound follows. To compute Davg(A), we average the distance using the $2\cdot \textit{minPts}$-closest points to A. Thus, $\textit{minPts}$ points must have distance at least $D_{\textit{minPts}}(A)$. The other $\textit{minPts}$ points could be at distance (almost) 0 from A. Thus, $Davg(A)\ge \frac{minPts\cdot D_{\textit{minPts}}(A)+ minPts\cdot 0}{2\cdot \textit{minPts}} = D_{\textit{minPts}}(A)/2$. $\square $

Assume that the average distance to the $\textit{minPts}$-nearest neighbor is an equivalently valid density measure compared with the distance of the $\textit{minPts}$-th neighbor used by OPTICS. Typically, the cluster size is significantly larger than $\textit{minPts}$ and the density within clusters is not varying very rapidly when looking at a point and some nearest neighbors. In this case, we compute an O(1) approximation of the density, i.e., core distance, used by OPTICS. This is fulfilled if the distances to the $\textit{minPts}$-th up to the $(c_m \textit{minPts})^{th}$ point do not increase by more than a constant factor compared with the $\textit{minPts}$-closest point. More technically, we require the existence of a (sufficiently large) constant c such that $\forall A \in \mathcal {P}: {D}_{minPts\cdot c}(A) = c\cdot {D}_{\textit{minPts}}(A)$.

8 Empirical evaluation

Here we evaluate the runtime and clustering quality of the proposed random-projection-based technique. The SOPTICS algorithm has been implemented in Java^{Footnote 3} using both density measure, ie. based on classical OPTICS measure and the average of (nearest) neighbour distances. We compare its performance with that of OPTICS with and without LSH index (Datar et al. 2004) and DeLi-Clu, from the Elki Java Framework (Schubert et al. 2015)^{Footnote 4} using version 0.7.0 (2015, November 27). DeLi-Clu represents an improvement of OPTICS that leverages indexing structures (e.g., R*-trees) to improve performance. OPTICS with an LSH index it also a good baseline comparison, because it allows one to support very fast nearest-neighbor queries. All experiments have been conducted on a 2.5GHz Intel^{Footnote 5} CPU with 8GB RAM.

8.1 Datasets

We use a variety of two-dimensional datasets typically used for evaluating density-based algorithms as well as high-dimensional data sets to compare the performance of the algorithms for increasing data dimensionality. A summary of the datasets is given in Table 2. We did not apply any particular preprocessing to the datasets; for all algorithms, we measured the time, once the data set had been read into memory.

Implementation details The source code of SOPTICS is available at the second author’s website,^{Footnote 6} and can also be found in the ELKI Framework, as of version 0.7.0 (Schubert et al. 2015).^{Footnote 7}

Parameter setting OPTICS requires the parameters $\epsilon $ and $\textit{minPts}$. When not using an index, $\epsilon $ is set to infinity, which provides the most accurate results; $\textit{minPts}$ depends on the dataset. When using an LSH index, setting the parameters is non trivial. For parameter $\epsilon $ of OPTICS, we used the smallest $\epsilon $ that is needed to get an accurate result, i.e., the maximum distance to the $\textit{minPts}$-nearest neighbor of a point of a dataset. The LSH index requires three main parameters: number of projections per hash value (k), number of hash tables (l) and the width of the projection (r). For parameters k and l, we performed a grid search using values between 10 and 40. We kept both parameters at a value of 20 because it returned the best results. The width r should be related to the distance of the maximum $\textit{minPts}$-nearest neighbor, i.e., ideally a bin contains the $\textit{minPts}$-nearest neighbor of a point and it should also depend on the dimension, because distances are scaled by the square root of the dimension. Thus, in principle, the maximal distance of any point to the $\textit{minPts}$-nearest neighbor should roughly suffice to get the same results as for OPTICS with $\epsilon = \infty $ (up to some constant factor, i.e., values below 8 did not yield good results; for the “Musk” benchmark we used 64). Starting from this assumption, we tried to find the best possible value (up to a factor of 2). We compared the plots of the outcomes with OPTICS with $\epsilon =\infty $. We recorded the fastest running time for LSH while maintaining reasonable similarity, ie. we allowed somewhat worse similarity than for SOPTICS. DeLi-Clu requires only the $\textit{minPts}$ parameter. SOPTICS uses the same parameter value $\textit{minPts}$ as OPTICS (and DeLi-Clu) and we set $\textit{minSize} = minPts$. We set $f=0.2$. We performed $20\log (Nd)$ partitionings, i.e., calls to algorithm Partition from MultiPartition of the entire dataset for SOPTICS.

Table 2 Characteristics of the datasets used in the experiments (first four columns)

Full size table

8.2 Cluster quality

The original motivation of our work was to provide faster versions of existing density-based techniques while not compromising their accuracy. To compare the cluster results, we use the adjusted Rand index (Hubert and Arabie 1985). It returns a value between 0 and 1, where 1 indicates identical clustering. The adjusted Rand index corrects for the chance grouping of elements. The ordering of points as computed by OPTICS does not yield a clustering. Therefore, we defined a threshold that gives a horizontal line in the ordering plot. Whenever the threshold is exceeded, a new cluster begins. We chose the threshold for OPTICS and SOPTICS to match the actual clusters as well as possible and compared the clusters found by OPTICS and SOPTICS. The results are shown in Table 2 when using the classical definition of core-distance for OPTICS. Results are similar for both the OPTICS core-distance definition and our core-distance definition based on the average distance, but using our average density estimate yields slightly worse results. Note that the similarity metric consistently exceeds the 0.95 value, suggesting that SOPTICS provides clustering results that are indeed very close to those of OPTICS. More importantly, SOPTICS delivers these results significantly faster than both OPTICS and DeLi-Clu, as also shown in Table 2.

Figure 5 provides visual examples of the high similarity of the reachability plots for SOPTICS and OPTICS. We subtracted the mean of both plots. The difference of the means of OPTICS and SOPTICS was below 10% for all data sets. The reason is that the computation of the $\textit{minPts}$-nearest neighbor is not perfectly accurate. Thus, for some points, our approximation might be accurate, for others the computed neighborhood might consist of points not being part of the $\textit{minPts}$-nearest neighborhood. In our computation of the Davg, we used all points in the set resulting from the partitioning of points. Thus, for a point A, Davg is not necessarily computed using its $\textit{minPts}$-nearest neighbors, but potentially might miss some nearest neighbors and incorporate some points further away. The random partitioning induces a larger variance in Davg. In principle, one could filter outliers to reduce the variance, e.g., for a set S resulting from a partitioning, one could discard points that are far from the mean of all points in S. However, as the reachability plot and the extracted clusters matched very well for OPTICS and SOPTICS, we refrained from additionally filtering any outliers.

8.3 Runtime

Table 2 already shows the clear performance advantages of SOPTICS. Not surprisingly, OPTICS without an index is much slower. However, even using the LSH index does generally not yield satisfactory results. Our splitting of the entire point set is based on the number of points within a region. LSH splits the entire point set according to a fixed bin width, i.e., in a distance-based manner. This distance must inevitably be chosen rather large (e.g., close to maximum) among all points to get accurate results. Therefore, bins are generally (much) too large and contain many points, resulting in slow performance. DeLiClu is generally significantly faster than OPTICS, but still much slower than SOPTICS.

In addition to the experiments discussed above, we conduct scalability experiments using synthetically generated datasets according to a Gaussian distribution. Each Gaussian cluster consists of 1000 points. We use more than 120,000 objects and a dimensionality of 10 to evaluate scalability with respect to the number of objects. In a similar manner we generate synthetic datasets having up to 1200 dimensions to assess scalability with regard to dimensionality. The performance comparison between the various density-based techniques is shown in Fig. 6. It suggests a drastic improvement of SOPTICS compared with OPTICS and DeLi-Clu. For 130000 data points SOPTICS is more than 500 times faster than OPTICS and more than 20 times faster than DeLi-Clu. Note that DeLi-Clu uses an R*-tree structure to speed up various operations. Our approach bases its runtime improvements on random projections, thus is simpler to implement and maintain.

Figure 7 highlights the runtime for increasing data dimensionalities. Note that the performance gap between OPTICS and DeLi-Clu diminishes for higher dimensions. In fact, for more than 500 dimensions, OPTICS is faster than DeLi-Clu. This is due to the use of indexing techniques by DeLi-Clu. It is well understood that the performance of space-partitioning indexing structures, like R-trees, diminishes for increasing dimensionalities. The performance improvements of SOPTICS compared with OPTICS range from 47 times (at low dimensions) to 32 times (for high dimensions). A different trend is suggested in the runtime improvement against DeLi-Clu, which ranges from 17 times (at low dimensions) to 38 times (at high dimensions). Using OPTICS with an LSH index improves performance at higher dimensionalities, but the approach is still much slower than SOPTICS. We do not employ excessive parameter tuning for LSH (as we did for our small scale experiments), since the gain due to tuning did not create large differences in outcome. Therefore, when dealing with high-dimensional datasets, it is preferable to use techniques based on random projections.

9 Conclusion

Density-based techniques can provide the building blocks for efficient clustering algorithms. Our work contributes to density-based clustering by presenting SOPTICS, which is a random-projection-based version of the popular OPTICS algorithm. Not only is it orders of magnitude faster than OPTICS, but it also comes with analytical clustering preservation guarantees. In the spirit of reproducibility, we have also made available the source code of our approach.

Notes

SOPTICS presents small differences to the Fast OPTICS (FOPTICS) presented in Schneider and Vlachos (2013).
This condition could be removed for low-dimensional spaces, i.e., assuming d is constant.
Java is a registered trademark of Oracle and/or its affiliates.
https://elki.dbs.ifi.lmu.de/.
Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other product or service names may be trademarks or service marks of other companies.
http://alumni.cs.ucr.edu/~mvlachos/erc/projects/density-based/.
The latest optimizations are not included in version 0.7.0.

References

Achtert E, Böhm C, Kröger P (2006) DeLi-Clu: boosting robustness, completeness, usability, and efficiency of hierarchical clustering by a closest pair ranking. In: Proceedings of the Pacific-Asia conference knowledge discovery and data mining (PAKDD), pp 119–128
Chapter Google Scholar
Andrade G, Ramos G, Madeira D, Sachetto R, Ferreira R, Rocha L (2013) G-DBSCAN: a GPU accelerated algorithm for density-based clustering. Procedia Comput Sci 18:369–378
Article Google Scholar
Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 49–60
Asuncion A, Newman D (2007) UCI machine learning repository. http://archive.ics.uci.edu/ml/datasets.html
Böhm C, Noll R, Plant C, Wackersreuther B (2009) Density-based clustering using graphics processors. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 661–670
Chitta R, Murty MN (2010) Two-level k-means clustering algorithm for k-tau relationship establishment and linear-time classification. Pattern Recognit 43(3):796–804
Article Google Scholar
Dasgupta S, Freund Y (2008) Random projection trees and low dimensional manifolds. In: Proceedings of the symposium on theory of computing (STOC), pp 537–546
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the annual symposium on computational geometry, pp 253–262
de Vries T, Chawla S, Houle ME (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32(1):25–52
Article Google Scholar
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the ACM conference knowledge discovery and data mining (KDD), pp 226–231
Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Discov Data 1(1):341–352
Article Google Scholar
Hinneburg A, Gabriel H-H (2007) Denclue 2.0: fast clustering based on kernel density estimation. In: Advances in intelligent data analysis (IDA), pp 70–80
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the ACM conference knowledge discovery and data mining (KDD), pp 58–65
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Article Google Scholar
Jain AK, Law MHC (2005) Data clustering: a user’s dilemma. In: Proceedings of the pattern recognition and machine intelligence, pp 1–10
Google Scholar
Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz maps into a Hilbert space. Contemp Math 26:189–206
Article Google Scholar
Koyutürk M, Grama A, Ramakrishnan N (2005) Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans Knowl Data Eng 17(4):447–461
Article Google Scholar
Lin C-J (2011) LibSVM datasets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Loh W-K, Yu H (2015) Fast density-based clustering through dataset partition using graphics processing units. Inf Sci 308:94–112
Article Google Scholar
Schneider J, Vlachos M (2013) Fast parameterless density-based clustering via random projections. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 861–866
Schneider J, Vlachos M (2014) On randomly projected hierarchical clustering with guarantees. In: Proceedings of the SIAM international conference on data mining (SDM), pp 407–415
Chapter Google Scholar
Schneider J, Wattenhofer R (2011) Distributed coloring depending on the chromatic number or the neighborhood growth. In: International colloquium structural information and communication complexity (SIROCCO), pp 246–257
Chapter Google Scholar
Schneider J, Bogojeska J, Vlachos M (2014) Solving Linear SVMs with multiple 1D projections. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 221–230
Schubert E, Koos A, Emrich T, Züfle A, Schmid KA, Zimek A (2015) A framework for clustering uncertain data. PVLDB 8(12):1976–1987
Google Scholar
Urruty T, Djeraba C, Simovici DA (2007) Clustering by random projections. In: Industrial conference on data mining, pp 107–119
Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell 24(9):1273–1280
Article Google Scholar
Whang JJ, Sui X, Dhillon IS (2012) Scalable and memory-efficient clustering of large-scale social networks. In: Proceedings of the IEEE conference on data mining (ICDM), pp 705–714
Yu Y, Zhao J, Wang X, Wang Q, Zhang Y (2015) Cludoop: an efficient distributed density-based clustering for big data using hadoop. Int J Distrib Sens Netw 2015:579391. doi:10.1155/2015/579391
Article Google Scholar

Download references

Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant Agreement No. 259569.

Author information

Authors and Affiliations

University of Liechtenstein, Vaduz, Liechtenstein
Johannes Schneider
IBM Research - Zürich, Rüschlikon, Switzerland
Michail Vlachos

Authors

Johannes Schneider
View author publications
You can also search for this author in PubMed Google Scholar
Michail Vlachos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Johannes Schneider.

Additional information

Responsible editor: G. Karypis.

Appendix

Proof of Theorem 8

Assume for now that the random line used for partitioning is chosen uniformly at random. By definition and using linearity of expectation, the expectation of $X^{short}_A$ is ${\mathbb {E}}[X^{short}_A]:= \sum _{C\in \mathcal {S}_A\setminus \mathcal {N}(A,c_dr)} p(E^{short}_A(C))$. Using Theorem 7 to bound $p(E^{short}_A(C))$,

$$\begin{aligned} {\mathbb {E}}[X^{short}_A]\le 3\log (c_d)/c_d|\mathcal {S}_A\setminus \mathcal {N}(A,c_dr)| \end{aligned}$$

The probability that the random variable $X^{short}_A$ exceeds the expectation ${\mathbb {E}}[X^{short}_A]$ by a factor $c_d/\log (c_d)^2$ or more is at most $\log (c_d)^2/c_d$ using Markov’s inequality. Thus, for the probability of event $E_0$ as defined below we have

Analogously, let us bound the probability of event $E^{long}_A(C)$ that a projection of two points C, A results in a distance $L\cdot (C-A)$ much beyond the expectation. Next, we use Theorem 7 to bound $p(E^{long}_A(C))$. By definition the expectation of $X^{long}_A$ is ${\mathbb {E}}[X^{long}_A]:=\sum _{C\in \mathcal {S}_A \cap \mathcal {N}(A,r)} p(E^{long}_A(C))$. Consider the upper bound of ${\mathbb {E}}[X^{long}_A]$ being ${\mathbb {E}}[X^{long}_A]\cdot c_d$, i.e., $c_d/(c_d)^{\log (c_d)/2} |\mathcal {S}_A \cap \mathcal {N}(A,r)|\ge 1/(c_d)^{\log (c_d)/3} |\mathcal {S}_A \cap \mathcal {N}(A,r)|$ (for $c_d>1$). Thus, define the probability of event $E_1$ and bound as before using Markov’s inequality as follows:

$$\begin{aligned} p(E_1):=p\left( X^{long}_A \le |\mathcal {S}_A \cap \mathcal {N}(A,r)|/(c_d)^{\log (c_d)/3}\right) \ge 1-1/c_d. \end{aligned}$$

Assume $E_0$ occurs. This excludes at most a fraction $\log (c_d)^2/{c_d}\in [0,1]$, ie. require $c_d>80$, of all possible projections for event $E_1$, leaving

$$\begin{aligned} (1-1/c_d-\log (c_d)^2/{c_d}) > 1- 2\log (c_d)^2/c_d. \end{aligned}$$

Thus, the probability of $E_1$ given $E_0$ becomes $p(E_1|E_0)=1-2\log (c_d)^2/c_d$.

The probability of event $E':=E_0 \cap E_1$ is

$$\begin{aligned} \begin{aligned} p(E')=p(E_1|E_0)\cdot p(E_0)&\ge (1-2\log (c_d)^2/{c_d})\cdot (1-\log (c_d)^2/{c_d}) \\&\ge (1-2\log (c_d)^2/{c_d})^2. \end{aligned} \end{aligned}$$

Let us deal with dependencies among chosen lines. We choose $c_L\cdot log N$ random lines independently from each other. Define $c_{E'}:=1-p(E')=2\log (c_d)^2/{c_d}$. Then using a Chernoff Bound (see Theorem1), we have that the probability of event $E_f(S_A)$ that the number of lines for which $E'$ does not hold for $\mathcal {S}_A$ exceeds the expectation $c_L\cdot \log N\cdot c_{E'}$ by at most a factor $1+\sqrt{3}/(c_L)^{1/4}$ is $1-1/N^{\sqrt{c_L} c_{E'}}$. Thus, assume that we “reuse” the projections for a total of $N^{c2}$ sets of points across all partitionings, ie. $c2<2$. Therefore, given that the projections have been reused $N^{c2}$ times the probability $p(E_f(S_C))$ for a set $S_C$ can be upper bounded using the bound for dependent events from Theorem 2 $1-1/N^{\sqrt{c_L} c_{E'}-c2}$. Thus, the probability for a bad event increases at most by a factor of $1+\sqrt{3}/(c_L)^{1/4}$ yielding that for $c_L$ sufficiently large.

$$\begin{aligned} p(E')\ge 1-2\log (c_d)^2/{c_d} \cdot ( 1+\sqrt{3}/(c_L)^{1/4}) \le 1-4\log (c_d)^2/{c_d} \end{aligned}$$

$\square $

Proof of Theorem 9

The idea of the proof is to look at a point A and remove “very” far away points until there are only relatively few of them left. Then, we consider somewhat closer points (but still quite far away) and remove them until we are left with only some very close points and some potentially further away points. Consider a partitioning of set $\mathcal {S}_A$ into two sets $\mathcal {S}_0$ and $\mathcal {S}_{1,A}$, i.e., $A \in \mathcal {S}_{1,A}$ using algorithm Partition and random projection line L. Assume that the following condition holds for set $\mathcal {S_A}$: There are many more points “very far away” from A than not so distant points using some factor $f_d\ge c_d$:

$$\begin{aligned} c_r |\mathcal {S}_A \cap \mathcal {N}(A,f_d\cdot r)| \le |\mathcal {S}_A \setminus \mathcal {N}(A,f_d\cdot r)| \end{aligned}$$

(2)

The value $c_r$ is defined later; we require $c_r\ge f_d$. We prove that even in this case after a sequence of splittings of the point set only few very far away points end up in set $\mathcal {S}_{1,A}$. (If there are fewer faraway points than somewhat close points, the probability that many of them end up in the same set is even smaller.) Define event $E_1$ as follows: A splitting point is picked such that for the subset $\mathcal {S}_{1,A}$ most very close points from $\mathcal {N}(A,r) \cap \mathcal {S}_A$ remain, i.e.,

$$\begin{aligned} |\mathcal {S}_{1,A} \cap \mathcal {N}(A,r)| \ge |\mathcal {S}_{A} \cap \mathcal {N}(A,r)| \cdot (1-1/c_r). \end{aligned}$$

The probability of event $E_1$ can be bounded as follows. Assume that $E'$ as defined in Theorem 8 occurs (using $f_d>c_d$ instead of $c_d$), i.e., most distances are scaled roughly by the same factor from a point A to other points. To minimize the probability of $p(E_1|E')$ we assume that all projected distances from faraway points to A are minimized and those of close points are maximized. This means that at most a fraction $1/\log {f_d}$ of all very far away points $\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)$ are below a factor $3\log (f_d)/f_d$ of their expected length and that the distances to all other points in $\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)$ are shortened exactly by that factor. We assume the worst possible scenario, i.e., those far away points are split such that they end up in the same set as A, i.e., they become part of $S_{1,A}$. At most a fraction $1/(f_d)^{\log (f_d)/3}$ of all very close points $\mathcal {S}_{A} \cap \mathcal {N}(A,r)$ are above a factor $\log (f_d)$ of the expectation. We assume that those points behave in the worst possible manner, i.e., the close points exceeding the expectation are split such that they end up in a different set than A, i.e., $S_{0}$ not $S_{1,A}$. Next, we bound the probability that no other points from $\mathcal {S}_{A} \cap \mathcal {N}(A,r)$ are split. If we pick a splitting point among the fraction of $1-1/\log f_d$ points from $\mathcal {S}_{A} \setminus \mathcal {N}(A,f_dr)$ that are not shortened by more than a factor $3\log (f_d)/f_d$, then $p(E_1|E')$ occurs. By initial assumption we have $(1-1/f_d^{\log (f_d)/3})|\mathcal {S}_{A} \cap \mathcal {N}(A,f_d\cdot r)|\le (1-1/\log {f_d})\cdot c_r |\mathcal {S}_A \setminus \mathcal {N}(A,f_dr)|$ and thus, $|\mathcal {S}_{A} \setminus \mathcal {N}(A,f_dr)|/|\mathcal {S}_{A}|\le 2/c_r$ for $1-1/\log f_d > 1/2$, i.e., $f_d$ sufficiently large, and because $|\mathcal {S}_{A}| \ge |\mathcal {S}_A \setminus \mathcal {N}(A,f_dr)|$. Put differently, the probability to pick a bad splitting point is at most $2/c_r$. The occurrence of event $E'$ reduces the probability of $E_1$ at most by $1-p(E')$, i.e., $p(E_1|E')\ge p(E_1)-(1-p(E'))$.

Therefore,

$$\begin{aligned} \begin{aligned} p(E_1)&= p(E')p(E_1|E') \\&= p(E')\cdot (1- |\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)|/|\mathcal {S}_{A}|-(1-p(E'))) \\&\ge p(E')\cdot (1-2/c_r - 4\log (f_d)^2/f_d)) \\&\ge (1-4\log (f_d)^2/f_d)^2\cdot (1-6\log (f_d)^2/\min ({f_d},c_r)) \qquad \text {(Substitution of } p(E')) \\&= (1-6\log (f_d)^2/f_d)^3 \qquad \text {(since by definition } c_r\ge f_d) \end{aligned} \end{aligned}$$

Define event $E_2$ as follows: At least $1/3-1/c_r$ of all far away points $|\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)|$ are not contained in $\mathcal {S}_{1,A}$, i.e.,

$$\begin{aligned} |\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)|\ge 2/3|\mathcal {S}_{1,A} \setminus \mathcal {N}(A,f_dr)|. \end{aligned}$$

The probability that the size of the set resulting from the split $\mathcal {S}_{1,A}$ is at most 2/3 of the original set $\mathcal {S}_{A}$ is 1 / 3, because a splitting point is chosen uniformly at random. When restricting our choice to far away points $\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)$, we can use that owing to Condition (2) at most a fraction $1/c_r$ of all points are not far away. The probability of $E_2$ given $E_1$ can be bounded by assuming that all events, i.e., choices of random lines and splitting points, that are excluded owing to the occurrence of $E_1$ actually would have caused $E_2$. More precisely, we can subtract the probability of the complementary event of $E_1$, i.e., $p(E_2|E_1) = 2/3-1/c_r - (1-p(E_1)) \ge 2/3 - 1/c_r - (1-4\log (c_d)^2/{f_d})^3 \ge 1/4$ for a sufficiently large constant $f_d$. The initial set $\mathcal {S}:=\mathcal {P}$ has to be split at most $c_L\log N$ times until the final set $\mathcal {S}_A$ containing A (which is not split any further) is computed (see proof of Theorem 3). We denote a trial T as up to $\log f_d$ splits of a set $\mathcal {S}$ into two sets. A trial T is successful if after at most $\log f_d$ splits of a set $\mathcal {S}_A$ the final set $\mathcal {S}'_A\subset \mathcal {S}_A$ is of size at most $|\mathcal {S}_A|/2$ and $E_1$ occurred for every split. The probability for a successful trial p(T) is equal to the probability that $E_1$ always occurs and $E_2$ at least once. This gives:

$$\begin{aligned} p(T)= & {} p(E_1)^{\log f_d}\cdot (1-p(E_2|E_1)^{\log f_d}) \nonumber \\\ge & {} (1-6\log (f_d)^2/f_d)^{3\log f_d} \cdot (1-1/4^{\log f_d}) \nonumber \\\ge & {} (1-6\log (f_d)^2/f_d)^{4\log f_d} \end{aligned}$$

(3)

Starting from the entire point set we need $\log (N/minSize)+1$ (consecutive) successful trials until a point A is in a set of size less than $\textit{minSize}$ and the splitting stops. Next we prove that the probability to have that many successful trials is constant given that the required upper bound on the neighborhood holds, i.e., (1). Assume there are $n_i$ points within distance $[i^{3/2+c_s} \cdot c_d\cdot r, (i+1)^{3/2+c_s}\cdot c_d\cdot r]$ for a positive integer i. In particular, note that the statement holds for arbitrarily positioned points. We do not even require them to be fixed across several trials.

The upper bound on the neighborhood growth (1) yields that $n_i\le 2^{i^{1/2}}\cdot |\mathcal {N}(A,c_dr)|$. Furthermore, we have that $\sum _{i=1}^{\infty } n_i \le N$. Next, we analyze how many trials we need to remove points $n_i$ until only the close points $\mathcal {N}(A,c_dr)$ remain. We are going from large i to small i, i.e., remove distant points first. For each $n_i$ we need at most $\log n_i - \log |\mathcal {N}(A,c_dr)| \le i^{1/2}$ successes. Let $E_{n_{i}}$ be the event that this happens, i.e., that we have that many consecutive successes.

$$\begin{aligned} p(E_{n_{i}}):= & {} \prod _{j=1}^{\log n_i - \log |\mathcal {N}(A,c_dr)|} p(T) \nonumber \\= & {} \prod _{j=1}^{\log n_i - \log |\mathcal {N}(A,c_dr)|} \left( 1-6\log (x)^2/(x)\right) ^{4\log x} \quad \text {(Defining } x:= c_d\cdot i^{3/2+c_s} ) \nonumber \\= & {} \prod _{j=1}^{\sqrt{i}} \left( 1-1/2^{\log (x)- 2\log (6\log x)}\right) ^{4\log (x)} \nonumber \\= & {} 2^{4 \sqrt{i} \log (x)\log \left( 1-1/2^{\log (x)-2\log (6\log x)}\right) } \nonumber \\= & {} 2^{-4 \sqrt{i} \log (x)\cdot 1/2^{\log (x)-2\log (6\log x)}}\qquad \text {(Using }\log (1-x)\le -x)\nonumber \\\ge & {} 2^{ -4 \sqrt{i}\cdot \log (x)\cdot \log (x)^4 /x} \qquad \text {(Using } 2^{2\log (6\log x)}= (6\log (x))^2, 2^{\log (x)}=x) \nonumber \\= & {} 2^{\frac{-24\sqrt{i}\log (c_d\cdot i^{3/2+c_s})^3}{c_d\cdot i^{3/2+c_s}}} \nonumber \\\ge & {} 2^{-\frac{1}{c_d\cdot i}} \text {(for } c_s \text { and } c_d \text { sufficiently large)} \end{aligned}$$

(4)

As the number of points N is finite, the number of $n_i>0$ is also finite. Let $m_A$ be the largest value such that $n_{m_A}>0$. Let $p_A:=p(\wedge _{i\in [1,m_A]} {E_{n_{i}}})$ be the probability that all trials for all $n_i$ in $i \in [1,m_A]$ and $n_i>0$ are successful. Note that the events $E_{n_i}$ are not independent for a fixed point set $\mathcal {P}$. However, the bound (4) on $p(E_{n_{i}})$ holds as long as condition 2 is fulfilled, i.e., for an arbitrary point set. Put differently, the bound (4) holds even for the “worst” distribution of points. Therefore, we have that $p_A:=p(\wedge _{i\in [1,m_A]} {E_{n_{i}}})\ge \prod _{i\in [1,m_A]} 2^{-1/(i\cdot c_d)}$ using stochastic domination. Note that our choice of maximizing $n_i$, i.e., the number of required successful trials for $E_{n_i}$ minimizes the probability of a $p(E_{n_i})$. This is quite intuitive, since it says that we should maximize the number of points closest to A that should not be placed in the same set as A (i.e., they are just a bit too far to yield the claimed approximation guarantee). We also need to be aware of the fact that the distribution for the $n_i$ under the constraint that $\sum _{i=1}^{m_A} n_i \le N$ should minimize the bound for $p_A$. It is also apparent from the derivation of (4) that this happens when we maximize $n_i$; the probability for $p_A$ decreases more if we maximize small i. Essentially, this follows from line 2 in (4) because the number of trials $n_T$ is less than $\sqrt{i}$ and each trial is successful with probability of $(1-1/i^{3/2})$ (focusing on dominating terms), yielding an overall success probability of $(1-1/i^{3/2})^{n_T}$ for a trial. Thus, $(1-1/i^{3/2})^{\sqrt{i}}> (1-1/l^{3/2})^{\sqrt{l}}$ for $1<i<l$. Put differently, choosing $n_i$ large for a large i is not a problem for our algorithm, because it is unlikely that these points will be projected in between the nearest points to A.

Therefore, when maximizing the number of points close to A, we have that $m_A=(\log N)^2$, i.e., all $n_i$ for $i>(\log N)^2$ are 0 because $2^{\sqrt{(\log N)^2}}=n_{(\log N)^2}=N$. Additionally, note that we need at most $c_8 \log N$ trials in total. As each trial slices the number of points by 1 / 2, we only need to take into the account the subset $X\in [1,m_A]$ for which the number of points doubles, i.e., $n_j=2\cdot n_{i}$, for $n_i=2^{i^{1/2}}$. This happens whenever $i^{1/2}$ is an integer, i.e., for $i=1,4,9,16,\ldots $, we get $n_i=1,2,3,4,\ldots $. Thus, we only need to look at $i^2 \in [1,m_A]$

$$\begin{aligned} p_A\ge & {} \prod _{i^2\in [1,m_A]} 2^{- 1/(c_d\cdot i)}\\\ge & {} \prod _{i^2\in \left[ 1,\log ^2 N \right] } 2^{-1/(c_d\cdot i)}\\\ge & {} 2^{-1/c_d \sum _{i^2\in \left[ 1,\log ^2 N \right] } 1/i}\\= & {} 2^{-1/c_d \sum _{i\in \left[ 1,\log N\right] } 1/i^2}\\\ge & {} 2^{-2/c_d} \\\ge & {} 1/2^{2/c_d} \end{aligned}$$

Thus, when doing $ c_p (\log N)$ partitionings, we have at least $c_p/16 \log N$ successes for point A whp using Theorem 1 and $c_d\ge 1$. This also holds for all points whp using Theorem 2.

Finally, let us bound the number of nearby points that remain. We need at most $c_L \log N$ (see Theorem 3) projections until a point set will not be split further. Each projection reduces the points $|\mathcal {N}(A,r)|$ at most by factor $1-1/c_r$. We give a bound in two steps, i.e., for $c_r\ge \log ^3 N$ and $c_r \in [ c_d, \log ^3 N]$.

$$\begin{aligned} \prod _{i=1}^{c_L\log N} (1-1/c_r)\ge & {} \left( 1-1/\log ^3 N)^{c_L\log N} \text { (Assuming }c_r\ge \log ^3 N \right) \\\ge & {} 1-1/\log N \end{aligned}$$

To reduce the number of points by a factor of $\log ^3 N$ requires $3\cdot \log \log N $ trials, each reducing the set by a factor 1/2. Thus, trial i is conducted using a factor $c_r = \log ^3 N / 2^i$ of the original points or, equivalently, trial $3\cdot \log \log N-i$ is conducted with $c_r= 2^i$. Thus, in total the fraction of remaining points in $\mathcal {N}(A,r)$ is

$$\begin{aligned} (1-1/\log N) \prod _{i=1}^{3 \log \log N} (1-1/2^i)^{\log c_d}= & {} (1-1/\log N) \cdot \left( \prod _{i=1}^{3 \log \log N} (1-1/2^i)\right) ^{\log c_d}\\= & {} (1-1/\log N) \cdot \left( 2^{\sum _{i=\log c_d}^{3 \log \log N} \log (1-1/2^i)}\right) ^{\log c_d} \\\ge & {} (2^{1-2/c_d})^{\log c_d} \ge 1/(2c_d) \end{aligned}$$

$\square $

Proof of Theorem 10

First we bound the number of neighbors. Using Theorem 9 we obtain $c_p/16 (\log N)$ sets $\mathcal {S}_A$ containing A. Define $\mathfrak {S}_A$ to be the union of all sets $\mathcal {S}_A \in \mathfrak {S}$ containing A. Before the last split of a set $\mathcal {S}_A$ resulting in the sets $\mathcal {S}_{1,A}$ and $\mathcal {S}_2$, the set $\mathcal {S}_A$ must be of size at least $c_m\cdot \textit{minPts}$; the probability that splitting it at a random point results in a set $\mathcal {S}_{1,A}$ with $|\mathcal {S}_A|<c_m/2\cdot \textit{minPts}$ is at most 1/2. Thus, using a Chernoff bound (Theorem 1), at least $c_p/128\log N$ sets $\mathcal {S}_A \in \mathfrak {S}_A$ are of size at least $c_m/2\cdot \textit{minPts}$ whp.

Let $\mathcal {S}_A$ be a set $\mathcal {S}_A$ with size at least $c_m/2\cdot \textit{minPts}$. Consider the process when the neighborhood $\mathcal {N}(A)$ is built by inspecting one set $\mathcal {S}_A$ after the other. Assume that the number of neighbors $|\mathcal {N}(A)|< c_m/2 minPts/(2{c_d})$. Thus, the probability of event $p(\text {Choose new close neighbor } B)=p(B \not \in \mathcal {N}(A) \wedge B \in \mathcal {N}(A,r))$ that a point $B \in \mathcal {S}_A$ but not already in $\mathcal {N}(A)$ is chosen from $\mathcal {N}(A,r) \cap \mathcal {S}_A$ is at least $c_m/(4{c_d})$.

$$\begin{aligned}&p\left( \text {Choose new close neighbor } B| |\mathcal {N}(A)|< c_m/2\cdot \textit{minPts}/(2c_d)\right) \\&\quad := p(B \not \in \mathcal {N}(A) \wedge B \in \mathcal {N}(A,r))) = c_m/(4{c_d}) \end{aligned}$$

As by assumption $\textit{minPts} < c_m \log N$ and there are at least $c_p/128\log N$ sets $\mathcal {S}_A$ with $|\mathcal {S}_A|\ge c_m/2\cdot \textit{minPts}$ and $c_p \ge c_m\cdot 128$, using the Chernoff bound in Theorem 1 we get that there are at least $c_m/(4{c_d}) minPts$ points within distance $D_{c_m \textit{minPts}}(A)$ in $\mathcal {N}(A)$ whp for every point A. Setting $c_m\ge 8 c_d$ completes the proof. $\square $

Proof of Theorem 12

To compute Davg(A) with $f=1$ we consider the $(1+f)\cdot \textit{minPts}= 2 minPts $ closest points to A from $\mathcal {N}(A)$. Using Theorem 10 2minPts points are contained in $\mathcal {N}(A)$ with distance at most $D_{c_m \textit{minPts}}(A)$. This yields $Davg(A)\le D_{c_m \textit{minPts}}(A)$. Thus, the upper bound follows. To compute Davg(A), we average the distance using the $2\cdot \textit{minPts}$-closest points to A. The smallest value of Davg(A) is reached when $\mathcal {N}(A)$ contains all $2\cdot \textit{minPts}$ closest points to A, which implies $Davg(A)\ge D_{2\cdot \textit{minPts}}(A)\ge D_{ \textit{minPts}}(A)$ for any set of neighbors $\mathcal {N}(A)$. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Schneider, J., Vlachos, M. Scalable density-based clustering with quality guarantees using random projections. Data Min Knowl Disc 31, 972–1005 (2017). https://doi.org/10.1007/s10618-017-0498-x

Download citation

Received: 23 June 2015
Accepted: 19 February 2017
Published: 02 March 2017
Issue Date: July 2017
DOI: https://doi.org/10.1007/s10618-017-0498-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Scalable density-based clustering with quality guarantees using random projections

Abstract

Similar content being viewed by others

A New Density Clustering Method Using Mutual Nearest Neighbor

RADDACL2: a recursive approach to discovering density clusters

Density-Based Clustering Based on Hierarchical Density Estimates

Explore related subjects

1 Introduction

2 Background and related work

3 Our approach

3.1 Preliminaries

Theorem 1

Theorem 2

4 Pre-process: data partitioning

Theorem 3

Proof

Theorem 4

Proof

5 Neighborhood

Theorem 5

Proof

Corollary 1

6 Density estimate

6.1 Density estimate

Definition 1

Definition 2

Definition 3

7 Density-based clustering using reachability

7.1 OPTICS

7.2 SOPTICS: speedy OPTICS

Theorem 6

Proof

Definition 4

Definition 5

7.3 Theoretical analysis

Theorem 7

Proof

Theorem 8

Proof

Theorem 9

Proof

Theorem 10

Proof

Theorem 11

Proof

Theorem 12

Proof

8 Empirical evaluation

8.1 Datasets

8.2 Cluster quality

8.3 Runtime

9 Conclusion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix

Appendix

Proof of Theorem 8

Proof of Theorem 9

Proof of Theorem 10

Proof of Theorem 12

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation