
1 Introduction

Datasets of digitalized applications are growing both in size and in number of features (i.e., dimensions), posing challenges to established data mining methods such as clustering, an unsupervised data mining tool based on similarity measures. Density-based spatial clustering of applications with noise (DBSCAN) [11] is a prominent method to cluster (possibly) noisy data into clusters of arbitrary shapes and sizes, without prior knowledge of the number of clusters, using user-defined similarity metrics (i.e., not limited to the Euclidean one). DBSCAN is used in many applications, including LiDAR [25], object detection [20], and GPS route analysis [33]. DBSCAN and some of its variants have also been used to cluster high-dimensional data, e.g., medical images [4], text [32], and audio [10].

The worst-case computational complexity of traditional DBSCAN is quadratic in the input size [26], which is expensive given the characteristics of today’s datasets. Nonetheless, indexing and spatial data structures facilitating proximity searches can reduce DBSCAN’s computational cost, as shown with KD-trees [5], R-trees [15], M-trees [8], and cover trees [6]. Using such structures is suboptimal in at least three cases, though: (i) skewed data distributions degrade their performance [20]; (ii) due to the curse of dimensionality, exact spatial data structures based on deterministic space partitioning become slower than a linear scan [34]; and (iii) each such structure works only for a particular metric (e.g., Euclidean distance). In the literature, the major means for improving time-efficiency are parallelization [2, 3, 13, 24, 33] and approximation [12], studied alone or jointly [20, 21, 33]. However, state-of-the-art methods target the Euclidean distance only and suffer from skewed data distributions and the curse of dimensionality.

Locality-sensitive hashing (LSH) is an established approach to approximate similarity search. It builds on the idea that if two data points are close under a given similarity measure, then an appropriate hash function maps them to equal values with high probability [1, 9, 17]; hence, LSH can support applications that tolerate approximate answers, which are close to the exact ones with high probability. LSH-based indexing has been successful (and shown to be the best known method [17]) for finding similar items in large, high-dimensional datasets. With our contribution, the \(\mathtt {IP.LSH.DBSCAN}\) algorithm, we show how approximate density-based clustering and the construction of an LSH index can be fused to boost parallel data analysis. Our novel fused approach can efficiently cope with high-dimensional data, skewed distributions, large numbers of points, and a wide range of distance functions. We evaluate the algorithm analytically and empirically, showing it complements the landscape of established state-of-the-art methods by offering up to several orders of magnitude speed-up on higher-dimensional datasets, with tunable, high clustering accuracy.

Organization: Section 2 reviews the preliminaries. Section 3 and Sect. 4 describe and analyse the proposed \(\mathtt {IP.LSH.DBSCAN}\). Section 5 covers the empirical evaluation. Related work and conclusions are presented in Sect. 6 and Sect. 7, respectively.

2 Preliminaries

System Model and Problem Description. Let \(\mathtt {D}\) denote an input set of \(\mathtt {N}\) points, each a multi-dimensional vector from a domain \(\mathcal {D}\), with a unique identifier \(\mathtt {ID}\). \(\mathtt {Dist}\) is a distance function applicable on \(\mathcal {D}\)’s elements. The goal is to partition \(\mathtt {D}\) into an a priori unknown number of disjoint clusters, based on \(\mathtt {Dist}\) and parameters \(\mathtt {minPts}\) and \(\epsilon \): \(\mathtt {minPts}\) specifies a lower threshold for the number of neighbors, within radius \(\epsilon \), for points to be clustered together.

We aim for an efficient, scalable parallel solution, trading approximation in the clustering for reduced calculations regarding the density criteria, while targeting high accuracy. Our evaluation metric for efficiency is completion time. Accuracy is measured with respect to an exact baseline using the rand index [31]: given two clusterings of the same dataset, the rand index is the ratio of the number of pairs of elements that are clustered either together or separately in both clusterings to the total number of pairs of elements. Regarding concurrency guarantees, a common consistency goal is that for every parallel execution, there exists a sequential one producing an equivalent result.
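To make the metric concrete, the following is a minimal sketch of a rand index computation over two flat clusterings given as label vectors; the function name and representation are ours, not from the paper.

```cpp
#include <cstdint>
#include <vector>

// Rand index of two clusterings of the same points, given as label vectors:
// the fraction of point pairs on which the clusterings agree, i.e., pairs
// co-clustered in both or separated in both.
double randIndex(const std::vector<int>& a, const std::vector<int>& b) {
  const std::size_t n = a.size();
  std::uint64_t agree = 0, pairs = 0;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = i + 1; j < n; ++j, ++pairs) {
      const bool togetherA = (a[i] == a[j]), togetherB = (b[i] == b[j]);
      if (togetherA == togetherB) ++agree;
    }
  return pairs ? static_cast<double>(agree) / pairs : 1.0;
}
```

This direct computation is quadratic in the number of points; contingency-table formulations compute the same quantity more cheaply, but the sketch above matches the definition verbatim.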

We consider multi-core shared-memory systems executing \(\mathtt {K}\) threads, supporting \(\mathtt {read}\), \(\mathtt {write}\) and read-modify-write atomic operations, e.g., \(\mathtt {CAS}\) (Compare-And-Swap), available in all contemporary general purpose processors.

Locality Sensitive Hashing (LSH)

The following defines the sensitivity of a family of LSH functions [17, 23], i.e., the property that, with high probability, nearby points hash to the same value, and faraway ones hash to different values.

Definition 1

A family of functions \(\mathcal {H} = \{\mathtt {h}: \mathcal {D} \rightarrow \mathtt {U}\}\) is \((\mathtt {d_1}, \mathtt {d_2}, \mathtt {p_1}, \mathtt {p_2})\)-sensitive for distance function \(\mathtt {Dist}\) if for any \(\mathtt {p}\) and \(\mathtt {q}\) in \(\mathcal {D}\) the following conditions hold: (i) if \(\mathtt {Dist}(\mathtt {p}, \mathtt {q}) \le \mathtt {d_1}\), then \(\Pr _{\mathcal {H}}[\mathtt {h}(\mathtt {p}) = \mathtt {h}(\mathtt {q})] \ge \mathtt {p_1}\); (ii) if \(\mathtt {Dist}(\mathtt {p}, \mathtt {q}) \ge \mathtt {d_2}\), then \(\Pr _{\mathcal {H}}[\mathtt {h}(\mathtt {p}) = \mathtt {h}(\mathtt {q})] \le \mathtt {p_2}\). The probabilities are over the random choices in \(\mathcal {H}\).

A family \(\mathcal {H}\) is useful when \(\mathtt {p_1} > \mathtt {p_2}\) and \(\mathtt {d_1} < \mathtt {d_2}\). LSH functions can be combined into more effective ones (in terms of sensitivity), as follows [23]:

Definition 2

(i) AND-construction: Given a \((\mathtt {d_1}, \mathtt {d_2}, \mathtt {p_1}, \mathtt {p_2})\)-sensitive family \(\mathcal {H}\) and an integer \(\mathtt {M}\), we can create a new LSH family \(\mathcal {G} = \{\mathtt {g}: \mathcal {D} \rightarrow \mathtt {U}^{\mathtt {M}}\}\) by concatenating \(\mathtt {M}\) LSH functions from \(\mathcal {H}\), where \(\mathtt {g}(\mathtt {p})\) and \(\mathtt {g}(\mathtt {q})\) are equal iff \(\mathtt {h_j}(\mathtt {p})\) and \(\mathtt {h_j}(\mathtt {q})\) are equal for all \(\mathtt {j} \in \{1,\cdots ,\mathtt {M}\}\), implying that \(\mathcal {G}\) is \((\mathtt {d_1}, \mathtt {d_2}, \mathtt {p_1}^{\mathtt {M}}, \mathtt {p_2}^{\mathtt {M}})\)-sensitive; (ii) OR-construction: Given an LSH family \(\mathcal {G}\) and an integer \(\mathtt {L}\), we can create a new LSH family \(\mathcal {F}\) where each \(\mathtt {f} \in \mathcal {F}\) consists of \(\mathtt {L}\) functions \(\mathtt {g_i}\) chosen independently and uniformly at random from \(\mathcal {G}\), and where \(\mathtt {f}(\mathtt {p})\) and \(\mathtt {f}(\mathtt {q})\) are equal iff \(\mathtt {g_j}(\mathtt {p})\) and \(\mathtt {g_j}(\mathtt {q})\) are equal for at least one \(\mathtt {j} \in \{1,\cdots ,\mathtt {L}\}\). Assuming \(\mathcal {G}\) is \((\mathtt {d_1}, \mathtt {d_2}, \mathtt {p_1}^{\mathtt {M}}, \mathtt {p_2}^{\mathtt {M}})\)-sensitive, \(\mathcal {F}\) is \((\mathtt {d_1}, \mathtt {d_2}, 1-(1-\mathtt {p_1}^{\mathtt {M}})^{\mathtt {L}}, 1-(1-\mathtt {p_2}^{\mathtt {M}})^{\mathtt {L}})\)-sensitive.
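To illustrate how the AND/OR constructions shape sensitivity, here is a minimal sketch (a helper of ours, not from the paper) evaluating the amplified collision probability \(1-(1-p^{\mathtt {M}})^{\mathtt {L}}\):

```cpp
#include <cmath>
#include <cstdio>

// Collision probability after AND-ing M base functions per table and
// OR-ing across L tables: 1 - (1 - p^M)^L, where p is the collision
// probability of a single base hash function.
double amplified(double p, int M, int L) {
  return 1.0 - std::pow(1.0 - std::pow(p, M), L);
}

int main() {
  // E.g., with the 2-d Euclidean family below (p1 = 1/2, p2 = 1/3), M = 5, L = 20:
  std::printf("near pairs: %.3f\n", amplified(1.0 / 2, 5, 20)); // ~0.47
  std::printf("far pairs:  %.3f\n", amplified(1.0 / 3, 5, 20)); // ~0.08
}
```

Raising \(\mathtt {M}\) separates near from far pairs more sharply, while raising \(\mathtt {L}\) recovers the collision probability for near pairs.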

LSH structure: An instance of family \(\mathcal {F}\) is implemented as \(\mathtt {L}\) hash tables, where the \(\mathtt {i}\)-th table is constructed by hashing each point in \(\mathtt {D}\) using \(\mathtt {g_i}\) [9, 17]. The resulting data structure associates each bucket with the values of the keys mapping to its index. LSH families exist for various distance functions [23].

LSH for Euclidean distance: Let \(\mathtt {u}\) be a randomly chosen unit vector in \(\mathcal {D}\). A hash function \(\mathtt {h_u}(\mathtt {x})\) in such a family is defined as \(\lfloor \frac{\mathtt {x} \cdot \mathtt {u}}{\epsilon } \rfloor \), where \(\cdot \) is the inner product and \(\epsilon \) is a constant (the bucket width). The family is applicable in any number of dimensions; in a 2-dimensional domain, it is \((\epsilon /2, 2\epsilon , 1/2, 1/3)\)-sensitive.

LSH for angular distance: Let \(\mathtt {u}\) be a randomly chosen vector in \(\mathcal {D}\). A hash function \(\mathtt {h_u}(\mathtt {x})\) in such a family is defined as \(\mathtt {sgn}(\mathtt {x} \cdot \mathtt {u})\). The family is \((\theta _1,\theta _2,1-\frac{\theta _1}{\pi },1-\frac{\theta _2}{\pi })\)-sensitive, where \(\theta _1\) and \(\theta _2\) are any two angles (in radians) such that \(\theta _1 < \theta _2\).
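For concreteness, a minimal sketch of the two base hash functions above (plain C++ with our own naming; the paper's implementation details may differ):

```cpp
#include <cmath>
#include <random>
#include <vector>

// A random direction in d dimensions; normalized, so it also serves as the
// unit vector u of the Euclidean family.
std::vector<double> randomUnitVector(int d, std::mt19937& rng) {
  std::normal_distribution<double> gauss(0.0, 1.0);
  std::vector<double> u(d);
  double norm = 0.0;
  for (auto& c : u) { c = gauss(rng); norm += c * c; }
  norm = std::sqrt(norm);
  for (auto& c : u) c /= norm;
  return u;
}

double dot(const std::vector<double>& x, const std::vector<double>& u) {
  double s = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * u[i];
  return s;
}

// Euclidean family: index of the width-eps bucket that x's projection
// along u falls into.
long euclideanHash(const std::vector<double>& x,
                   const std::vector<double>& u, double eps) {
  return static_cast<long>(std::floor(dot(x, u) / eps));
}

// Angular family: which side of the random hyperplane with normal u
// the point x lies on.
int angularHash(const std::vector<double>& x, const std::vector<double>& u) {
  return dot(x, u) >= 0 ? 1 : -1;
}
```

An AND-constructed \(\mathtt {g}\) simply concatenates \(\mathtt {M}\) such values into a composite key.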

Related Terms and Algorithms

DBSCAN: partitions \(\mathtt {D}\) into an a priori unknown number of clusters, each consisting of at least one core-point (i.e., one with at least \(\mathtt {minPts}\) points in its \(\epsilon \)-radius neighbourhood) and the points that are density-reachable from it. Point q is density-reachable from p if q is directly reachable from p (i.e., in p's \(\epsilon \)-radius neighbourhood) or from another core-point that is density-reachable from p. Non-core-points that are density-reachable from some core-point are called border points, while all remaining points are noise [26]. DBSCAN can utilize any distance function, e.g., Euclidean, Jaccard, Hamming, angular [23]. Its worst-case time complexity is \(\mathcal {O}(\mathtt {N}^2)\), but in certain cases (e.g., Euclidean distance and low-dimensional datasets) its expected complexity drops to \(\mathcal {O}(\mathtt {N} \log {\mathtt {N}})\), through indexing structures facilitating the range queries that find \(\epsilon \)-neighbours [26].

HP-DBSCAN [13]: Highly Parallel DBSCAN is an OpenMP/MPI algorithm that superimposes a hyper-grid over the input set. It distributes the points to computing units that perform local clusterings; then, the local clusters that need merging are identified, and cluster relabeling rules are broadcast and applied locally.

PDS-DBSCAN [24]: An exact parallel version of Euclidean DBSCAN that uses a spatial indexing structure for efficient range queries. It parallelizes the work by partitioning the points and merging partial clusters, maintained via a disjoint-set data structure, also known as union-find (a collection of disjoint sets, with the elements in each set connected as a directed tree). Such a data structure facilitates in-place \(\mathtt {find}\) and \(\mathtt {merge}\) operations [18], avoiding data copying. Given an element p, \(\mathtt {find}\) retrieves the root (i.e., the representative) of the tree in which p resides, while \(\mathtt {merge}\) merges the sets containing two given elements.
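To illustrate these semantics, a sequential sketch of a disjoint-set structure with path compression and union by size follows; note that the paper relies on the linearizable, wait-free concurrent variant of [18], which this sketch does not attempt to reproduce.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Sequential disjoint-set (union-find) sketch: each set is a directed tree;
// find returns the root (representative), merge links two roots.
struct DisjointSets {
  std::vector<int> parent, size;

  explicit DisjointSets(int n) : parent(n), size(n, 1) {
    std::iota(parent.begin(), parent.end(), 0); // each element is its own set
  }

  int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]]; // path halving keeps trees shallow
      x = parent[x];
    }
    return x;
  }

  void merge(int a, int b) {
    a = find(a); b = find(b);
    if (a == b) return;
    if (size[a] < size[b]) std::swap(a, b);
    parent[b] = a; // attach the smaller tree under the larger root
    size[a] += size[b];
  }
};
```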

Theoretically-Efficient and Practical Parallel DBSCAN [33]: Via a grid-based approach, this algorithm identifies core-cells and utilizes a union-find data structure to merge the neighbouring cells having points within \(\epsilon \)-radius. It uses spatial indexes to facilitate finding neighbourhood cells and answering range queries.

LSH as index for DBSCAN: LSH's potential has led other works [27, 35] to employ it purely as a means for neighbourhood queries; we refer to these as \(\mathtt {VLSHDBSCAN}\).

3 The Proposed \(\mathtt {IP.LSH.DBSCAN}\) Method

\(\mathtt {IP.LSH.DBSCAN}\) utilizes the LSH properties for parallel density-based clustering, through an efficient fusion of index construction and cluster formation. At a high level, \(\mathtt {IP.LSH.DBSCAN}\) hashes each point in \(\mathtt {D}\) into multiple hash tables, in such a way that, with high probability, points within \(\epsilon \)-distance get hashed to the same bucket at least once across all the tables. E.g., Fig. 1a shows how most nearby points in a subset of \(\mathtt {D}\) get hashed to the same buckets in two hash tables. Subsequently, the buckets containing at least \(\mathtt {minPts}\) elements are examined to find a set of candidate core-points, which are later filtered to identify the real core-points in terms of DBSCAN's definition. In Fig. 1a, the core-points are shown as bold points with a dot inside. The buckets containing core-points are characterized as core buckets. Afterwards, with the help of the hash tables, \(\epsilon \)-neighbour core-points get merged. E.g., the core-bucket in the rightmost hash table in Fig. 1a contains two core-points, indicating the possibility that they are within each other's \(\epsilon \)-neighbourhood, in which case they get merged. The merging is done using a forest of union-find data structures consisting of such core-points, which essentially represent core buckets. As we see later, multiple threads can work in parallel in these steps.

Fig. 1. 1a shows nearby points get hashed to the same bucket at least once across hash tables, w.h.p. Core-points are the bold ones with a dot inside.

Key Elements and Phases. Similar to an LSH structure (cf. Sect. 2), we utilize \(\mathtt {L}\) hash tables (\(\mathtt {hashTable}\)[1], \(\cdots \), \(\mathtt {hashTable}\)[\(\mathtt {L}\)]), each constructed using \(\mathtt {M}\) hash functions, chosen according to distance metric \(\mathtt {Dist}\) and threshold \(\epsilon \) (see Sect. 2, Sect. 4).

Definition 3

A bucket in any of the hash tables is called a candidate core-bucket if it contains at least \(\mathtt {minPts}\) elements. A candidate core-point \(\mathtt {c}\) in a candidate core-bucket \(\mathtt {ccb}\) is defined to be the closest (using function \(\mathtt {Dist}\)) point in \(\mathtt {ccb}\) to the centroid of all the points in \(\mathtt {ccb}\); we also say that \(\mathtt {c}\) represents \(\mathtt {ccb}\). A candidate core-bucket \(\mathtt {ccb}\) whose candidate core-point \(\mathtt {c}\) has at least \(\mathtt {minPts}\) neighbours within its \(\epsilon \)-radius in \(\mathtt {ccb}\) is called a core-bucket. A core-forest is a concurrent union-find structure containing core-points representing core buckets.
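A minimal sequential sketch of the selection in Definition 3 (with hypothetical types; the paper's implementation works with pointers into concurrent hash tables and an arbitrary \(\mathtt {Dist}\)):

```cpp
#include <cmath>
#include <vector>

using Point = std::vector<double>;

double euclidean(const Point& a, const Point& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(s);
}

// Candidate core-point of bucket ccb: the point closest to the centroid
// of all points in the bucket (Definition 3).
std::size_t candidateCorePoint(const std::vector<Point>& ccb) {
  const std::size_t n = ccb.size(), d = ccb[0].size();
  Point centroid(d, 0.0);
  for (const auto& p : ccb)
    for (std::size_t i = 0; i < d; ++i) centroid[i] += p[i] / n;
  std::size_t best = 0;
  for (std::size_t j = 1; j < n; ++j)
    if (euclidean(ccb[j], centroid) < euclidean(ccb[best], centroid)) best = j;
  return best;
}

// ccb is a core-bucket iff at least minPts points of ccb (the candidate
// itself included, per DBSCAN's usual convention) lie within eps of the
// candidate core-point; the candidate is then a true core-point (Lemma 1).
bool isCoreBucket(const std::vector<Point>& ccb, std::size_t cand,
                  double eps, std::size_t minPts) {
  std::size_t close = 0;
  for (const auto& p : ccb)
    if (euclidean(p, ccb[cand]) <= eps) ++close;
  return close >= minPts;
}
```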

Lemma 1

Given a core bucket, its corresponding core-point \(\mathtt {c}\) is a true core-point according to DBSCAN.

The above follows from the definition of core-point in Definition 3. Next, we present an outline of \(\mathtt {IP.LSH.DBSCAN}\)’s phases, followed by a detailed description of its parallelization and pseudo-code.

Algorithm 1. \(\mathtt {IP.LSH.DBSCAN}\) pseudo-code (two listings; referenced below by line numbers).

Parallelism and Algorithmic Implementation. We here present the parallelization in \(\mathtt {IP.LSH.DBSCAN}\) (Algorithm 1), targeting speed-up by distributing the work among \(\mathtt {K}\) threads. We also aim at in-place operations on data points and buckets (i.e., without creating additional copies), hence work with pointers to the relevant data points and buckets in the data structures.

Phase I (hashing and bucketing): To parallelize the hashing of the input dataset \(\mathtt {D}\) into \(\mathtt {L}\) hash tables, we (logically) partition \(\mathtt {D}\) into \(\mathtt {S}\) mutually disjoint batches. Consequently, we have \({\mathtt {S}{}} \times {\mathtt {L}{}}\) hash tasks, corresponding to the Cartesian product of the hash tables and the data batches. Threads can book a hash task and thus share the workload through \(\mathtt {hashTasks}\), a boolean \({\mathtt {S}{}} \times {\mathtt {L}{}}\) array containing a status for each task, initially \(\mathtt {false}\). A thread in this phase scans the elements of \(\mathtt {hashTasks}\), and if it finds a non-booked task, it tries to book the task atomically (e.g., via a \(\mathtt {CAS}\) changing the status from \(\mathtt {false}\) to \(\mathtt {true}\)). The thread that books a hash task \(\mathtt {ht_{b,t}}\) hashes each data point \(\mathtt {p}\) in batch \(\mathtt {b}\) into \(\mathtt {hashTable}\)[\(\mathtt {t}\)] using hash function \(\mathtt {g_t}\). In particular, for each point \(\mathtt {p}\), a key-value pair consisting of the hashed value of \(\mathtt {p}\) and a pointer to \(\mathtt {p}\) is inserted in \(\mathtt {hashTable}\)[\(\mathtt {t}\)]. As entries get inserted into the hash tables, pointers to buckets with at least \(\mathtt {minPts}\) points are added to \(\mathtt {candidateCoreBuckets}\). Since the threads operate concurrently, we use hash tables supporting concurrent insertions and traversals. Algorithm 1 l.8–l.14 summarizes Phase I.
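The task-booking idea can be sketched as follows (a simplified model with names of our own; the actual implementation stores the hash tables in TBB's concurrent containers):

```cpp
#include <atomic>
#include <vector>

// hashTasks: S x L booking flags, initially false. A thread that flips a
// flag from false to true via CAS owns the corresponding (batch, table) task.
struct HashTasks {
  int S, L;
  std::vector<std::atomic<bool>> booked;
  HashTasks(int s, int l) : S(s), L(l), booked(s * l) {}

  // Returns true iff the calling thread won task (b, t).
  bool tryBook(int b, int t) {
    bool expected = false;
    return booked[b * L + t].compare_exchange_strong(expected, true);
  }
};

// Each thread scans for unbooked tasks and processes the ones it wins:
// process(b, t) hashes every point of batch b with g_t and inserts
// (g_t(p), &p) into hashTable[t], a concurrent hash table.
template <class ProcessFn>
void workerLoop(HashTasks& tasks, ProcessFn process) {
  for (int b = 0; b < tasks.S; ++b)
    for (int t = 0; t < tasks.L; ++t)
      if (tasks.tryBook(b, t)) process(b, t);
}
```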

Phase II (core-point identification): Here the threads identify core-buckets and core-points. Each thread atomically pops a candidate core-bucket \(\mathtt {ccb}\) from \(\mathtt {candidateCoreBuckets}\) and identifies the closest point to the centroid of the points in \(\mathtt {ccb}\), considering it a candidate core-point \(\mathtt {ccp}\). If there are at least \(\mathtt {minPts}\) points in \(\mathtt {ccb}\) within the \(\epsilon \)-radius of \(\mathtt {ccp}\), then \(\mathtt {ccp}\) and \(\mathtt {ccb}\) become a core-point and a core-bucket, respectively; \(\mathtt {ccp}\) is inserted in the core-forest and \(\mathtt {ccb}\) in the \(\mathtt {coreBuckets}\) set. This phase, shown in Algorithm 1 l.15–l.20, finishes when \(\mathtt {candidateCoreBuckets}\) becomes empty.

Phase III (merge-task identification and processing): The threads here identify and perform merge tasks. For each core-bucket \(\mathtt {cb}\) that a thread successfully books from the set \(\mathtt {coreBuckets}\), the thread \(\mathtt {merge}\)s the sets corresponding to the core-point associated with \(\mathtt {cb}\) and any other core-point in \(\mathtt {cb}\) within \(\epsilon \) distance. For merging, the algorithm uses an established concurrent implementation of disjoint-sets with linearizable and wait-free \(\mathtt {find}\) and \(\mathtt {merge}\) operations (i.e., the effects of concurrent operations are consistent with the sequential specification, and every thread makes progress independently of the others), proposed in [18]. This phase, shown in Algorithm 1 l.21–l.24, completes when \(\mathtt {coreBuckets}\) is empty.
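Sequentially, and over a simplified model (each core-bucket reduced to the IDs of the core-points it contains, the representative first; \(\mathtt {dist}\) supplied as a callback; \(\mathtt {DisjointSets}\) being the union-find sketch shown earlier), Phase III might look as follows:

```cpp
#include <vector>

// For each core-bucket, merge the bucket's representative core-point with
// every other core-point in the same bucket within eps distance.
template <class DistFn>
void mergePhase(const std::vector<std::vector<int>>& coreBuckets,
                DisjointSets& coreForest, double eps, DistFn dist) {
  for (const auto& bucket : coreBuckets) {
    const int rep = bucket.front();           // representative core-point
    for (std::size_t j = 1; j < bucket.size(); ++j)
      if (dist(rep, bucket[j]) <= eps)        // verified eps-neighbours...
        coreForest.merge(rep, bucket[j]);     // ...join the same cluster
  }
}
```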

Phase IV (data labeling): Each non-labeled core-point in a core-bucket gets its clustering label from its root \(\mathtt {ID}\) in the core-forest. All other non-labeled points in a core-bucket are labeled with the root \(\mathtt {ID}\) of the associated core-point. The process, shown in Algorithm 1 l.25–l.30, is performed concurrently for all core-buckets.
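A matching sketch of Phase IV over the same simplified model (labels are root IDs in the core-forest; \(-1\) marks still-unlabeled points, which remain noise):

```cpp
#include <vector>

// bucketMembers[b] holds the IDs of all points hashed to core-bucket b;
// coreBuckets[b].front() is the bucket's representative core-point.
void labelPhase(const std::vector<std::vector<int>>& coreBuckets,
                const std::vector<std::vector<int>>& bucketMembers,
                DisjointSets& coreForest, std::vector<int>& label) {
  for (std::size_t b = 0; b < coreBuckets.size(); ++b) {
    const int root = coreForest.find(coreBuckets[b].front());
    for (int pid : bucketMembers[b])
      if (label[pid] < 0) label[pid] = root; // first labeling wins
  }
}
```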

4 Analysis

This section analytically studies \(\mathtt {IP.LSH.DBSCAN}\)’s accuracy, safety and completeness properties, and completion time. We provide sketch proofs to save space, but formal proofs can be found in [19]. Figure 1b summarizes the notations.

Accuracy Analysis

Let \(\mathtt {p_\epsilon }\) be the probability that two given points with maximum distance \(\epsilon \) have the same hash value using \(\mathcal {H}\) (see Definition 1). Lemma 1 shows that any point identified as a core-point by \(\mathtt {IP.LSH.DBSCAN}\) is a true core-point in terms of DBSCAN. The following Lemma provides a lower bound on the probability that \(\mathtt {IP.LSH.DBSCAN}\) identifies any DBSCAN core-point.

Lemma 2

Let \(\mathtt {c}\) be a DBSCAN core-point. The probability that \(\mathtt {IP.LSH.DBSCAN}\) also identifies \(\mathtt {c}\) as a core-point is at least \(1 - \left( 1 - \frac{1}{\chi } \mathtt {p_\epsilon }^{\mathtt {M}(\mathtt {minPts}-1)}\right) ^{\mathtt {L}}\), where \(\chi \) denotes the maximum number of points in any bucket.

Proof. (sketch) First note that with \(\mathtt {M}\) hash functions per table (i.e., the AND-construction in Definition 2), the probability that two given points with maximum distance \(\epsilon \) collide into the same bucket of a fixed hash table is \(\mathtt {p_\epsilon }^{\mathtt {M}}\). The probability that \(\mathtt {c}\) gets identified as a core-point in a fixed hash table, whose bucket containing \(\mathtt {c}\) we denote by \(\mathtt {b}\), is at least \(\frac{1}{\chi } \mathtt {p_\epsilon }^{\mathtt {M}(\mathtt {minPts}-1)}\): at least \(\mathtt {minPts}-1\) \(\epsilon \)-neighbours of \(\mathtt {c}\) must get hashed to \(\mathtt {b}\), and \(\mathtt {c}\) must be the closest point to the centroid of the points in \(\mathtt {b}\), which happens with probability at least \(\frac{1}{\chi }\). Finally, the probability that \(\mathtt {c}\) gets identified as a core-point in at least one hash table is computed as the complement of the probability that \(\mathtt {c}\) does not get identified as a core-point in any hash table.

Lemma 2 shows that the probability of identifying any DBSCAN core-point can be made arbitrarily close to 1 by choosing a sufficiently large \(\mathtt {L}\).

Lemma 3

Let \(\mathtt {c_1}\) and \(\mathtt {c_2}\) be two core-points identified by \(\mathtt {IP.LSH.DBSCAN}\).

  1. If \(\mathtt {Dist}(\mathtt {c_1}, \mathtt {c_2}) > \epsilon \), then \(\mathtt {IP.LSH.DBSCAN}\) does not merge \(\mathtt {c_1}\) and \(\mathtt {c_2}\).

  2. If \(\mathtt {Dist}(\mathtt {c_1}, \mathtt {c_2}) \le \epsilon \), then the probability that \(\mathtt {IP.LSH.DBSCAN}\) merges \(\mathtt {c_1}\) and \(\mathtt {c_2}\) is at least \(2\mathtt {p_\epsilon }^{\mathtt {M}} - \mathtt {p_\epsilon }^{2\mathtt {M}}\).

Proof. (sketch) 1. follows directly from \(\mathtt {IP.LSH.DBSCAN}\)'s algorithmic description (see Algorithm 1 l.24). 2. follows from calculating the probability that \(\mathtt {c_2}\) hashes into the same bucket in which \(\mathtt {c_1}\) is the representative, or vice versa: each of the two collision events occurs with probability \(\mathtt {p_\epsilon }^{\mathtt {M}}\), so the probability that at least one occurs is \(1-(1-\mathtt {p_\epsilon }^{\mathtt {M}})^2 = 2\mathtt {p_\epsilon }^{\mathtt {M}} - \mathtt {p_\epsilon }^{2\mathtt {M}}\).

Safety and Completeness Properties

At the end of phase IV, each set in the core-forest maintained by \(\mathtt {IP.LSH.DBSCAN}\) contains a subset of density-reachable core-points (as defined in Sect. 2). Two disjoint-set structures \(ds_1, ds_2\) are equivalent if there is a one-to-one correspondence between \(ds_1\)’s and \(ds_2\)’s sets. The following lemma implies that the outcomes of single-threaded and concurrent executions of \(\mathtt {IP.LSH.DBSCAN}\) are equivalent.

Lemma 4

Any pair of concurrent executions of \(\mathtt {IP.LSH.DBSCAN}\) that use the same hash functions produce equivalent core-forests at the end of phase IV.

Proof. (sketch) For a fixed instance of the problem and the same hash functions, any concurrent execution of \(\mathtt {IP.LSH.DBSCAN}\) identifies the same set of core-points and core-buckets, hence performs the same set of \(\mathtt {merge}\) operations. As the concurrent executions of \(\mathtt {merge}\) operations are linearizable (see Sect. 3) and \(\mathtt {merge}\) is associative and commutative, the resulting sets in the core-forest are identical for any concurrent execution.

It is worth noting that border points (i.e., non-core-points within the vicinity of multiple core-points) can be assigned to any of the neighbouring clusters; the original DBSCAN [11] exhibits the same behaviour.

Completion Time Analysis

Lemma 5

[adapted from Theorem 2 in [18]] The probability that each \(\mathtt {findRoot}\) and each \(\mathtt {merge}\) performs \(\mathcal {O}(\log \mathtt {C})\) steps is at least \(1-\frac{1}{\mathtt {C}}\), where \(\mathtt {C}\) is the number of core-points identified by \(\mathtt {IP.LSH.DBSCAN}\).

Corollary 1

The expected asymptotic time complexity of each \(\mathtt {findRoot}\) and each \(\mathtt {merge}\) is \(\mathcal {O}\left( \log \mathtt {C}\right) \).

Lemma 6

The expected completion time of phase I is \(\mathcal {O}(\frac{\mathtt {L}\mathtt {M}\mathtt {N}\mathtt {d}}{\mathtt {K}})\); that of phases II and III is bounded by \(\mathcal {O}(\frac{\mathtt {L}\mathtt {N}\log \mathtt {C}}{\mathtt {K}})\); that of phase IV is \(\mathcal {O}(\frac{\mathtt {N}\log \mathtt {C}}{\mathtt {K}})\).

Theorem 1

The expected completion time of \(\mathtt {IP.LSH.DBSCAN}\) is \(\mathcal {O}(\frac{\mathtt {L}\mathtt {M}\mathtt {N}\mathtt {d}+\mathtt {L}\mathtt {N}\log \mathtt {C}}{\mathtt {K}})\).

Theorem 1 is derived by taking the asymptotically dominant terms in Lemma 6. It shows that \(\mathtt {IP.LSH.DBSCAN}\)'s expected completion time is inversely proportional to \(\mathtt {K}\) and grows linearly in \(\mathtt {N}\), \(\mathtt {d}\), \(\mathtt {L}\), and \(\mathtt {M}\). In common cases, where \(\mathtt {C}\) is much smaller than \(\mathtt {N}\), the expected completion time is \(\mathcal {O}(\frac{\mathtt {L}\mathtt {M}\mathtt {N}\mathtt {d}}{\mathtt {K}})\); in the worst case, where \(\mathtt {C}\) is \(\mathcal {O}\left( \mathtt {N}\right) \), it is \(\mathcal {O}(\frac{\mathtt {L}\mathtt {M}\mathtt {N}\mathtt {d}+\mathtt {L}\mathtt {N}\log \mathtt {N}}{\mathtt {K}})\). For the latter to happen, \(\epsilon \) and \(\mathtt {minPts}\) would need to be extremely small and \(\mathtt {L}\) extremely large; as DBSCAN's density parameters are chosen to detect meaningful clusters, such choices are avoided in practice.

On the memory use of \(\mathtt {IP.LSH.DBSCAN}\): The memory footprint of \(\mathtt {IP.LSH.DBSCAN}\) is proportional to \(\left( \mathtt {L}\mathtt {N} + \mathtt {N}\mathtt {d}\right) \), as it needs only one copy of each data point plus the pointers in the hash tables, which dominate the overhead of all other utilized data structures. Further, in-place operations ensure that data is not copied and transferred unnecessarily, a significant factor for efficiency. The effect of these properties is discussed in Sect. 5.

Choice of \(\mathtt {L}\) and \(\mathtt {M}\): For an LSH structure, the probability that two points hash into the same bucket, plotted as a function of their distance, resembles an inverse s-curve: it starts at 1 for points at distance 0, declines steeply around some threshold, and approaches 0 for far-apart points. The choices of \(\mathtt {L}\) and \(\mathtt {M}\) directly influence the shape of this curve, particularly the location of the threshold and the sharpness of the decline [23]. Steeper declines generally yield more accurate LSH structures, at the expense of larger \(\mathtt {L}\) and \(\mathtt {M}\) values. Consequently, in \(\mathtt {IP.LSH.DBSCAN}\), \(\mathtt {L}\) and \(\mathtt {M}\) must be chosen to (i) place the threshold at \(\epsilon \), and (ii) balance the trade-off between the steepness of the decline and the completion time. In Sect. 5, we study a range of \(\mathtt {L}\) and \(\mathtt {M}\) values and their implications for the trade-off between \(\mathtt {IP.LSH.DBSCAN}\)'s accuracy and completion time.

5 Evaluation

We conduct an extensive evaluation of \(\mathtt {IP.LSH.DBSCAN}\), comparing it with established state-of-the-art algorithms. Our implementation is publicly available [19]. Complementing Theorem 1, we measure the execution latency with varying numbers of threads (\(\mathtt {K}\)), data points (\(\mathtt {N}\)), dimensions (\(\mathtt {d}\)), hash tables (\(\mathtt {L}\)), and hash functions per table (\(\mathtt {M}\)). We use varying \(\epsilon \) values, as well as Euclidean and angular distances. We measure \(\mathtt {IP.LSH.DBSCAN}\)'s accuracy against the exact DBSCAN (hence also the baseline state-of-the-art algorithms) using the rand index.

Setup: We implemented \(\mathtt {IP.LSH.DBSCAN}\) in C++, using POSIX threads and the concurrent hash table of Intel's threading building blocks (TBB) library. We used a c5.18xlarge AWS machine, with 144 GB of memory and two Intel Xeon Platinum 8124M CPUs, with 36 two-way hyper-threaded cores [33] in total.

Tested Methods: In addition to \(\mathtt {IP.LSH.DBSCAN}\), we benchmark PDSDBSCAN [24], HPDBSCAN [13], and the exact algorithm in [33], for which we use the label TEDBSCAN (Theoretically-Efficient and Practical Parallel DBSCAN). As the approximate algorithms in [33] are generally not faster than their exact counterpart (see Fig. 9 and the discussion on p. 13 in [33]), we consider their efficiency to be represented by the exact TEDBSCAN. We also benchmark \(\mathtt {VLSHDBSCAN}\), our version of a single-thread DBSCAN that uses LSH indexing, as we did not find open implementations of [27, 35]. Benchmarking \(\mathtt {VLSHDBSCAN}\) allows a comparison regarding the approximation degree, as well as the efficiency induced by \(\mathtt {IP.LSH.DBSCAN}\)'s fused approach. Section 2 covers the aforementioned algorithms.

Evaluation Data and Parameters

Following common practices [13, 28, 33], we use datasets with different characteristics. We use varying \(\epsilon \) but fixed \(\mathtt {minPts}\), as the sensitivity to the latter is significantly smaller [28]. We also follow earlier works' common practice of aborting any execution that exceeds a certain bound, here \(9\,\times \,10^5\) sec (over 250 h). We introduce the datasets and the chosen values for \(\epsilon \) and \(\mathtt {minPts}\), as well as the choices for \(\mathtt {L}\) and \(\mathtt {M}\), based on the corresponding discussion in Sect. 4 and the literature guidelines (e.g., [23] and the references therein). The default \(\epsilon \) values are shown in italics.

TeraClickLog [33]: Each point in this dataset corresponds to a display ad by Criteo with 13 integer and 26 categorical features. We use a subset with over 67 million points, free from missing features. Like [33], we only consider the integer features, and we choose \(\epsilon \) from \(\{1500 , 3000, 6000, 12000\}\) and \(\mathtt {minPts}\) 100.

Household [12]: This is an electricity consumption dataset with over two million points, each being seven-dimensional after removing the date and time features (as suggested in [12]). Following the practice in [12, 33], we scale each feature to [0,10000] interval and choose \(\epsilon \) from \(\{1500, 2000 , 2500, 3000\}\) and \(\mathtt {minPts}\) 100.

GeoLife [36]: From this GPS trajectory dataset, we choose ca 1.5 million points as selected in [20], containing latitude and longitude with a highly skewed distribution. We choose \(\epsilon \) from \(\{0.001 ,0.002,0.004,0.008\}\) and \(\mathtt {minPts}\) 500, like [20].

MNIST: This dataset contains 70000 28\(\,\times \,\)28-pixel hand-written and labelled 0–9 digits [22]. We treat each record as a 784-dimensional data point, normalizing each point to have a unit length (similar to [30]). We utilize the angular distance. Following [29], we choose \(\epsilon \) from \(\{0.18\pi , 0.19\pi , 0.20 \pi , 0.21\pi \}\) and \(\mathtt {minPts}\) 100.

The heat-maps in Fig. 2a–Fig. 2d visualize \(\mathtt {IP.LSH.DBSCAN}\)'s rand index accuracy as a function of \(\mathtt {L}\) and \(\mathtt {M}\). For TeraClickLog (Fig. 2a), \(\{\) \(\mathtt {L}\) = 5, \(\mathtt {M}\) = 5\(\}\), \(\{\) \(\mathtt {L}\) = 10, \(\mathtt {M}\) = 5\(\}\), and \(\{\) \(\mathtt {L}\) = 20, \(\mathtt {M}\) = 5\(\}\) give 0.98, 0.99, and 1 accuracies, respectively. For Household (Fig. 2b), \(\{\) \(\mathtt {L}\) = 5, \(\mathtt {M}\) = 5\(\}\), \(\{\) \(\mathtt {L}\) = 10, \(\mathtt {M}\) = 5\(\}\), and \(\{\) \(\mathtt {L}\) = 20, \(\mathtt {M}\) = 5\(\}\) give 0.92, 0.94, and 0.95 accuracies, respectively. For GeoLife (Fig. 2c), \(\{\) \(\mathtt {L}\) = 5, \(\mathtt {M}\) = 2\(\}\), \(\{\) \(\mathtt {L}\) = 10, \(\mathtt {M}\) = 2\(\}\), and \(\{\) \(\mathtt {L}\) = 20, \(\mathtt {M}\) = 2\(\}\) give 0.8, 0.85, and 0.89 accuracies, respectively. For the MNIST dataset (Fig. 2d), \(\{\) \(\mathtt {L}\) = 58, \(\mathtt {M}\) = 9\(\}\), \(\{\) \(\mathtt {L}\) = 116, \(\mathtt {M}\) = 9\(\}\), and \(\{\) \(\mathtt {L}\) = 230, \(\mathtt {M}\) = 9\(\}\) give 0.77, 0.85, and 0.89 accuracies, respectively, computed with respect to the actual labels.

Fig. 2. Visualizing the rand index accuracy of \(\mathtt {IP.LSH.DBSCAN}\) as a function of \(\mathtt {L}\) and \(\mathtt {M}\)

Experiments for the Euclidean Distance

Completion Time with Varying \(\mathtt {K}\): Figure 3a, Fig. 3b, and Fig. 3c show the completion time of \(\mathtt {IP.LSH.DBSCAN}\) and the other methods with varying \(\mathtt {K}\) on the TeraClickLog, Household, and GeoLife datasets, respectively. PDSDBSCAN runs out of memory on TeraClickLog for all \(\mathtt {K}\) and on GeoLife for \(\mathtt {K}\)\(\,\,\ge 4\), and none of HPDBSCAN's executions terminate within the \(9\,\times \, 10^5\) sec threshold. For reference, the completion time of single-thread \(\mathtt {VLSHDBSCAN}\) is provided in the caption of Fig. 3 for each dataset, except for TeraClickLog, on which it exceeds \(9\,\times \,10^5\) sec. The results indicate the benefits of parallelization for workload distribution in \(\mathtt {IP.LSH.DBSCAN}\), also validating that \(\mathtt {IP.LSH.DBSCAN}\)'s completion time behaves linearly with respect to \(\mathtt {L}\), as shown in Theorem 1. For higher dimensionalities, which challenge the state-of-the-art algorithms, \(\mathtt {IP.LSH.DBSCAN}\)'s completion time is several orders of magnitude shorter.

Completion Time with Varying \(\epsilon \): The left Y-axes in Fig. 4a, Fig. 4b, and Fig. 4c show the completion time of \(\mathtt {IP.LSH.DBSCAN}\) and the other tested methods using 36 cores with varying \(\epsilon \) values on the TeraClickLog, Household, and GeoLife datasets, respectively. PDSDBSCAN crashes by running out of memory on TeraClickLog and GeoLife for all \(\epsilon \), and none of HPDBSCAN's executions terminate within the \(9\,\times \, 10^5\) sec threshold. The right Y-axes in Fig. 4a, Fig. 4b, and Fig. 4c show the corresponding rand index accuracy of \(\mathtt {IP.LSH.DBSCAN}\). The results show that, in general, the completion time of \(\mathtt {IP.LSH.DBSCAN}\) decreases with increasing \(\epsilon \): intuitively, hashing points into larger buckets results in a lower \(\mathtt {merge}\) workload. Similar benefits, although with higher completion times, are seen for TEDBSCAN. On the other hand, as the results show, the completion time of many classical methods (such as HPDBSCAN and PDSDBSCAN) increases with increasing \(\epsilon \).

Fig. 3. Completion time with varying \(\mathtt {K}\). The comma-separated values for \(\mathtt {IP.LSH.DBSCAN}\) and \(\mathtt {VLSHDBSCAN}\) show \(\mathtt {L}\), \(\mathtt {M}\), and the rand index accuracy, respectively. PDSDBSCAN crashes by running out of memory in 3a for all \(\mathtt {K}\) and in 3c for \(\mathtt {K}\)\(\,\,\ge 4\). In 3a, no HPDBSCAN execution terminates within the \(9\,\times \, 10^5\)-sec threshold.

Fig. 4. Completion time using varying \(\epsilon \) with 36 cores. PDSDBSCAN crashes by running out of memory in 4a and 4c for all \(\epsilon \). None of HPDBSCAN's executions terminate within the \(9\,\times \, 10^5\)-sec threshold in 4a. Right Y-axes show \(\mathtt {IP.LSH.DBSCAN}\)'s rand index.

Fig. 5. Completion time with varying \(\mathtt {N}\) using 36 cores. PDSDBSCAN runs out of memory in 5a with \(\mathtt {N}\)\(\,>\,\)0.1 million points and in 5c with \(\mathtt {N}\)\(\,>\,\)1 million points. \(\mathtt {IP.LSH.DBSCAN}\) and TEDBSCAN coincide in 5c. Right Y-axes show \(\mathtt {IP.LSH.DBSCAN}\)'s rand index.

Fig. 6. MNIST results with the angular distance (only \(\mathtt {IP.LSH.DBSCAN}\), \(\mathtt {VLSHDBSCAN}\), and DBSCAN support the angular distance). 6a shows \(\mathtt {IP.LSH.DBSCAN}\)'s completion time with varying \(\mathtt {K}\). The left Y-axes in 6b and 6c respectively show \(\mathtt {IP.LSH.DBSCAN}\)'s completion time with varying \(\epsilon \) and \(\mathtt {N}\), using 36 cores. The right Y-axes in 6b and 6c show the associated accuracy, computed with respect to the actual labels.

Completion Time with Varying \(\mathtt {N}\): The left Y-axes in Fig. 5a, Fig. 5b, and Fig. 5c show the completion time of the benchmarked methods using 36 cores on varying-size subsets of the TeraClickLog, Household, and GeoLife datasets, respectively. PDSDBSCAN runs out of memory on TeraClickLog subsets with \(\mathtt {N}\)\(\,>\,\)0.1 million points and GeoLife subsets with \(\mathtt {N}\)\(\,>\,\)1 million points. The results empirically validate that the completion time of \(\mathtt {IP.LSH.DBSCAN}\) exhibits linear growth in the number of data points, complementing Theorem 1. The right Y-axes in Fig. 5a, Fig. 5b, and Fig. 5c show the corresponding rand index accuracy of \(\mathtt {IP.LSH.DBSCAN}\).

Experiments for the Angular Distance. For very high numbers of dimensions, as a side-effect of the dimensionality curse, the Euclidean distances among all pairs of points become almost equal [23]. To overcome this, we use the angular distance and study only the methods that support it: \(\mathtt {IP.LSH.DBSCAN}\), \(\mathtt {VLSHDBSCAN}\), and \(\mathtt {DBSCAN}\). Here, accuracy is calculated against the actual labels. Figure 6a shows the completion time with varying \(\mathtt {K}\). The left Y-axes in Fig. 6b and Fig. 6c respectively show the completion time with varying \(\epsilon \) and \(\mathtt {N}\), using 36 cores. The right Y-axes in Fig. 6b and Fig. 6c show the associated accuracies. Note that \(\mathtt {IP.LSH.DBSCAN}\)'s completion time is more than 4 orders of magnitude shorter than that of a sequential DBSCAN and more than 3 orders of magnitude shorter than that of \(\mathtt {VLSHDBSCAN}\). Here, too, the results align with and complement Theorem 1's analysis.

Discussion of Results. \(\mathtt {IP.LSH.DBSCAN}\) targets high-dimensional, memory-efficient clustering for various distance measures. Its completion time is several orders of magnitude shorter than that of state-of-the-art counterparts, while ensuring approximation with tunable accuracy and showing efficiency for lower-dimensional data too. In practice, \(\mathtt {IP.LSH.DBSCAN}\)'s completion time exhibits linear behaviour with respect to the number of points, even for skewed data distributions and varying density parameters. The benefits of \(\mathtt {IP.LSH.DBSCAN}\) over other algorithms increase with increasing data dimensionality. \(\mathtt {IP.LSH.DBSCAN}\) scales both with the size of the input and with its dimensionality.

6 Other Related Work

Having compared \(\mathtt {IP.LSH.DBSCAN}\) with representative state-of-the-art algorithms in Sect. 5, we focus here on related work considering approximation. Gan et al. and Wang et al. [12, 33] proposed approximate DBSCAN clustering for the low-dimensional Euclidean case, with \({\mathcal {O}}({\mathtt {N}}^2)\) complexity if \(2^{\mathtt {d}}>{\mathtt {N}{}}\) [7]. The PARMA-CC approach [20] is likewise suitable only for low-dimensional data. \(\mathtt {VLSHDBSCAN}\) [27, 35] uses LSH for neighbourhood queries; in \(\mathtt {IP.LSH.DBSCAN}\), by contrast, the LSH index creation is embedded into the dynamics of cluster formation: \(\mathtt {IP.LSH.DBSCAN}\) iterates over buckets and applies merges on core-points that represent bigger entities, drastically reducing the search complexity. Also, \(\mathtt {IP.LSH.DBSCAN}\) is a concurrent rather than a single-thread algorithm. Esfandiari et al. [10] propose an almost-linear approximate DBSCAN that identifies core-points by mapping points into hyper-cubes and counting the points in them, and uses LSH to find and merge nearby core-points. \(\mathtt {IP.LSH.DBSCAN}\) integrates core-point identification and merging in one structure altogether, leading to better efficiency and flexibility in leveraging the desired distance function.

7 Conclusions

We have proposed \(\mathtt {IP.LSH.DBSCAN}\), a simple and efficient method combining insights on DBSCAN with features of LSH. It offers approximation with tunable accuracy and high parallelism, avoiding the exponential growth of the search effort with the number of data dimensions, thus scaling both with the size of the input and with its dimensionality, and dealing with high skewness in a memory-efficient way. We expect \(\mathtt {IP.LSH.DBSCAN}\) to support applications in the evolving landscape of cyber-physical system data pipelines that aggregate information from large, high-dimensional, highly skewed datasets [14, 16]. We also expect this methodology to be useful for partitioning data for other types of graph processing; this direction is worth investigating as an extension of \(\mathtt {IP.LSH.DBSCAN}\).