New plane-sweep algorithms for distance-based join queries in spatial databases

Roumelis, George; Corral, Antonio; Vassilakopoulos, Michael; Manolopoulos, Yannis

doi:10.1007/s10707-016-0246-1

Published: 27 February 2016

Volume 20, pages 571–628, (2016)

GeoInformatica

George Roumelis¹,
Antonio Corral²,
Michael Vassilakopoulos ORCID: orcid.org/0000-0003-2256-5523³ &
…
Yannis Manolopoulos¹

Abstract

Efficient and effective processing of the distance-based join query (DJQ) is of great importance in spatial databases due to the wide range of applications that may pose such queries (mapping, urban planning, transportation planning, resource management, etc.). The most representative and studied DJQs are the K Closest Pairs Query (KCPQ) and the εDistance Join Query (εDJQ). These spatial queries involve two spatial data sets and a distance function to measure the degree of closeness, along with a given number of pairs in the final result (K) or a distance threshold (ε). In this paper, we propose four new plane-sweep-based algorithms for KCPQs and their extensions for εDJQs in the context of spatial databases, without the use of an index for either of the two disk-resident data sets (since building and using indexes does not always benefit processing performance). They employ a combination of plane-sweep algorithms and space partitioning techniques to join the data sets. Finally, we present the results of an extensive experimental study that compares the efficiency and effectiveness of the proposed algorithms for KCPQs and εDJQs. This performance study, conducted on medium and big spatial data sets (real and synthetic), validates that the proposed plane-sweep-based algorithms are very promising in terms of both efficiency and effectiveness when neither input is indexed. Moreover, the best of the new algorithms is experimentally compared to the best algorithm that is based on the R-tree (a widely accepted access method), for KCPQs and εDJQs, using the same data sets. This comparison shows that the new algorithms outperform R-tree based algorithms in most cases.

1 Introduction

A Spatial Database is a database system that offers spatial data types in its data model and query language, and it supports spatial data types in its implementation, providing at least spatial indexing and efficient spatial query processing [2]. In a computer system, these spatial data are represented by points, line-segments, regions, polygons, volumes and other kinds of 2-d/3-d geometric entities and are usually referred to as spatial objects. For example, a spatial database may contain polygons that represent building footprints from a satellite image, or points that represent the positions of cities, or line segments that represent roads. Spatial databases include specialized systems like Geographical databases, CAD databases, Multimedia databases, Image databases, etc. Recently, the role of spatial databases has been continuously increasing in many modern applications, e.g. mapping, urban planning, transportation planning, resource management, geomarketing and environmental modeling.

The most basic use of such a system is for answering spatial queries related to the spatial properties of the data. Some typical spatial queries are: point query, range query, spatial join, and nearest neighbor query [3]. One of the most frequent spatial queries in spatial database systems is the spatial join, which finds all pairs of spatial objects from two spatial data sets that satisfy a spatial predicate, 𝜃. Some examples of the spatial predicate 𝜃 are: intersects, contains, is_enclosed_by, distance, adjacent, meets, etc. [4]; when 𝜃 is a distance, we have distance-based join queries (DJQs). The most representative and studied DJQs in the spatial database field are the K Closest Pairs Query (KCPQ) and the εDistance Join Query (εDJQ). The KCPQ combines join and nearest neighbor queries: like a join query, all pairs of objects are candidates for the final result, and like a nearest neighbor query, the K Nearest Neighbor property is the basis for the final ordering [5, 6]. The εDJQ, also known as Range Distance Join, also involves two spatial data sets and a distance threshold ε, and it reports the set of pairs of objects, one from each input set, that are within distance ε of each other. DJQs are very useful in many applications that use spatial data for decision making and other demanding data handling operations. For example, we can use two spatial data sets that represent the cultural landmarks and the most populated places of the United States of America. A KCPQ (K = 10) can discover the 10 closest pairs of cities and cultural landmarks, in increasing order of their distances. On the other hand, an εDJQ (ε = 10) will return all possible pairs (populated place, cultural landmark) that are within 10 kilometers of each other.

The distance functions are typically based on a distance metric (satisfying the non-negativity, identity, symmetry and triangle-inequality properties) defined on points in the data space. A general distance metric is the L_t-distance, or Minkowski distance, between two points in the d-dimensional data space, D^d. For t = 2 we have the Euclidean distance, for t = 1 the Manhattan distance and for $t = \infty $ the Maximum distance. These are the best-known L_t-distances. Often, the Euclidean distance is used as the distance function but, depending on the application, other distance functions may be more appropriate. The d-dimensional Euclidean space, E^d, is the pair $(\mathrm {D}^{d}, \mathrm {L}_{2})$. That is, E^d is D^d with the Euclidean distance L₂. In the following we will use dist instead of L₂ as the Euclidean distance between two points in E^d, and this will be the basis for the DJQs studied in this paper.
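As a small illustration of the L_t-distances mentioned above, the following Python sketch computes the Minkowski distance for a given t, with the Maximum distance as the limiting case. The function names `minkowski` and `dist` are chosen here for illustration only and are not taken from the paper.

```python
import math
from typing import Sequence


def minkowski(p: Sequence[float], q: Sequence[float], t: float) -> float:
    """L_t-distance between two d-dimensional points; t may be math.inf."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimensionality")
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(t):
        return max(diffs)                      # Maximum distance (t = infinity)
    return sum(d ** t for d in diffs) ** (1.0 / t)


def dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance, i.e. the L_2-distance."""
    return minkowski(p, q, 2)
```

For the points (0, 0) and (3, 4), this gives 7 for t = 1, 5 for t = 2 and 4 for t = ∞.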

One of the most important techniques in the computational geometry field is the plane-sweep algorithm, a type of algorithm that uses a conceptual sweepline to solve various problems in the Euclidean plane, E² [7]. The name plane-sweep is derived from the idea of sweeping the plane from left to right with a vertical line (front), stopping at every transition point of a geometric configuration to update the front. All processing is carried out with respect to this moving front, without any backtracking, with a look-ahead on only one point each time [8]. The plane-sweep technique has been successfully applied in spatial query processing, mainly for intersection joins, regardless of whether both spatial data sets are indexed or not [9]. In the context of DJQs, the plane-sweep technique has been used to restrict the combinations of pairs of points from the two data sets that must be examined. That is, using this technique instead of the brute-force nested loop algorithm has been proven to reduce the number of Euclidean distance computations [6, 10], and thus the execution time of the query processing.

It is generally accepted that indexing is crucial for efficient processing of spatial queries. Moreover, it is well known that a spatial join is generally fastest if both data sets are indexed. However, there are many situations where indexing does not necessarily pay off. In particular, the time needed to build the index before the execution of the spatial query plays an important role in the global performance of spatial database systems. For instance, if the output of a spatial query serves as input to another spatial query, and such an output is not reused several times for subsequent spatial queries, then it may not be worthwhile to spend the time building a new index. This is especially true for spatial intersection joins that make use of indexes which need a long time to be built (e.g. R*-tree [11]) [12]. For these reasons, the time necessary to build the indexes is an important constraint, especially if the input data sets are not used often for spatial query processing. Thus the main motivation of this article is to propose new algorithms for DJQs (the KCPQ and εDJQ) on disk-resident data, when neither input is indexed, and to study their behavior in the context of spatial databases. Our proposal is also motivated by the work of [13, 14] for spatial intersection joins.

Nowadays, query processing without indexes is not infrequent in practical applications: when the data sets change at a very rapid rate, or are not reused for subsequent queries, the use of indexes can be omitted. Moreover, disk-based solutions are necessary, since the main memory of a computing system is, in many cases, shared among applications, and it is usually not enough to hold big data (although main memory increases in size and decreases in cost, acquired data, for example scientific data, increase at higher rates than main memory). As a possible application scenario, consider very big cadastre or urban planning data sets with spatial and non-spatial characteristics. Big subsets of the data sets may be formed by considering certain (mainly non-spatial) characteristics of the stored properties or buildings (like properties owned by the state, buildings higher than 50 meters, constructions older than 50 years or built under an obsolete anti-seismic construction standard, large non-built-up areas, etc.). These subsets (big and not storable in main memory) are dynamic, or non-reusable, in the sense that an engineer or an official may create them by setting conditions for certain characteristics, use them to answer a query, modify these conditions (and the created subsets), answer this query again, and so on. In the process of conducting a study, like an emergency planning study, the DJQ of interest might be to find pairs of buildings vulnerable to earthquakes and earthquake-safe public buildings that could temporarily host people, within a limited distance.

This paper substantially extends our previous work [1] and its contributions are summarized as follows:

1. We present theorems (the proofs of these theorems are included in [15]) regarding the correctness of both algorithms for KCPQ, that is, the Classic Circle Plane-Sweep (CCPS) and Reverse Run Circle Plane-Sweep (RCPS) algorithms. They are the basis of the following algorithms for DJQs, when neither input is indexed and the data are stored on disk.
2. There are many contributions in the context of spatial intersection joins when both, one, or neither of the inputs are indexed. For DJQs, most of the contributions have been proposed for the case when both inputs are indexed (mainly using R-trees for KCPQ). For this reason, in this article we propose four algorithms (FCCPS, SCCPS, FRCPS and SRCPS) for KCPQs and their extensions for εDJQs, without the use of an index on either of the two disk-resident data sets. These algorithms employ a combination of the plane-sweep algorithms (CCPS and RCPS) and space partitioning techniques (uniform splitting and uniform filling) to join the disk-resident data sets.
3. We present the results of an extensive experimental study that compares the performance (in terms of efficiency and effectiveness) of the proposed algorithms.
4. We also compare the performance (efficiency) of the best of the new algorithms to the best algorithm that is based on the R-tree (a widely accepted access method).

The rest of this paper is organized as follows. Section 2 defines the KCPQ and εDJQ, which are the queries studied in this paper, in the context of spatial databases. Moreover, a classification of spatial join and distance-based join queries, taking into account whether both, one, or neither of the inputs are indexed, is presented. The Classic Plane-Sweep algorithm for DJQs is described in Section 3, as well as two improvements to reduce the number of distance computations. In Section 4, the new plane-sweep algorithm (Reverse Run Plane-Sweep, RRPS) for KCPQ is presented. In Section 5, we present and analyse the new plane-sweep-based algorithms for the KCPQ and εDJQ. Section 6 presents the results of an extensive experimental study, taking into account different parameters for comparison, as well as the results of an extensive experimental comparison between the best of the new algorithms and the best R-tree based algorithm. Section 7 contains some concluding remarks and makes suggestions for future research.

2 Preliminaries and related work

Given two spatial data sets and a distance function to measure the degree of closeness, DJQs between pairs of spatial objects are important join queries that have been actively studied in recent years. Section 2.1 defines the KCPQ and εDJQ, which are the kernel of this paper. Section 2.2 describes a classification of spatial join and distance-based join queries, taking into account whether both, one, or neither of the inputs are indexed, along with a review of other recent contributions related to these DJQs.

2.1 K closest pairs query and εdistance join query

In spatial database applications, the nearness or farness of spatial objects is examined by performing distance-based queries (DBQs). The best-known DBQs in the spatial database framework when a single spatial data set is involved are the range query (RQ) and the K Nearest Neighbors query (KNNQ). When we have two spatial data sets, the most representative DBQs are the K Closest Pairs Query (KCPQ) and the εDistance Join Query (εDJQ). They are considered DJQs, because they involve two different spatial data sets and use distance functions to measure the degree of nearness between spatial objects. The former reports only the top K pairs, and the latter, also known as Range Distance Join, finds all the possible pairs of spatial objects having a distance between ε₁ and ε₂ from each other (ε₁ ≤ ε₂). Their formal definitions for point data sets (the extension of these definitions to other complex spatial objects is straightforward) are the following:

Definition 1

(K Closest Pairs Query, KCPQ) Let $P = \{p_{0}, p_{1}, \cdots , p_{n-1}\}$ and $Q = \{q_{0}, q_{1}, \cdots , q_{m-1}\}$ be two sets of points in E^d, and a natural number K ($K \in \mathbb {N}, K > 0$). The K Closest Pairs Query (KCPQ) of P and Q ($KCPQ(P, Q, K) \subseteq P \times Q$) is a set of K different ordered pairs $KCPQ(P, Q, K) = \{(p_{Z_{1}}, q_{L_{1}}), (p_{Z_{2}}, q_{L_{2}}), \cdots , (p_{Z_{K}}, q_{L_{K}})\}$, with $(p_{Z_{i}}, q_{L_{i}}) \neq (p_{Z_{j}}, q_{L_{j}})$, $Z_{i} \neq Z_{j} \wedge L_{i} \neq L_{j}$, such that for any $(p, q) \in P \times Q - \{(p_{Z_{1}}, q_{L_{1}}), (p_{Z_{2}}, q_{L_{2}}), \cdots , (p_{Z_{K}}, q_{L_{K}})\}$ we have $dist(p_{Z_{1}}, q_{L_{1}}) \leq dist(p_{Z_{2}}, q_{L_{2}}) \leq \cdots \leq dist(p_{Z_{K}}, q_{L_{K}}) \leq dist(p, q)$.
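To make Definition 1 concrete, the following is a minimal brute-force sketch in Python that evaluates every pair in P×Q and keeps the K with the smallest Euclidean distance. It is only a reference baseline for small in-memory inputs, not one of the algorithms proposed in this paper; the function name `kcpq_brute_force` and the representation of points as coordinate tuples are assumptions made for illustration.

```python
import heapq
import math
from itertools import product


def kcpq_brute_force(P, Q, K):
    """Return the K closest pairs as (distance, p, q) triples, nearest first.

    P and Q are iterables of coordinate tuples of equal dimensionality.
    If |P| * |Q| < K, all pairs are returned.
    """
    if K <= 0:
        raise ValueError("K must be a positive integer")
    # Generate every candidate pair lazily; nsmallest keeps only K of them.
    candidates = ((math.dist(p, q), p, q) for p, q in product(P, Q))
    return heapq.nsmallest(K, candidates)
```

This baseline computes |P|·|Q| distances, which is exactly the cost that plane-sweep techniques aim to reduce. When several pairs are tied at the K-th distance, any of them satisfies the definition; this sketch breaks such ties by tuple order.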

Definition 2

(εDistance Join Query, εDJQ) Let $P = \{p_{0}, p_{1}, \cdots , p_{n-1}\}$ and $Q = \{q_{0}, q_{1}, \cdots , q_{m-1}\}$ be two sets of points in E^d, and a range of distances defined by $[\varepsilon _{1}, \varepsilon _{2}]$ such that $\varepsilon _{1}, \varepsilon _{2} \in \mathbb {R}^{+}$ and $\varepsilon _{1} \leq \varepsilon _{2}$. The εDistance Join Query (εDJQ) of P and Q ($\varepsilon \textit {DJQ}(P, Q, \varepsilon _{1}, \varepsilon _{2}) \subseteq P \times Q$) is the set which contains all the possible pairs of points $(p_{i}, q_{j})$ that can be formed by choosing one point $p_{i} \in P$ and one point $q_{j} \in Q$, having a distance between $\varepsilon _{1}$ and $\varepsilon _{2}$ from each other: $\varepsilon \textit {DJQ}(P, Q, \varepsilon _{1}, \varepsilon _{2}) = \{(p_{i}, q_{j}) \in P \times Q : \varepsilon _{1} \leq dist(p_{i}, q_{j}) \leq \varepsilon _{2}\}$.
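Definition 2 can likewise be written as a direct filter over P×Q. The sketch below is a brute-force baseline for small in-memory inputs, given only to fix the semantics of the query; the name `edjq_brute_force` and the tuple representation of points are illustrative assumptions.

```python
import math


def edjq_brute_force(P, Q, eps1, eps2):
    """Return all pairs (p, q) with eps1 <= dist(p, q) <= eps2.

    P and Q are sequences of coordinate tuples of equal dimensionality.
    """
    if not 0 <= eps1 <= eps2:
        raise ValueError("expected 0 <= eps1 <= eps2")
    return [(p, q) for p in P for q in Q
            if eps1 <= math.dist(p, q) <= eps2]
```

Unlike the KCPQ, the size of the result is not known in advance: it depends on the distance range and on the data distribution.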

These two DJQs have been actively studied in the context of R-trees [5, 6, 10, 16]. However, when the data sets are not indexed, they have not attracted similar attention.

2.2 Related work

This section presents a classification of the spatial join and distance-based join queries depending on whether one, both or neither of the inputs are indexed. Moreover, other related DJQs from the recent literature are also reviewed, in order to show the importance of this type of query in the context of spatial databases.

2.2.1 Spatial join

Spatial data processing is well known to be both data and computing intensive. The spatial join is one of the most studied spatial queries: given two data sets of spatial objects in Euclidean space, it finds all pairs of spatial objects satisfying a given spatial predicate, such as intersects, contains, etc. [4]. Various techniques, such as minimizing disk I/O overheads in spatial indexing and the two-phase filter-refinement strategy in spatial joins, have been proposed in [9]. During the past decades, many algorithms for spatial joins where the data sets reside on disk have been proposed in the literature [9, 17, 18] and, recently, several contributions in the context of in-memory spatial joins have been proposed. In [19], the authors developed TOUCH, a novel in-memory spatial join algorithm inspired by previous work on disk-based approaches and by the requirements of computational neuroscientists. It combines hierarchical data-oriented partitioning, batch processing and filtering concepts, with the aim of decreasing the number of comparisons, the execution time and the memory footprint of a spatial join process. In [20], a thorough experimental performance study of ten spatial join techniques in main memory is reported. The techniques are first optimized for in-memory performance and then studied in the same framework. This study suggests that specialized join strategies over simple index structures, such as Synchronous Traversal over R-trees, should be the methods of choice for the considered cases. In [21], the authors re-implement the worst performing technique presented in [20] without changing the underlying high-level algorithm, and conclude that the resulting re-implementation is capable of outperforming all the other techniques. This means that substantial performance gains can be achieved by means of careful implementation. Finally, in [22] a thorough review of a wide range of in-memory data management and processing proposals and systems is presented, including both data storage systems and data processing frameworks. The authors give a comprehensive presentation of important technology in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing. In this paper, we focus on disk-resident data; new algorithms for in-memory DJQs are a task for further research.

The spatial join is one of the most related and influential spatial queries with respect to DJQs in spatial databases and GIS. Depending on the existence of indexes or not, different spatial join algorithms have been proposed [23]. If both inputs are indexed, several contributions have been proposed, but the most influential one is the R-tree join algorithm (RJ) [24], due to its efficiency and the popularity of R-trees [11, 25]. RJ synchronously traverses both trees in a Depth-First order. Two optimization techniques were also proposed, search space restriction and plane-sweep, to improve the CPU speed and to reduce the cost of computing overlapping pairs between the nodes to be joined, respectively.

Most research after RJ focused on spatial join processing when one or both inputs are non-indexed. In this category, the paper that is most closely related to our work is [14], where several spatial join strategies for the case when only one input data set is indexed are investigated. The main contribution is a method that modifies the plane-sweep algorithm. This approach reads the data pages from the index in a one-dimensional sorted order and inserts entire data pages into the sweep structure (i.e. in this case, one sweep structure will contain objects, while another sweep structure will contain data pages).

Directly related to this paper, when both data sets are non-indexed, are methods that involve sorting and external memory plane-sweep [12, 13], or spatial hash join algorithms [26], like the partition based spatial merge join [27]. In [13] the Scalable Sweeping-Based Spatial Join (SSSJ) was proposed, which employs a combination of plane-sweep and space partitioning to join the data sets, and it works under the assumption that in most cases the limit of the sweepline will fit in main memory. In [27] a hash-join algorithm was presented, the so-called Partition Based Spatial Merge Join, that regularly partitions the space using a rectangular grid and hashes both input data sets into the partitions. It then joins groups of partitions that cover the same area using plane-sweep to produce the join results. Some objects from both sets may be assigned to more than one partition, so the algorithm needs to sort the results in order to remove the duplicate pairs. Finally, [12] extends the SSSJ of [13] to process data sets of any size by using external memory, proposing a new join algorithm referred to as iterative spatial join.

2.2.2 KCPQ and εDJQ

The problem of closest pairs has received significant research attention from the computational geometry community (see [28] for an exhaustive survey) for the case when all data are stored in main memory. However, when the amount of data is too large (e.g. when we are working with spatial databases) it is not possible to maintain these data structures in main memory, and it is necessary to store the data on disk. Here, we review the KCPQ and εDJQ, focusing on whether the input data sets are indexed or not. We must emphasize that most of the contributions that have been published until now focus on the case when both data sets are indexed by R-trees.

Recall that, given two spatial data sets P and Q, the KCPQ asks for the K closest pairs of spatial objects in P×Q. If both P and Q are indexed by R-trees, the concept of synchronous tree traversal and Depth-First (DF) or Best-First (BF) traversal order can be combined for the query processing [5, 6, 16]. For a more detailed explanation of the processing of the KCPQ-DF and KCPQ-BF algorithms on two R*-trees from the non-incremental point of view, see [6, 15]. In [16], incremental and non-recursive algorithms based on Best-First traversal using R-trees and additional priority queues for DJQs were presented. In [10], additional techniques, such as sorting and the application of plane-sweep during the expansion of node pairs, and the use of an estimate of the distance of the K-th closest pair to suspend unnecessary computations of MBR distances, are included to improve [16]. A Recursive Best-First Search (RBF) algorithm for DBQs between spatial objects indexed in R-trees was presented in [29], with an exhaustive experimental study that compares DF, BF and RBF for several distance-based queries (Range Distance, K-Nearest Neighbors, K-Closest Pairs and Range Distance Join). Recently, in [30], an extensive experimental study comparing the R*-tree and Quadtree-like index structures for K-Nearest Neighbors and K-Distance Join queries, together with index construction methods (dynamic insertion and bulk-loading algorithm), was presented. It was shown that when data are static the R*-tree shows the best performance. However, when data are dynamic, a bucket Quadtree begins to outperform the R*-tree. This is because, once the dynamic R*-tree algorithm is used, the overlap among MBRs increases with increasing data set sizes, and the R*-tree performance degrades.

For the case where only one data set is indexed, a new algorithm for KCPQs has recently been proposed in [31]. The main idea is to partition the space occupied by the data set without an index into several cells or subspaces (according to the VA-File structure [32]) and to make use of the properties of a set of distance functions defined between two MBRs [6].

To the best of the authors' knowledge, there are no papers in the spatial database literature that have addressed the problem of DJQs when both data sets are non-indexed, and this is the main motivation of this research work.

εDJQ, also known as Range Distance Join, is a generalization of the Buffer Query, which is characterized by two spatial data sets and a distance threshold ε, and which searches for pairs of spatial objects from the two input data sets that are within distance ε from each other. In our case, the distance threshold is a range of distances defined by an interval of distance values [ε₁, ε₂] (e.g. if ε₁ = 0 and ε₂ > 0, then we have the definition of the Buffer Query, and if ε₁ = ε₂ = 0, then we have the spatial intersection join, which retrieves all different intersecting spatial object pairs from two distinct spatial data sets [9]). This query is also related to the similarity join in multidimensional databases [33], where the problem of deciding if two objects are similar is reduced to the problem of determining if two multidimensional points are within a certain distance of each other. In [34], the Buffer Query is solved for non-point (lines and regions) spatial data sets using R-trees, where efficient algorithms for computing the minimum distance for lines and regions, pruning techniques for filtering in a Depth-First search algorithm (performance comparisons with other search algorithms are not included), and extensive experimental results are presented. We must emphasize that there are no contributions in the spatial database literature for εDJQ when one or both inputs are non-indexed.

2.2.3 Other related distance-based join queries

Several DJQs related to KCPQ and εDJQ have been studied in the literature. In [35] a new index structure, called bRdnn-Tree, is proposed to solve different distance-based join queries. Other variants of KCPQ have also been studied in the context of spatial databases. More specifically, approximate K closest pairs in high dimensional data [36, 37] and constrained K closest pairs [38] have been presented. In [39] the exclusive closest pairs problem is introduced (which is a spatial assignment problem) and several solutions that solve it in main memory are proposed, exploiting space partitioning. In [40] a unified approach that supports a broad class of top-K pairs queries (i.e. K-closest pairs queries, K-furthest pairs queries, etc.) is presented. Recently, in [41] an external-memory algorithm, called ExactMaxRS, for the maximizing range sum (MaxRS) problem was proposed. The basic processing scheme of ExactMaxRS follows the distribution sweep paradigm, which was introduced as an external version of the plane-sweep algorithm. Moreover, another related problem, the maximizing circular range sum (MaxCRS), is also studied and an approximation algorithm is presented, which uses the ExactMaxRS algorithm.

Other complex DJQs using R-trees have been studied in the spatial database literature, such as the Iceberg Distance Join [42] and K Nearest Neighbors Join [43] queries; closely related to DJQ processing is also the All-Nearest-Neighbor (ANN) query [44]. For a more detailed review of this classification, see [15].

3 Plane-sweep in distance-based join queries

An important improvement for join queries is the use of the plane-sweep technique, which is a common technique for computing intersections [7]. The plane-sweep technique is applied in [8] to find the closest pair in a set of points which resides in main memory. The basic idea, in the context of spatial databases, is to move a line, the so-called sweepline, perpendicular to one of the axes, e.g. the X-axis, from left to right, and to process objects (points or MBRs) as they are reached by the sweepline. We can apply this technique to restrict the combinations of pairs of objects from the two data sets that must be examined. If we do not use this technique, then we must check all possible combinations of pairs of objects from the two data sets and process them. That is, using the plane-sweep technique instead of the brute-force nested loop algorithm has been proven to reduce the CPU cost (e.g. for intersection joins [12, 13, 24] and KCPQ [6, 10]).

3.1 Classic plane-sweep algorithm

In general, let’s assume that the spatial objects are points. The data sets are $\mathcal {P}$ and $\mathcal {Q}$ and they can be organized as arrays. Let’s also consider a distance threshold δ, which is the distance of the K-th pair found so far for the KCPQ (the initial value of δ is $\infty $), or the constant given maximum distance for the εDJQ. The Classic Plane-Sweep (CPS) algorithm consists of the following steps [1, 15]:

1. It sorts the entries of the two arrays of points, based on the coordinates of one of the axes (e.g. the X-axis), in increasing order.
2. After that, two pointers, one per sorted array, are maintained, each initially pointing to the first entry to be processed. The reference point is the point with the smaller X-value among the two entries currently pointed to; e.g. if that entry belongs to $\mathcal {P}$, then the reference point is p.
3. Afterwards, the reference point is paired up with the points stored in the other sorted array of points (called comparison points, $q \in \mathcal {Q}$) from left to right, as long as $dx \equiv q.x - p.x < \delta $ holds, processing all such comparison points as candidate pairs while the reference point is fixed. After all possible pairs of entries that contain the reference point have been paired up (i.e. the forward lookup stops when $dx \equiv q.x - p.x \geq \delta $ is verified), the pointer of the reference array is advanced to the next entry, the reference point is updated with the point of the next smallest X-value pointed to by one of the two pointers, and the process is repeated until one of the sorted arrays of points is completely processed.

Note that the Classic Plane-Sweep algorithm applies the distance function over the sweeping axis only (in this case the X-axis, giving dx), because in the plane-sweep technique the sweep is over one axis only. Moreover, the search is restricted to the closest points with respect to the reference point according to the current distance threshold (δ). No duplicate pairs are obtained, since the points are always checked over sorted arrays.
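The three steps of the Classic Plane-Sweep algorithm can be sketched in Python for the KCPQ on 2-d points held in main memory. This is a minimal illustration under stated assumptions, not the authors' implementation: points are (x, y) tuples, the K best pairs are kept in a max-heap whose root gives δ, and the names `kcpq_classic_plane_sweep` and `consider` are introduced here for illustration.

```python
import heapq
import math


def kcpq_classic_plane_sweep(P, Q, K):
    """K closest pairs of 2-d points, as (distance, p, q) triples, nearest first."""
    P, Q = sorted(P), sorted(Q)        # step 1: sort both arrays on the X-axis
    best = []                          # max-heap via negated distances
    delta = math.inf                   # distance of the K-th pair found so far

    def consider(p, q):
        nonlocal delta
        d = math.dist(p, q)
        if len(best) < K:
            heapq.heappush(best, (-d, p, q))
        elif d < delta:
            heapq.heapreplace(best, (-d, p, q))   # evict the current K-th pair
        if len(best) == K:
            delta = -best[0][0]

    i = j = 0                          # step 2: one pointer per sorted array
    while i < len(P) and j < len(Q):
        if P[i][0] <= Q[j][0]:         # reference point taken from P
            ref, k = P[i], j
            while k < len(Q) and Q[k][0] - ref[0] < delta:   # step 3: dx < delta
                consider(ref, Q[k])
                k += 1
            i += 1
        else:                          # reference point taken from Q
            ref, k = Q[j], i
            while k < len(P) and P[k][0] - ref[0] < delta:
                consider(P[k], ref)
                k += 1
            j += 1
    return sorted((-nd, p, q) for nd, p, q in best)
```

Each pair is generated at most once, when its left-most point is the reference point, which matches the remark that no duplicate pairs are obtained. As δ shrinks, the forward lookup stops earlier, so fewer full distance computations are needed than in a nested loop over all pairs.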

Clearly, the application of this technique can be viewed as a sliding vertical strip on the sweeping axis with a width equal to the δ value, starting from the reference point (i.e. [0, δ] on the X-axis, relative to the reference point), where we consider only the pairs of points that can be formed using the reference point and the comparison points that fall into the current vertical strip (see Fig. 1). This figure shows the points of the data set $\mathcal {P}$ marked with filled circles and the points of the data set $\mathcal {Q}$ marked with empty circles. Their coordinates are shown in Tables 1 and 2. Note that the ticks on the axes are placed every two units of length in both dimensions.
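The sliding strip can also be illustrated numerically. The sketch below locates, by binary search over the X-sorted comparison array, the comparison points whose dx with respect to the reference point satisfies 0 ≤ dx < δ. The coordinates used in the usage note are hypothetical and are not the values of Tables 1 and 2; the function name `strip_candidates` is likewise an assumption.

```python
from bisect import bisect_left


def strip_candidates(ref_x, xs_sorted, delta):
    """Return (lo, hi) such that xs_sorted[lo:hi] lie in [ref_x, ref_x + delta).

    xs_sorted holds the X-coordinates of the comparison points in
    increasing order.
    """
    lo = bisect_left(xs_sorted, ref_x)            # first x with x >= ref_x
    hi = bisect_left(xs_sorted, ref_x + delta)    # first x with x >= ref_x + delta
    return lo, hi
```

For example, with hypothetical X-coordinates [1, 3, 4, 6, 9], a reference point at x = 3 and δ = 3, the strip contains the comparison points at x = 3 and x = 4, while the point at x = 6 is excluded because its dx equals δ. With floating-point coordinates, `ref_x + delta` may round differently from `x - ref_x`, so borderline points should be handled with care.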

Table 1 The data set $\mathcal {P}$ with 16 points in X-sorted order
