1 Introduction

The closest-pair problem is one of the most fundamental problems in computational geometry and finds many applications, e.g., collision detection, similarity search, traffic control, etc. In this paper, we study a range-search version of the closest-pair problem called the range closest-pair (RCP) problem. Let \({\mathcal {X}}\) be a certain collection of ranges called query space. The RCP problem with query space \({\mathcal {X}}\) (or the \({\mathcal {X}}\)-RCP problem for short) aims to preprocess a given dataset S of points into a low-space data structure such that when a query range \(X \in {\mathcal {X}}\) is specified, the closest-pair in \(S \cap X\) can be reported efficiently. The motivation for the RCP problem is clear and similar to that of range search: in many situations, one is interested in local information (i.e., local closest-pairs) inside specified ranges rather than global information (i.e., global closest-pair) of the dataset.

The RCP problem is quite challenging for two reasons. First, the objects of interest in the RCP problem are point-pairs instead of single points, and a dataset contains a quadratic number of point-pairs to be dealt with. Moreover, the RCP problem is non-decomposable in the sense that even if the query range \(X \in {\mathcal {X}}\) can be written as \(X=X_1 \cup X_2\), the closest-pair in \(S \cap X\) cannot be computed efficiently from the closest-pairs in \(S \cap X_1\) and \(S \cap X_2\). This non-decomposability makes many traditional range-search techniques inapplicable to the RCP problem, and thus makes the problem considerably more challenging.

The RCP problem in \({\mathbb {R}}^2\) has been studied in prior work over the last fifteen years, e.g., [1, 5, 6, 8, 9, 12, 13]. In this paper, we revisit this problem and make significant improvements to the existing solutions. Following the existing work, the query types considered in this paper are orthogonal ranges (specifically, quadrants, strips, rectangles) and halfplanes.

1.1 Our Contributions, Techniques, and Related Work

The closest-pair problem and range search are both classical topics in computational geometry; see [2, 10] for references. The RCP problem is relatively new. The best existing bounds in \({\mathbb {R}}^2\) and our new results are summarized in Table 1 (\(\mathsf {Space}\) refers to space cost and \(\mathsf {Qtime}\) refers to query time), and we give a brief explanation below.

Table 1 Summary of the best existing bounds and our new results for the RCP problem in \({\mathbb {R}}^2\) (each row corresponds to an RCP data structure for the corresponding query space)
  • Related Work The RCP problem for orthogonal queries was studied in [1, 5, 6, 8, 9, 12, 13], where [13] is a preliminary version of this paper. The best known solution for quadrant queries was given by [6], while [9] gave the best known solution for strip queries. For (orthogonal) rectangle queries, there are two best known solutions (in terms of worst-case bounds) given by [6] and [9] respectively. The above results considered only the worst-case performance of the data structures. The authors of [6] were the first to apply average-case analysis to RCP data structures, in the model where the data points are drawn independently and uniformly from the unit square. Unfortunately, they only gave a rectangle RCP data structure with low average-case preprocessing time, while its average-case space cost and query time are even higher than the worst-case counterparts of the data structure given by [9] (in fact, its worst-case space cost is super-quadratic). Indeed, in terms of space cost and query time, no nontrivial average-case bounds were known for any kind of query before this paper. The RCP problem for halfplane queries was studied in [1], where two data structures were proposed. The second, while having higher space cost and query time than the first, can be built more efficiently (in \(O(n \log ^2\! n)\) time versus \(O(n^2 \log ^2\! n)\) time). Both data structures require (worst-case) super-linear space cost and polynomial query time. The paper [12] studied an approximate version of the RCP problem in which the answer pair returned is allowed to be slightly outside the query range; under this approximation model, efficient data structures for disk and ball queries were proposed.

  • Our Contributions In this paper, we improve all the above results by giving new RCP data structures for various query types. The improvements can be seen in Table 1. In terms of worst-case bounds, the highlights are our rectangle RCP data structure which simultaneously improves the two best known results (given by [6] and [9]) and our halfplane RCP data structure which is optimal and significantly improves the bounds in [1]. Furthermore, by applying average-case analysis to our new data structures, we establish the first nontrivial average-case bounds for all the query types studied. Our average-case analysis applies to datasets generated not only in the unit square but also in an arbitrary axes-parallel rectangle. These average-case bounds demonstrate that our new data structures might have much better performance in practice than one can expect from the worst-case bounds. In addition, all of our new RCP data structures presented in Table 1 can be constructed in near-linear worst-case time. Specifically, the quadrant, strip, and halfplane RCP data structures can be built in \(O(n \log ^2 \!n)\) time, and the rectangle RCP data structure can be built in \(O(n \log ^7\! n)\) time.

  • Our Techniques An important notion in our techniques is that of a candidate pair, i.e., a pair of data points that is the answer to some RCP query. Our solutions for the quadrant and strip RCP problems use the candidate pairs to construct a planar subdivision and take advantage of point-location techniques to answer queries. The data structures themselves are simple; our main technical contribution here lies in the average-case analysis of the data structures, which requires a nontrivial study of the expected number of candidate pairs in a random dataset and is of both geometric and combinatorial interest. Our data structure for the rectangle RCP problem is more complicated; it is constructed by combining two simpler data structures, each of which partially achieves the desired bounds. The high-level framework of the two simpler data structures is identical: each first “decomposes” a rectangle query into four quadrant queries and then simplifies the problem via geometric observations similar to those in the standard divide-and-conquer algorithm for the classical closest-pair problem. The analysis of these data structures is also technically interesting. Our solution for the halfplane RCP problem applies the duality technique to map the candidate pairs to wedges in the dual space and form a planar subdivision, which allows us to solve the problem by using point-location techniques on the subdivision, similarly to the approach for the quadrant and strip RCP problems. However, unlike the quadrant and strip cases, bounding the complexity of the subdivision is much more challenging here and requires non-obvious observations that exploit properties of duality and of the problem itself. The average-case bounds of the data structure follow from a technical result bounding the expected number of candidate pairs.

  • Organization Section 1.2 presents the notations and preliminaries that are used throughout the paper. Our solutions for quadrant, strip, rectangle, and halfplane queries are presented in Sects. 2, 3, 4, and 5, respectively. In Sect. 6, we conclude our results and give some open questions for future work. To make the paper more readable and preserve a high-level view of the main results, some technical proofs are deferred to Appendix A.

1.2 Notations and Preliminaries

We introduce the notations and preliminaries that are used throughout the paper.

  • Query Spaces The following notations denote various query spaces (i.e., collections of ranges in \({\mathbb {R}}^2\)): \({\mathcal {Q}}\) for quadrants, \({\mathcal {P}}\) for strips, \({\mathcal {U}}\) for 3-sided rectangles, \({\mathcal {R}}\) for rectangles, and \({\mathcal {H}}\) for halfplanes (the quadrants, strips, 3-sided rectangles, and rectangles under consideration are all axes-parallel). Define \({\mathcal {Q}}^\nearrow = \{[x,\infty ) \times [y,\infty ): x,y \in {\mathbb {R}}\} \subseteq {\mathcal {Q}}\) as the sub-collection of all northeast quadrants, and define \({\mathcal {Q}}^\nwarrow ,{\mathcal {Q}}^\searrow ,{\mathcal {Q}}^\swarrow \) similarly. Define \({\mathcal {P}}^\text {v} = \{[x_1,x_2] \times {\mathbb {R}}: x_1,x_2 \in {\mathbb {R}}\} \subseteq {\mathcal {P}}\) as the sub-collection of all vertical strips, and similarly \({\mathcal {P}}^\text {h}\) for horizontal strips. If l is a vertical (resp., horizontal) line, an l-anchored strip is a vertical (resp., horizontal) strip containing l; define \({\mathcal {P}}_l \subseteq {\mathcal {P}}\) as the sub-collection of all l-anchored strips. Define \({\mathcal {U}}^\downarrow = \{[x_1,x_2] \times (-\infty ,y]: x_1,x_2,y \in {\mathbb {R}}\} \subseteq {\mathcal {U}}\) as the sub-collection of all bottom-unbounded rectangles, and define \({\mathcal {U}}^\uparrow ,{\mathcal {U}}^\leftarrow ,{\mathcal {U}}^\rightarrow \) similarly. If l is a non-vertical line, denote by \(l^\uparrow \) (resp., \(l^\downarrow \)) the halfplane above (resp., below) l; define \({\mathcal {H}}^\uparrow \subseteq {\mathcal {H}}\) (resp., \({\mathcal {H}}^\downarrow \subseteq {\mathcal {H}}\)) as the sub-collection of all such halfplanes.

  • Candidate Pairs For a dataset S and query space \({\mathcal {X}}\), a candidate pair of S with respect to \({\mathcal {X}}\) refers to a pair of points in S which is the closest-pair in \(S \cap X\) for some \(X \in {\mathcal {X}}\). We denote by \(\varPhi (S,{\mathcal {X}})\) the set of the candidate pairs of S with respect to \({\mathcal {X}}\). If l is a line, we define \(\varPhi _l(S,{\mathcal {X}}) \subseteq \varPhi (S,{\mathcal {X}})\) as the subset consisting of the candidate pairs that cross l (i.e., whose two points are on opposite sides of l).

  • Data Structures For a data structure \({\mathcal {D}}\), we denote by \({\mathcal {D}}(S)\) the data structure instance of \({\mathcal {D}}\) built on the dataset S. The notations \(\mathsf {Space}({\mathcal {D}}(S))\) and \(\mathsf {Qtime}({\mathcal {D}}(S))\) denote the space cost and query time (i.e., the maximum time for answering a query) of \({\mathcal {D}}(S)\), respectively.

  • Random Datasets If X is a region in \({\mathbb {R}}^2\) (or more generally in \({\mathbb {R}}^d\)), we write \(S \propto X^n\) to mean that S is a dataset of n random points drawn independently from the uniform distribution \(\mathsf {Uni}(X)\) on X. More generally, if \(X_1,\dots ,X_n\) are regions in \({\mathbb {R}}^2\) (or more generally in \({\mathbb {R}}^d\)), we write \(S \propto \prod _{i=1}^n X_i\) to mean that S is a dataset of n random points drawn independently from \(\mathsf {Uni}(X_1),\dots ,\mathsf {Uni}(X_n)\) respectively.

  • Other Notions For a point \(a \in {\mathbb {R}}^2\), we denote by a.x and a.y the x-coordinate and y-coordinate of a, respectively. For two points \(a,b\in {\mathbb {R}}^d\), we use \({\text {dist}}(a,b)\) to denote the Euclidean distance between a and b, and use [a, b] to denote the segment connecting a and b (in \({\mathbb {R}}^1\) this coincides with the notation for a closed interval). We say \(I_1,\dots ,I_n\) are vertical (resp., horizontal) aligned segments in \({\mathbb {R}}^2\) if there exist \(r_1,\dots ,r_n,\alpha ,\beta \in {\mathbb {R}}\) such that \(I_i = \{r_i\} \times [\alpha ,\beta ]\) (resp., \(I_i=[\alpha ,\beta ] \times \{r_i\}\)). The length of a pair \(\phi = (a,b)\) of points is the length of the segment [a, b]. For \(S \subseteq {\mathbb {R}}^2\) of size at least 2, the notation \(\kappa (S)\) denotes the closest-pair distance of S, i.e., the length of the closest-pair in S.

The following result regarding the closest-pair distance of a random dataset will be used to bound the expected number of candidate pairs with respect to various query spaces. The proof of this result (and also of several other results in the remaining sections) can be found in Appendix A.

Lemma 1.1

Let R be a rectangle of size \(\varDelta \times \varDelta '\) where \(\varDelta \le \varDelta '\), and \(A \propto R^m\). Then

$$\begin{aligned}{\mathbb {E}}[\kappa ^p(A)] = \varTheta \bigl (\max \bigl \{(\varDelta '/m^2)^p, (\sqrt{\varDelta \varDelta '}/m)^p \bigr \}\bigr )\quad \text {for any constant } p>1.\end{aligned}$$

In particular, if R is a segment of length \(\ell \), then \({\mathbb {E}}[\kappa ^p(A)] = \varTheta ((\ell /m^2)^p)\).
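For instance, for the unit square (\(\varDelta = \varDelta ' = 1\)) the second term attains the maximum, so the lemma specializes to
$$\begin{aligned}{\mathbb {E}}[\kappa ^p(A)] = \varTheta (m^{-p}),\end{aligned}$$
i.e., the closest-pair distance of m uniform points in the unit square is of order 1/m in the sense of p-th moments.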

2 Quadrant Query

We consider the RCP problem for quadrant queries, i.e., the \({\mathcal {Q}}\)-RCP problem. In order to solve the \({\mathcal {Q}}\)-RCP problem, it suffices to consider the \({\mathcal {Q}}^\swarrow \)-RCP problem. Let \(S \subseteq {\mathbb {R}}^2\) be a dataset of size n. Suppose \(\varPhi (S,{\mathcal {Q}}^\swarrow ) = \{\phi _1,\dots ,\phi _m\}\) where \(\phi _i = (a_i,b_i)\), and assume \(\phi _1,\dots ,\phi _m\) are sorted in increasing order of their lengths. It was shown in [6] that \(m = O(n)\). We construct a mapping \(\varPhi (S,{\mathcal {Q}}^\swarrow ) \rightarrow {\mathbb {R}}^2\) as \(\phi _i \mapsto w_i\) where \(w_i = (\max \{a_i.x, b_i.x\},\max \{a_i.y, b_i.y\})\), and observe that for a query range \(Q \in {\mathcal {Q}}^\swarrow \), \(\phi _i\) is contained in Q iff \(w_i \in Q\). Let \(W_i\) be the northeast quadrant with vertex \(w_i\). Then we further have \(w_i \in Q\) iff \(q \in W_i\) where q is the vertex of Q. As such, the closest-pair in \(S \cap Q\) to be reported is \(\phi _\eta \) for \(\eta = \min {\{i: q \in W_i\}}\).
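To make the reduction concrete, the following Python sketch answers a southwest-quadrant query by a linear scan over the candidate pairs; the data structure described next replaces this scan with planar point location. The function name and input format are illustrative, and the candidate pairs (sorted by increasing length) are assumed to be given.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Pair = Tuple[Point, Point]

def sw_quadrant_rcp_naive(candidate_pairs: List[Pair], q: Point) -> Optional[Pair]:
    """Answer a southwest-quadrant RCP query by a linear scan.

    candidate_pairs is assumed to be Phi(S, Q^SW), already sorted by
    increasing length; q = (qx, qy) is the vertex of the query quadrant
    (-inf, qx] x (-inf, qy].  The pair phi_i lies in the query quadrant
    iff its mapped point w_i = (max x, max y) satisfies w_i <= q
    componentwise, i.e., iff q lies in the northeast quadrant W_i.
    The first such pair found in the scan is the closest pair in S ∩ Q.
    """
    qx, qy = q
    for a, b in candidate_pairs:                   # increasing length
        wx, wy = max(a[0], b[0]), max(a[1], b[1])  # the point w_i
        if wx <= qx and wy <= qy:                  # q ∈ W_i
            return a, b
    return None
```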

We create a planar subdivision \(\varGamma \) by successively overlaying \(W_1,\dots ,W_m\) (see Fig. 1). Note that the complexity of \(\varGamma \) is O(m), since overlaying each quadrant creates at most two vertices of \(\varGamma \). By the above observation, the answer for Q is \(\phi _i\) iff q is in the cell \(W_i \setminus \bigcup _{j=1}^{i-1} W_j\). Thus, we can use an optimal planar point-location data structure (e.g., [4, 7]) to solve the problem in O(m) space with \(O(\log m)\) query time. Since \(m = O(n)\), we obtain a \({\mathcal {Q}}\)-RCP data structure using O(n) space with \(O(\log n)\) query time in the worst case.

Fig. 1 A subdivision induced by successively overlaying the quadrants

Next, we analyze the average-case performance of the above data structure. In fact, it suffices to bound the expected number of the candidate pairs. Surprisingly, we have the following poly-logarithmic bound.

Lemma 2.1

For a random dataset \(S \propto R^n\) where R is an axes-parallel rectangle, \({\mathbb {E}}[|\varPhi (S,{\mathcal {Q}})|] = O(\log ^2 \!n)\).

Using the above lemma, we can immediately conclude that our data structure uses \(O(\log ^2 \!n)\) space in the average case. The average-case query time is in fact \(O({\mathbb {E}}[{\log |\varPhi (S,{\mathcal {Q}})|}])\). By Jensen's inequality, \({\mathbb {E}}[{\log x}] \le \log {\mathbb {E}}[x]\) for a positive random variable x, and thus \({\mathbb {E}}[{\log |\varPhi (S,{\mathcal {Q}})|}] \le \log {\mathbb {E}}[|\varPhi (S,{\mathcal {Q}})|] = O(\log \log n)\).

Finally, we consider how to build the above data structure efficiently. To this end, we need an efficient algorithm to overlay quadrants. Let \(W_1,\dots ,W_m\) be m northeast quadrants; our goal is to overlay them to obtain the subdivision \(\varGamma \) described above. (We can store \(\varGamma \) as a DCEL structure [3], for example.) We process \(W_1,\dots ,W_m\) in this order. When processing \(W_i\), we want to compute the cell \(W_i\setminus \bigcup _{j=1}^{i-1}W_j\). To this end, we maintain the union U of the quadrants that have been processed. Note that U is always a staircase shape and its boundary \(\partial U\) is an orthogonal chain from top-left to bottom-right. Therefore, we can use a (balanced) binary search tree \({\mathcal {T}}\) to maintain \(\partial U\), where the nodes store the segments of \(\partial U\) (which are vertical or horizontal) in the order they appear on \(\partial U\).

To compute the cell \(W_i \setminus \bigcup _{j=1}^{i-1}W_j\), we first find all the segments of \(\partial U\) that intersect \(W_i\), denoted by \(\sigma _1,\dots ,\sigma _k\). Note that \(\sigma _1,\dots ,\sigma _k\) must be consecutive on \(\partial U\), and we can find them by simply searching in the BST \({\mathcal {T}}\) in \(O({\log m+k})\) time. Once \(\sigma _1,\dots ,\sigma _k\) are found, the cell \(W_i \setminus \bigcup _{j=1}^{i-1} W_j\) can be directly computed and stored in the DCEL. After \(W_i\) is processed, we have to update \(\partial U\) in \({\mathcal {T}}\). We delete from \({\mathcal {T}}\) the segments among \(\sigma _1,\dots ,\sigma _k\) that are completely contained in \(W_i\), because they are no longer edges of \(\partial U\). Besides, among \(\sigma _1,\dots ,\sigma _k\), there are (at most) two segments partially contained in \(W_i\); we need to modify them in \({\mathcal {T}}\) as they get changed when we insert \(W_i\). Furthermore, there are (at most) two new segments appearing on \(\partial U\) due to the insertion of \(W_i\) (which correspond to a portion of \(\partial W_i\)), and we need to insert them into \({\mathcal {T}}\). All of the above work can be done in \(O(k \log m)\) time. Since k is at most the number of the edges of the cell \(W_i\setminus \bigcup _{j=1}^{i-1} W_j\) and the complexity of the final subdivision is linear in m, the overall time cost for processing \(W_1,\dots ,W_m\) is \(O(m \log m)\).
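The following Python sketch illustrates this overlay sweep under simplifying assumptions: the staircase is kept in a sorted list rather than a balanced BST (so an update is not logarithmic), no DCEL cells are produced (only a flag recording whether each cell is nonempty), and the quadrant apexes are assumed to be in general position (distinct x- and y-coordinates). All names are illustrative.

```python
import bisect
from typing import List, Tuple

def overlay_ne_quadrants(apexes: List[Tuple[float, float]]) -> List[bool]:
    """Process northeast quadrants W_1, ..., W_m (given by their apexes, in
    the order they are to be overlaid) and report, for each W_i, whether the
    cell W_i \\ (W_1 ∪ ... ∪ W_{i-1}) is nonempty.

    The boundary of the union U is a staircase, represented here by its
    Pareto-minimal apexes kept in a list sorted by x (hence with strictly
    decreasing y).
    """
    xs: List[float] = []          # x-coordinates of the staircase apexes
    ys: List[float] = []          # corresponding y-coordinates
    nonempty: List[bool] = []
    for wx, wy in apexes:
        pos = bisect.bisect_right(xs, wx)
        # W_i ⊆ U iff some staircase apex has x <= wx and y <= wy; by
        # sortedness it suffices to check the rightmost apex with x <= wx.
        if pos > 0 and ys[pos - 1] <= wy:
            nonempty.append(False)     # w_i already lies in U: empty cell
            continue
        nonempty.append(True)
        # Remove staircase apexes dominated by (wx, wy): x >= wx and y >= wy.
        hi = pos
        while hi < len(xs) and ys[hi] >= wy:
            hi += 1
        del xs[pos:hi]
        del ys[pos:hi]
        xs.insert(pos, wx)
        ys.insert(pos, wy)
    return nonempty
```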

Using the above algorithm, if we already have the set \(\varPhi (S,{\mathcal {Q}}^\swarrow )\) of candidate pairs in hand, then our RCP data structure can be built in \(O(n \log n)\) time. Unfortunately, it is currently not known how to compute the candidate pairs (with respect to quadrants) efficiently. To handle this issue, we use a result by Abam et al. [1]. Before describing this result, we need to introduce the notion of spanners and local spanners. Let \(S \subseteq {\mathbb {R}}^2\) be a dataset and \(t \ge 1\) be a number. A geometric t-spanner (or t-spanner for short) of S is a set \(\varPsi \) of pairs of points in S such that if we regard the points in S as vertices and the pairs in \(\varPsi \) as edges (where the weight of each edge is equal to the Euclidean distance between the pair of points), the resulting graph \(G_\varPsi \) satisfies \(d_{G_\varPsi }(a,b) \le t{\text {dist}}(a,b)\) for all \(a,b \in S\), where \(d_{G_\varPsi }(a,b)\) denotes the shortest-path distance between a and b in \(G_\varPsi \). Let \({\mathcal {X}}\) be a query space consisting of ranges in \({\mathbb {R}}^2\). An \({\mathcal {X}}\)-local t-spanner of S is a set \(\varPsi \) of pairs of points in S such that for every \(X \in {\mathcal {X}}\), \(\varPsi _X\) is a t-spanner of \(S \cap X\), where \(\varPsi _X \subseteq \varPsi \) consists of the pairs in \(\varPsi \) whose two points are both inside X. Note that if \(t < 2\), then a t-spanner \(\varPsi \) of S must contain the closest pair of S (otherwise, any path in \(G_\varPsi \) between the two points of the closest pair would use at least two edges, each of length at least the closest-pair distance, contradicting the spanning ratio \(t<2\)). Therefore, if \(t < 2\), then an \({\mathcal {X}}\)-local t-spanner \(\varPsi \) of S must contain all candidate pairs of S with respect to \({\mathcal {X}}\), i.e., \(\varPhi (S,{\mathcal {X}}) \subseteq \varPsi \).

Exploiting the so-called semi-separated pair decomposition (SSPD), Abam et al. [1] showed that, given a dataset \(S \subseteq {\mathbb {R}}^2\) of size n, one can compute in \(O(n \log ^2\! n)\) time a \({\mathcal {Q}}^\swarrow \)-local t-spanner \(\varPsi \) of S for some \(t<2\) such that \(|\varPsi | = O(n \log n)\). As argued above, we have \(\varPhi (S,{\mathcal {Q}}^\swarrow ) \subseteq \varPsi \). To build our RCP data structure, instead of computing the set \(\varPhi (S,{\mathcal {Q}}^\swarrow )\) of candidate pairs, we directly use the pairs in \(\varPsi \) to create a planar subdivision described above.

Formally, let \(\varPsi = \{\phi _1,\dots ,\phi _M\}\) where \(\phi _1,\dots ,\phi _M\) are sorted in increasing order of their lengths, and let \(W_1,\dots ,W_M\) be their corresponding northeast quadrants. We overlay \(W_1,\dots ,W_M\) in order, and let \(\varGamma '\) be the resulting subdivision. We claim that \(\varGamma '\) is the same as the subdivision \(\varGamma \) constructed using the candidate pairs in \(\varPhi (S,{\mathcal {Q}}^\swarrow )\). Since \(\varPhi (S,{\mathcal {Q}}^\swarrow ) \subseteq \varPsi \), for any southwest quadrant \(Q \in {\mathcal {Q}}^\swarrow \), the closest-pair in \(S \cap Q\) is contained in \(\varPsi \), and in fact is the shortest element in \(\varPsi \) that is contained in Q. It follows that the corresponding cell of \(\phi \in \varPsi \) in \(\varGamma '\) is just the subset of \({\mathbb {R}}^2\) consisting of the vertices of all \(Q \in {\mathcal {Q}}^\swarrow \) such that the closest-pair in \(S \cap Q\) is \(\phi \). Therefore, the cell corresponding to any \(\phi \in \varPsi \setminus \varPhi (S,{\mathcal {Q}}^\swarrow )\) in \(\varGamma '\) must be empty, because \(\phi \) is not the closest-pair in any southwest quadrant. Also, the cell corresponding to any \(\phi \in \varPhi (S,{\mathcal {Q}}^\swarrow )\) in \(\varGamma '\) must be the same as the one in \(\varGamma \). So we have \(\varGamma ' = \varGamma \). Note that \(|\varPsi | = O(n \log n)\), thus the time for constructing \(\varGamma '\) is \(O(n \log ^2\! n)\), which is also the preprocessing time of our RCP data structure.

Theorem 2.2

There exists a \({\mathcal {Q}}\)-RCP data structure \({\mathcal {A}}\) such that:

  • For any \(S \subseteq {\mathbb {R}}^2\) of size n,

    $$\begin{aligned} { {\mathsf {Space}}}({\mathcal {A}}(S)) = O(n)\quad \text {and}\quad { {\mathsf {Qtime}}}({\mathcal {A}}(S)) = O(\log n). \end{aligned}$$
  • For a random \(S\propto R^n\) where R is the unit square or more generally an arbitrary axes-parallel rectangle,

    $$\begin{aligned} {\mathbb {E}}[{ {\mathsf {Space}}}({\mathcal {A}}(S))] = O(\log ^2\! n)\quad \text {and}\quad {\mathbb {E}}[{ {\mathsf {Qtime}}}({\mathcal {A}}(S))] = O(\log \log n). \end{aligned}$$

Furthermore, the above data structure can be constructed in \(O(n \log ^2\! n)\) worst-case time.

3 Strip Query

We consider the RCP problem for strip queries, i.e., the \({\mathcal {P}}\)-RCP problem. In order to solve the \({\mathcal {P}}\)-RCP problem, it suffices to consider the \({\mathcal {P}}^\text {v}\)-RCP problem. Let \(S \subseteq {\mathbb {R}}^2\) be a dataset of size n. Suppose \(\varPhi (S,{\mathcal {P}}^\text {v}) = \{\phi _1,\dots ,\phi _m\}\) where \(\phi _i = (a_i,b_i)\), and assume \(\phi _1,\dots ,\phi _m\) are sorted in increasing order of their lengths. It was shown in [9] that \(m = O(n \log n)\). We construct a mapping \(\varPhi (S,{\mathcal {P}}^\text {v}) \rightarrow {\mathbb {R}}^2\) as \(\phi _i \mapsto w_i\) where \(w_i = (\min \{a_i.x, b_i.x\},\max \{a_i.x, b_i.x\})\), and observe that for a query range \(P = [x_1,x_2] \times {\mathbb {R}} \in {\mathcal {P}}^\text {v}\), \(\phi _i\) is contained in P iff \(w_i\) is in the southeast quadrant \([x_1,\infty ) \times (-\infty ,x_2]\). Let \(W_i\) be the northwest quadrant with vertex \(w_i\). Then we further have \(w_i \in [x_1,\infty ) \times (-\infty ,x_2]\) iff \(p \in W_i\) where \(p = (x_1,x_2)\). As such, the closest-pair in \(S \cap P\) is \(\phi _\eta \) for \(\eta = \min {\{i: p \in W_i\}}\). Thus, as in Sect. 2, we can successively overlay \(W_1,\dots ,W_m\) to create a planar subdivision \(\varGamma \), and use point location to solve the problem in O(m) space and \(O(\log m)\) query time. Since \(m = O(n \log n)\) here, we obtain a \({\mathcal {P}}\)-RCP data structure using \(O(n \log n)\) space with \(O(\log n)\) query time in the worst case.
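As in Sect. 2, a minimal Python sketch of the reduction follows, with a linear scan standing in for the point-location structure; the function name and input format are illustrative, and the candidate pairs are assumed to be given sorted by increasing length.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Pair = Tuple[Point, Point]

def vertical_strip_rcp_naive(candidate_pairs: List[Pair],
                             x1: float, x2: float) -> Optional[Pair]:
    """Answer a vertical-strip RCP query [x1, x2] x R by a linear scan.

    candidate_pairs is assumed to be Phi(S, P^v) sorted by increasing
    length.  A pair phi_i lies in the strip iff its mapped point
    w_i = (min x, max x) satisfies x1 <= w_i[0] and w_i[1] <= x2, i.e.,
    iff the point p = (x1, x2) lies in the northwest quadrant W_i.
    """
    for a, b in candidate_pairs:                   # increasing length
        lo, hi = min(a[0], b[0]), max(a[0], b[0])  # the point w_i
        if x1 <= lo and hi <= x2:                  # p ∈ W_i
            return a, b
    return None
```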

Next, we analyze the average-case performance of our data structure. Again, it suffices to bound the expected number of the candidate pairs. For later use, we study here a more general case in which the candidate pairs are considered with respect to 3-sided rectangle queries.

Lemma 3.1

Let \(S \propto \prod _{i=1}^nI_i\) where \(I_1,\dots ,I_n\) are distinct vertical (resp., horizontal) aligned segments sorted from left to right (resp., from bottom to top). Suppose \(a_i \in S\) is the point drawn on \(I_i\). Then for \(i,j \in \{1,\dots ,n\}\) with \(i<j\) and \({\mathcal {X}} \in \{{\mathcal {U}}^\downarrow ,{\mathcal {U}}^\uparrow \}\) (resp., \({\mathcal {X}} \in \{{\mathcal {U}}^\leftarrow ,{\mathcal {U}}^\rightarrow \}\)),

$$\begin{aligned}{\text {Pr}}{[(a_i,a_j) \in \varPhi (S,{\mathcal {X}})]} = O\biggl (\frac{\log { (j-i)}}{(j-i)^2}\biggr ).\end{aligned}$$

From the above lemma, a direct calculation gives us the following corollary.

Corollary 3.2

For a random dataset \(S \propto R^n\) where R is an axes-parallel rectangle, \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}})|] = \varTheta (n)\) and \({\mathbb {E}}[|\varPhi (S,{\mathcal {P}})|] = \varTheta (n)\).

Proof

Without loss of generality, assume \(R = [0,1] \times [0,\varDelta ]\). Since \(\varPhi (S,{\mathcal {P}}) \subseteq \varPhi (S,{\mathcal {U}})\) for any S, it suffices to show \({\mathbb {E}}[|\varPhi (S,{\mathcal {P}})|] = \varOmega (n)\) and \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}})|] = O(n)\). The former is clear, since every pair of x-adjacent or y-adjacent points in S is a candidate pair with respect to \({\mathcal {P}}\). The latter can be shown using Lemma 3.1 as follows. We only need to bound \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|]\), since every 3-sided rectangle is of one of the four types \({\mathcal {U}}^\downarrow ,{\mathcal {U}}^\uparrow ,{\mathcal {U}}^\leftarrow ,{\mathcal {U}}^\rightarrow \) and the four cases are symmetric. We first show that if \(S \propto \prod _{i=1}^n I_i\) where \(I_1,\dots ,I_n\) are vertical aligned segments sorted from left to right, then \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|] = O(n)\). In fact, this follows directly from Lemma 3.1. Let \(a_i\) be the random point drawn on \(I_i\). Then

$$\begin{aligned}{\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|] = \sum _{i=1}^{n-1} \sum _{j=i+1}^n{\text {Pr}}{[(a_i,a_j) \in \varPhi (S,{\mathcal {U}}^\downarrow )]}.\end{aligned}$$

We plug in the bound \({\text {Pr}}{[(a_i,a_j) \in \varPhi (S,{\mathcal {U}}^\downarrow )] }= O(\log (j-i)/(j-i)^2)\) shown in Lemma 3.1 to the above equation. Noting the fact that \(\sum _{t=1}^\infty \log t/t^2 = O(1)\), a direct calculation then gives us \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|] = O(n)\). Now assume \(S \propto R^n\). Define a random multi-set \(X = \{a.x: a \in S\}\), which consists of the x-coordinates of the n random points in S. We shall show that for all \(x_1,\dots ,x_n \in [0,1]\) such that \(x_1<\ldots <x_n\),

$$\begin{aligned} {\mathbb {E}}\bigl [|\varPhi (S,{\mathcal {U}}^\downarrow )|\big |X = \{x_1,\dots ,x_n\}\bigr ] = O(n), \end{aligned}$$
(1)

which implies that \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|] = O(n)\), because the random points in S have distinct x-coordinates with probability 1. Let \(I_i=\{x_i\}\times [0,\varDelta ]\) for \(i\in \{1,\dots ,n\}\), then \(I_1,\dots ,I_n\) are vertical aligned segments sorted from left to right. Note that, under the condition \(X=\{x_1,\dots ,x_n\}\), the n random points in S can be viewed as independently drawn from the uniform distributions on \(I_1,\dots ,I_n\), respectively. Thus, (1) follows directly from our previous argument for the case \(S\propto \prod _{i=1}^n I_i\). As a result, \({\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|] = O(n)\). \(\square \)
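In more detail, the first calculation in the proof above can be written out as follows; to cover the terms with \(j = i+1\) we weaken the bound of Lemma 3.1 from \(\log (j-i)\) to \(\log (j-i+1)\), which only increases it:

$$\begin{aligned}{\mathbb {E}}[|\varPhi (S,{\mathcal {U}}^\downarrow )|] \le \sum _{i=1}^{n-1}\sum _{j=i+1}^{n} O\biggl (\frac{\log (j-i+1)}{(j-i)^2}\biggr ) \le \sum _{i=1}^{n-1} O\biggl (\sum _{t=1}^{\infty }\frac{\log (t+1)}{t^2}\biggr ) = O(n).\end{aligned}$$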

Using the above result and our previous data structure, we immediately conclude that the average-case space cost of our \({\mathcal {P}}\)-RCP data structure is O(n). To build this data structure, we use the same method as in Sect. 2. Abam et al. [1] showed that one can compute in \(O(n \log ^2\! n)\) time a \({\mathcal {P}}^\text {v}\)-local t-spanner \(\varPsi \) of S for some \(t < 2\) such that \(|\varPsi | = O(n \log n)\); see Sect. 2 for the definition of local spanners. As argued in Sect. 2, we have \(\varPhi (S,{\mathcal {P}}^\text {v}) \subseteq \varPsi \). We then overlay the corresponding northwest quadrants of the pairs in \(\varPsi \) to construct a planar subdivision \(\varGamma '\). This can be done in \(O(n \log ^2\! n)\) time using the algorithm in Sect. 2 for overlaying quadrants. Using the same argument as in Sect. 2, we see that \(\varGamma '\) is the same as the subdivision \(\varGamma \) constructed using the candidate pairs in \(\varPhi (S,{\mathcal {P}}^\text {v})\) and hence can be used to answer RCP queries. Therefore, the overall preprocessing time of our \({\mathcal {P}}\)-RCP data structure is \(O(n \log ^2\! n)\).

Theorem 3.3

There exists a \({\mathcal {P}}\)-RCP data structure \({\mathcal {B}}\) such that:

  • For any \(S\subseteq {\mathbb {R}}^2\) of size n,

    $$\begin{aligned} { {\mathsf {Space}}}({\mathcal {B}}(S))=O(n \log n)\quad \text {and}\quad { {\mathsf {Qtime}}}({\mathcal {B}}(S)) = O(\log n). \end{aligned}$$
  • For a random \(S \propto R^n\) where R is the unit square or more generally an arbitrary axes-parallel rectangle,

    $$\begin{aligned} {\mathbb {E}}[{ {\mathsf {Space}}}({\mathcal {B}}(S))] = O(n)\quad \text {and}\quad {\mathbb {E}}[{ {\mathsf {Qtime}}}({\mathcal {B}}(S))] = O(\log n). \end{aligned}$$

Furthermore, the above data structure can be constructed in \(O(n \log ^2\! n)\) worst-case time.

4 Rectangle Query

We consider the RCP problem for rectangle queries, i.e., the \({\mathcal {R}}\)-RCP problem. Unlike the data structures presented in the previous sections, our \({\mathcal {R}}\)-RCP data structure is somewhat complicated, so we give a brief overview below before discussing the details.

Overview. Interestingly, our final data structure for the \({\mathcal {R}}\)-RCP problem is a combination of two simpler \({\mathcal {R}}\)-RCP data structures, each of which partially achieves the desired bounds. Specifically, the first data structure has the desired worst-case space cost and query time, while the second data structure has the desired average-case space cost and has an even better (worst-case) query time of \(O(\log n)\). By properly combining the two data structures (Sect. 4.4), we obtain our final \({\mathcal {R}}\)-RCP data structure, which achieves simultaneously the desired worst-case and average-case bounds. Both of the simpler data structures are based on range trees [3] and our results for quadrant and strip RCP problems presented in the previous sections.

In what follows, we first describe the common part of the two simpler \({\mathcal {R}}\)-RCP data structures. Let \(S \subseteq {\mathbb {R}}^2\) be a dataset of size n. The common component of our two data structures is a standard 2D range tree built on S [3]. The main tree (or primary tree) \({\mathcal {T}}\) is a range tree storing at its leaves the x-coordinates of the points in S. Each node \({\mathbf {u}}\in {\mathcal {T}}\) corresponds to a subset \(S({\mathbf {u}})\) of x-consecutive points in S, called the canonical subset of \({\mathbf {u}}\). At \({\mathbf {u}}\), there is an associated secondary tree \({\mathcal {T}}_{\mathbf {u}}\), which is a range tree whose leaves store the y-coordinates of the points in \(S({\mathbf {u}})\). With an abuse of notation, for each node \({\mathbf {v}}\in {\mathcal {T}}_{\mathbf {u}}\), we still use \(S({\mathbf {v}})\) to denote the canonical subset of \({\mathbf {v}}\), which is a subset of y-consecutive points in \(S({\mathbf {u}})\). As in [6], for each (non-leaf) primary node \({\mathbf {u}}\in {\mathcal {T}}\), we fix a vertical line \(l_{\mathbf {u}}\) such that the points in the canonical subset of the left (resp., right) child of \({\mathbf {u}}\) are to the left (resp., right) of \(l_{\mathbf {u}}\). Similarly, for each (non-leaf) secondary node \({\mathbf {v}}\), we fix a horizontal line \(l_{\mathbf {v}}\) such that the points in the canonical subset of the left (resp., right) child of \({\mathbf {v}}\) are above (resp., below) \(l_{\mathbf {v}}\). Let \({\mathbf {v}}\in {\mathcal {T}}_{\mathbf {u}}\) be a secondary node. Then at \({\mathbf {v}}\) we have two lines \(l_{\mathbf {v}}\) and \(l_{\mathbf {u}}\), which partition \({\mathbb {R}}^2\) into four quadrants. We denote by \(S_1({\mathbf {v}}),\dots ,S_4({\mathbf {v}})\) the subsets of \(S({\mathbf {v}})\) contained in these quadrants; see Fig. 2 for the correspondence.

Fig. 2 Illustrating the subsets \(S_1({\mathbf {v}}),\dots ,S_4({\mathbf {v}})\)

In order to solve the problem, we need to store some additional data structures (called sub-structures) at the nodes of the tree. At each secondary node \({\mathbf {v}}\), we store four \({\mathcal {Q}}\)-RCP data structures \({\mathcal {A}}(S_1({\mathbf {v}})),\dots ,{\mathcal {A}}(S_4({\mathbf {v}}))\) (Theorem 2.2).
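A minimal Python sketch of this common skeleton is given below. It assumes points in general position (distinct x- and y-coordinates), builds secondary trees only at non-leaf primary nodes (where \(l_{\mathbf {u}}\) is defined), and only records the four quadrant subsets; the quadrant RCP sub-structures of Theorem 2.2 are indicated by a comment. All class and key names are illustrative, and the correspondence between the compass labels and the indices \(S_1({\mathbf {v}}),\dots ,S_4({\mathbf {v}})\) of Fig. 2 is left implicit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class SecondaryNode:
    pts: List[Point]                          # S(v)
    l_u: float                                # vertical line x = l_u from u
    l_v: Optional[float] = None               # horizontal splitting line
    quads: Dict[str, List[Point]] = field(default_factory=dict)
    left: Optional["SecondaryNode"] = None    # canonical subset above l_v
    right: Optional["SecondaryNode"] = None   # canonical subset below l_v

@dataclass
class PrimaryNode:
    pts: List[Point]                          # S(u)
    l_u: Optional[float] = None               # vertical splitting line
    secondary: Optional[SecondaryNode] = None # the secondary tree T_u
    left: Optional["PrimaryNode"] = None      # canonical subset left of l_u
    right: Optional["PrimaryNode"] = None     # canonical subset right of l_u

def build_secondary(pts: List[Point], l_u: float) -> SecondaryNode:
    pts = sorted(pts, key=lambda p: -p[1])    # top to bottom
    v = SecondaryNode(pts=pts, l_u=l_u)
    if len(pts) > 1:
        mid = len(pts) // 2
        v.l_v = (pts[mid - 1][1] + pts[mid][1]) / 2   # separates the halves
        # S(v) split by l_u and l_v into four quadrant subsets.  In the real
        # structure, a quadrant RCP structure A(.) of Theorem 2.2 is built
        # on each of these subsets.
        v.quads = {
            "NW": [p for p in pts if p[0] < l_u and p[1] > v.l_v],
            "NE": [p for p in pts if p[0] > l_u and p[1] > v.l_v],
            "SW": [p for p in pts if p[0] < l_u and p[1] < v.l_v],
            "SE": [p for p in pts if p[0] > l_u and p[1] < v.l_v],
        }
        v.left = build_secondary(pts[:mid], l_u)
        v.right = build_secondary(pts[mid:], l_u)
    return v

def build_primary(pts: List[Point]) -> PrimaryNode:
    pts = sorted(pts)                         # left to right
    u = PrimaryNode(pts=pts)
    if len(pts) > 1:
        mid = len(pts) // 2
        u.l_u = (pts[mid - 1][0] + pts[mid][0]) / 2   # separates the halves
        u.secondary = build_secondary(pts, u.l_u)
        u.left = build_primary(pts[:mid])
        u.right = build_primary(pts[mid:])
    return u
```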

Now let us explain what we can do by using this 2D range tree (with the sub-structures). Let \(R = [x_1,x_2] \times [y_1,y_2] \in {\mathcal {R}}\) be a query rectangle. We first find in \({\mathcal {T}}\) the splitting node \({\mathbf {u}} \in {\mathcal {T}}\) corresponding to the range \([x_1,x_2]\), which is by definition the LCA of all the leaves whose corresponding points are in \([x_1,x_2] \times {\mathbb {R}}\). Then we find in \({\mathcal {T}}_{\mathbf {u}}\) the splitting node \({\mathbf {v}} \in {\mathcal {T}}_{\mathbf {u}}\) corresponding to the range \([y_1,y_2]\). If either of the splitting nodes does not exist or is a leaf node, then \(|S \cap R| \le 1\) and nothing should be reported. So assume \({\mathbf {u}}\) and \({\mathbf {v}}\) are non-leaf nodes. By the property of splitting node, we have \(S \cap R = S({\mathbf {v}}) \cap R\), and the lines \(l_{\mathbf {u}}\) and \(l_{\mathbf {v}}\) both intersect R. Thus, \(l_{\mathbf {u}}\) and \(l_{\mathbf {v}}\) decompose R into four smaller rectangles \(R_1,\dots ,R_4\); see Fig. 3 for the correspondence. By construction, we have \(S({\mathbf {v}})\cap R_i = S_i({\mathbf {v}}) \cap R_i\). In order to find the closest-pair in \(S\cap R\), we first compute the closest-pair in \(S\cap R_i\) for all \(i\in \{1,\dots ,4\}\). This can be done by querying the sub-structures stored at \({\mathbf {v}}\). Indeed, \(S\cap R_i=S({\mathbf {v}})\cap R_i = S_i({\mathbf {v}})\cap R_i = S_i({\mathbf {v}}) \cap Q_i\), where \(Q_i\) is the quadrant obtained by removing the two sides of \(R_i\) that coincide with \(l_{\mathbf {u}}\) and \(l_{\mathbf {v}}\). Therefore, we can query \({\mathcal {A}}(S_i({\mathbf {v}}))\) with \(Q_i\) to find the closest-pair in \(S\cap R_i\). Once the four closest-pairs are computed, we take the shortest one (i.e., the one of the smallest length) among them and denote it by \(\phi \).

Fig. 3 Illustrating the rectangles \(R_1,\dots ,R_4\)

Clearly, \(\phi \) is not necessarily the closest-pair in \(S \cap R\) as the two points in the closest-pair may belong to different \(R_i\)’s. However, as we will see, with \(\phi \) in hand, finding the closest-pair in \(S \cap R\) becomes easier. Suppose \(l_{\mathbf {u}}: x = \alpha \) and \(l_{\mathbf {v}}: y = \beta \), where \(x_1 \le \alpha \le x_2\) and \(y_1 \le \beta \le y_2\). Let \(\delta \) be the length of \(\phi \). We define \(P_\alpha =[\alpha -\delta ,\alpha +\delta ] \times {\mathbb {R}}\) (resp., \(P_\beta = {\mathbb {R}} \times [\beta -\delta ,\beta +\delta ]\)) and \(R_\alpha = R \cap P_\alpha \) (resp., \(R_\beta = R \cap P_\beta \)); see Fig. 4.

Fig. 4 Illustrating the rectangles \(R_\alpha \) and \(R_\beta \)

We have the following key observation.

Lemma 4.1

The closest-pair in \(S \cap R\) is the shortest one among \(\{\phi ,\phi _\alpha ,\phi _\beta \}\), where \(\phi _\alpha \) (resp., \(\phi _\beta \)) is the closest-pair in \(S \cap R_\alpha \) (resp., \(S \cap R_\beta \)).

Proof

Let \(\phi ^* = (a^*,b^*)\) be the closest-pair in \(S \cap R\). Since \(\phi ,\phi _\alpha ,\phi _\beta \) are all point-pairs in \(S \cap R\), it suffices to show that \(\phi ^* \in \{\phi ,\phi _\alpha ,\phi _\beta \}\). If \(\phi ^* = \phi \), we are done. So assume \(\phi ^* \ne \phi \). Then \(a^*\) and \(b^*\) must be contained in different \(R_i\)’s. It follows that the segment \([a^*,b^*]\) intersects either \(l_{\mathbf {u}}\) or \(l_{\mathbf {v}}\). Note that the length of \(\phi ^*\) is at most \(\delta \) (recall that \(\delta \) is the length of \(\phi \)), which implies \(|a^*.x - b^*.x| \le \delta \) and \(|a^*.y - b^*.y|\le \delta \). If \([a^*,b^*]\) intersects \(l_{\mathbf {u}}\), then \(a^*,b^* \in P_\alpha \) (because \(|a^*.x - b^*.x| \le \delta \)). Thus, \(a^*,b^* \in R_\alpha \) and \(\phi ^* = \phi _\alpha \). Similarly, if \([a^*,b^*]\) intersects \(l_{\mathbf {v}}\), we have \(\phi ^* = \phi _\beta \). As a result, \(\phi ^* \in \{\phi ,\phi _\alpha ,\phi _\beta \}\). \(\square \)
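The resulting query procedure can be summarized by the following Python sketch of the combine step of Lemma 4.1, with the sub-structure queries abstracted as callbacks. All names are illustrative, and the degenerate case in which every \(R_i\) contains at most one point (so that \(|S \cap R| \le 4\)) is assumed to be handled separately.

```python
from math import dist, inf
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]
Pair = Tuple[Point, Point]

def shortest(pairs) -> Optional[Pair]:
    """Return the shortest pair among the non-None entries of pairs."""
    best, best_len = None, inf
    for pr in pairs:
        if pr is not None and dist(*pr) < best_len:
            best, best_len = pr, dist(*pr)
    return best

def rectangle_rcp_combine(x1: float, x2: float, y1: float, y2: float,
                          alpha: float, beta: float,
                          quadrant_rcp: Callable[[int], Optional[Pair]],
                          rect_rcp: Callable[[float, float, float, float],
                                             Optional[Pair]]) -> Optional[Pair]:
    """Combine step of Lemma 4.1 (sketch).

    The query rectangle is R = [x1, x2] x [y1, y2]; the splitting lines are
    l_u: x = alpha and l_v: y = beta, with x1 <= alpha <= x2 and
    y1 <= beta <= y2.  quadrant_rcp(i) returns the closest pair in S ∩ R_i
    (the four quadrant sub-structure queries); rect_rcp(...) returns the
    closest pair in a given rectangle and stands in for the procedures for
    R_alpha and R_beta developed in Sects. 4.2 and 4.3.
    """
    phi = shortest(quadrant_rcp(i) for i in range(1, 5))
    if phi is None:        # every R_i has at most one point; handled separately
        return None
    delta = dist(*phi)
    phi_alpha = rect_rcp(max(x1, alpha - delta), min(x2, alpha + delta), y1, y2)
    phi_beta = rect_rcp(x1, x2, max(y1, beta - delta), min(y2, beta + delta))
    return shortest([phi, phi_alpha, phi_beta])
```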

Due to the above lemma, it now suffices to compute \(\phi _\alpha \) and \(\phi _\beta \). Note that \(R_\alpha \) and \(R_\beta \) are rectangles, so computing \(\phi _\alpha \) and \(\phi _\beta \) still requires rectangle RCP queries. Fortunately, there are some additional properties which make it easy to search for the closest-pairs in \(S\cap R_\alpha \) and \(S\cap R_\beta \). For a set A of points in \({\mathbb {R}}^2\) and \(a,b \in A\), we define the x-gap (resp., y-gap) between a and b in A as the number of the points in \(A \setminus \{a,b\}\) whose x-coordinates (resp., y-coordinates) are in between a.x and b.x (resp., a.y and b.y).

Lemma 4.2

There exists a constant integer k such that the y-gap (resp., x-gap) between the two points of \(\phi _\alpha \) (resp., \(\phi _\beta \)) in \(S \cap R_\alpha \) (resp., \(S \cap R_\beta \)) is at most k.

Proof

We only need to consider \(\phi _\alpha \). Let \(k=100\). The reason for this choice will become clear shortly. Suppose \(\phi _\alpha = (a,b)\). We denote by w the left-right width of \(R_\alpha \), i.e., the distance between the left and right boundaries of \(R_\alpha \). By the construction of \(R_\alpha \), we have \(w \le 2 \delta \). We consider two cases: \(|a.y-b.y| \ge 2w\) and \(|a.y-b.y| < 2w\). Suppose \(|a.y-b.y| \ge 2w\). Assume there are more than k points in \((S \cap R_\alpha ) \setminus \{a,b\}\) whose y-coordinates are in between a.y and b.y. Then by the pigeonhole principle, we can find, among these points, two points \(a'\) and \(b'\) such that \(|a'.y-b'.y| \le |a.y-b.y|/k\). Since \(a',b' \in R_\alpha \), we have \(|a'.x-b'.x|\le w\le |a.y-b.y|/2\). It follows that

$$\begin{aligned}{\text {dist}}(a',b') \le |a'.x-b'.x| + |a'.y-b'.y| < |a.y-b.y| \le {\text {dist}}(a,b),\end{aligned}$$

which contradicts the fact that \(\phi _\alpha \) is the closest-pair in \(S \cap R_\alpha \). Next, suppose \(|a.y-b.y| < 2w\). Then \(|a.y-b.y| < 4 \delta \). Consider the rectangle \(R^* = R_\alpha \cap ({\mathbb {R}} \times [a.y,b.y])\). Note that \((S \cap R^*) \setminus \{a,b\}\) consists of exactly the points in \((S \cap R_\alpha ) \setminus \{a,b\}\) whose y-coordinates are in between a.y and b.y. Therefore, it suffices to show \(|S \cap R^*| \le k\). Let \(R_i^* = R^* \cap R_i\) for \(i \in \{1,\dots ,4\}\). Since \(R_i^* \subseteq R_i\), the pairwise distances of the points in \(S \cap R_i^*\) are at least \(\delta \). Furthermore, the left-right width of each \(R_i^*\) is at most \(\delta \) and the top-bottom width of each \(R_i^*\) is at most \(|a.y-b.y|\) (which is smaller than \(4 \delta \)). Therefore, a simple argument using the pigeonhole principle shows that \(|S \cap R_i^*| \le 16 < 25 = k/4\): partition \(R_i^*\) into a \(2 \times 8\) grid of cells with side lengths at most \(\delta /2\); each cell has diameter less than \(\delta \) and hence contains at most one point of \(S \cap R_i^*\). As such, \(|S \cap R^*| \le k\). \(\square \)

We shall use the above lemma to help compute \(\phi _\alpha \) and \(\phi _\beta \). At this point, our two data structures diverge.

4.1 Preliminary: Extreme Point Data Structures

Before presenting our results, we introduce the so-called top/bottom extreme point (TBEP) and left/right extreme point (LREP) data structures. For a query space \({\mathcal {X}}\) and a constant integer k, an \(({\mathcal {X}},k)\)-TBEP (resp. \(({\mathcal {X}},k)\)-LREP) data structure stores a given set S of points in \({\mathbb {R}}^2\) and can report the k topmost/bottommost (resp., leftmost/rightmost) points in \(S \cap X\) for a query range \(X \in {\mathcal {X}}\).

Lemma 4.3

Let k be a constant integer. There exists a \(({\mathcal {P}}^\text {v},k)\)-TBEP data structure \({\mathcal {K}}^\text {v}\) such that for any \(S \subseteq {\mathbb {R}}^2\) of size n, \({ {\mathsf {Space}}}({\mathcal {K}}^\text {v}(S)) = O(n)\) and \({ {\mathsf {Qtime}}}({\mathcal {K}}^\text {v}(S)) = O(\log n)\). Furthermore, the above data structure can be constructed in \(O(n \log n)\) worst-case time. Symmetrically, there also exists a \(({\mathcal {P}}^\text {h},k)\)-LREP data structure \({\mathcal {K}}^\text {h}\) satisfying the same bounds.

Proof

Let \(S \subseteq {\mathbb {R}}^2\) be a dataset of size n. The \(({\mathcal {P}}^\text {v},k)\)-TBEP data structure instance \({\mathcal {K}}^\text {v}(S)\) is a standard 1D range tree \({\mathcal {T}}\) built on the x-coordinates of the points in S. By the construction of a range tree, each node \({\mathbf {u}} \in {\mathcal {T}}\) corresponds to a subset \(S({\mathbf {u}})\) of x-consecutive points in S, called the canonical subset of \({\mathbf {u}}\). The leaves of \({\mathcal {T}}\) one-to-one correspond to the points in S. At each node \({\mathbf {u}} \in {\mathcal {T}}\), we store the k topmost and k bottommost points in \(S({\mathbf {u}})\); we denote the set of these 2k points by \(K({\mathbf {u}})\). The overall space cost of the range tree (with the stored points) is clearly O(n), as k is a constant.

To answer a query \(P = [x_1,x_2] \times {\mathbb {R}} \in {\mathcal {P}}^\text {v}\), we first find the \(t = O(\log n)\) canonical nodes \({\mathbf {u}}_1,\dots ,{\mathbf {u}}_t \in {\mathcal {T}}\) corresponding to the range \([x_1,x_2]\). This is a standard range-tree operation, which can be done in \(O(\log n)\) time. We compute \(K = \bigcup _{i=1}^t K({\mathbf {u}}_i)\) in \(O(\log n)\) time. We then use selection to find the k topmost and k bottommost points in K; this can be done in \(O(\log n)\) time since \(|K| = 2kt = O(\log n)\). These 2k points are just the k topmost and k bottommost points in \(S \cap P\).

The above data structure can be easily built in \(O(n \log n)\) time, because finding \(K({\mathbf {u}})\) for each node \({\mathbf {u}}\in {\mathcal {T}}\) takes \(O(|S({\mathbf {u}})|)\) time (via selection). The \(({\mathcal {P}}^\text {h},k)\)-LREP data structure \({\mathcal {K}}^\text {h}\) is constructed in a symmetric way. \(\square \)
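A self-contained Python sketch of this TBEP data structure is given below; selection is replaced by sorting for brevity, x-coordinates are assumed to be distinct, and all names are illustrative.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]

class TBEPNode:
    """Node of the 1D range tree of Lemma 4.3 (a sketch).

    Each node stores the k topmost and k bottommost points of its
    canonical subset (all of them if the subset has at most 2k points).
    """
    def __init__(self, pts_by_x: List[Point], k: int):
        self.xlo, self.xhi = pts_by_x[0][0], pts_by_x[-1][0]
        by_y = sorted(pts_by_x, key=lambda p: p[1])
        self.K = by_y if len(by_y) <= 2 * k else by_y[:k] + by_y[-k:]
        self.left: Optional["TBEPNode"] = None
        self.right: Optional["TBEPNode"] = None
        if len(pts_by_x) > 1:
            mid = len(pts_by_x) // 2
            self.left = TBEPNode(pts_by_x[:mid], k)
            self.right = TBEPNode(pts_by_x[mid:], k)

def build_tbep(points: List[Point], k: int) -> TBEPNode:
    return TBEPNode(sorted(points), k)

def query_tbep(node: Optional[TBEPNode], x1: float, x2: float,
               k: int) -> List[Point]:
    """Return the (at most) k topmost and k bottommost points of
    S ∩ ([x1, x2] x R), by combining the points stored at the O(log n)
    canonical nodes for [x1, x2]."""
    if node is None or node.xhi < x1 or node.xlo > x2:
        return []
    if x1 <= node.xlo and node.xhi <= x2:          # canonical node
        return list(node.K)
    K = query_tbep(node.left, x1, x2, k) + query_tbep(node.right, x1, x2, k)
    by_y = sorted(K, key=lambda p: p[1])
    return by_y if len(by_y) <= 2 * k else by_y[:k] + by_y[-k:]
```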

Lemma 4.4

Let l be a vertical (resp., horizontal) line and k be a constant integer. There exists a \(({\mathcal {P}}_l,k)\)-TBEP (resp., \(({\mathcal {P}}_l,k)\)-LREP) data structure \({\mathcal {K}}_l\) such that for \(S\propto \prod _{i=1}^nI_i\) where \(I_1,\dots ,I_n\) are distinct vertical (resp., horizontal) aligned segments, \({\mathbb {E}}[{ {\mathsf {Space}}}({\mathcal {K}}_l(S))] = O(\log n)\) and \({\mathbb {E}}[{ {\mathsf {Qtime}}}({\mathcal {K}}_l(S))]=O(\log \log n)\). Furthermore, the above data structure can be built in \(O(n \log n)\) worst-case time.

4.2 First Data Structure

We now introduce our first \({\mathcal {R}}\)-RCP data structure, which achieves the desired worst-case bounds. Let k be the constant integer in Lemma 4.2. In our first data structure, besides the 2D range tree presented before, we additionally build two 1D range trees \({\mathcal {T}}'\) and \({\mathcal {T}}''\) on S, where \({\mathcal {T}}'\) (resp., \({\mathcal {T}}''\)) is built on y-coordinates (resp., x-coordinates). For \({\mathbf {u}}' \in {\mathcal {T}}'\) (resp., \({\mathbf {u}}'' \in {\mathcal {T}}''\)), we still use \(S({\mathbf {u}}')\) (resp., \(S({\mathbf {u}}'')\)) to denote the canonical subset of \({\mathbf {u}}'\) (resp., \({\mathbf {u}}''\)). At each node \({\mathbf {u}}' \in {\mathcal {T}}'\), we store a \({\mathcal {P}}\)-RCP data structure \({\mathcal {B}}(S({\mathbf {u}}'))\) (Theorem 3.3) and a \(({\mathcal {P}}^\text {v},k)\)-TBEP data structure \({\mathcal {K}}^\text {v}(S({\mathbf {u}}'))\) (Lemma 4.3). Similarly, at each node \({\mathbf {u}}'' \in {\mathcal {T}}''\), we store a \({\mathcal {P}}\)-RCP data structure \({\mathcal {B}}(S({\mathbf {u}}''))\) (Theorem 3.3) and a \(({\mathcal {P}}^\text {h},k)\)-LREP data structure \({\mathcal {K}}^\text {h}(S({\mathbf {u}}''))\) (Lemma 4.3).

We now explain how to compute \(\phi _\alpha \) and \(\phi _\beta \). Suppose \(R_\alpha = [x_\alpha ,x_\alpha '] \times [y_\alpha ,y_\alpha ']\). Let \(P_x = [x_\alpha ,x_\alpha '] \times {\mathbb {R}}\) and \(P_y = {\mathbb {R}} \times [y_\alpha ,y_\alpha ']\). To compute \(\phi _\alpha \), we first find in \({\mathcal {T}}'\) the \(t = O(\log n)\) canonical nodes \({\mathbf {u}}_1',\dots ,{\mathbf {u}}_t' \in {\mathcal {T}}'\) corresponding to the range \([y_\alpha ,y_\alpha ']\). Then \(\bigcup _{i=1}^t S({\mathbf {u}}_i') = S \cap P_y\), and each \(S({\mathbf {u}}_i')\) is a set of y-consecutive points in \(S \cap P_y\). Furthermore, \(S \cap R_\alpha = \bigcup _{i=1}^t S({\mathbf {u}}_i') \cap P_x\). We query the sub-structures \({\mathcal {B}}(S({\mathbf {u}}_1')),\dots ,{\mathcal {B}}(S({\mathbf {u}}_t'))\) with \(P_x\) to find the closest-pairs \(\phi _1,\dots ,\phi _t\) in \(S({\mathbf {u}}_1') \cap P_x,\dots ,S({\mathbf {u}}_t') \cap P_x\), respectively. We also query \({\mathcal {K}}^\text {v}(S({\mathbf {u}}_1')),\dots ,{\mathcal {K}}^\text {v}(S({\mathbf {u}}_t'))\) with \(P_x\) to obtain the k topmost and k bottommost points in \(S({\mathbf {u}}_1') \cap P_x,\dots ,S({\mathbf {u}}_t') \cap P_x\), respectively; we denote by K the set of the 2kt reported points. Then we find the closest-pair \(\phi _K\) in K using the standard divide-and-conquer algorithm.

We claim that \(\phi _\alpha \) is the shortest one among \(\{\phi _1,\dots ,\phi _t,\phi _K\}\). Suppose \(\phi _\alpha =(a,b)\). If the two points of \(\phi _\alpha \) are both contained in some \(S({\mathbf {u}}_i')\), then clearly \(\phi _\alpha =\phi _i\). Otherwise, by Lemma 4.2 and the choice of k, the two points of \(\phi _\alpha \) must belong to K and hence \(\phi _\alpha =\phi _K\). It follows that \(\phi _\alpha \in \{\phi _1,\dots ,\phi _t,\phi _K\}\). Furthermore, because the pairs \(\phi _1,\dots ,\phi _t,\phi _K\) are all contained in \(R_\alpha \), \(\phi _\alpha \) must be the shortest one among \(\{\phi _1,\dots ,\phi _t,\phi _K\}\). Therefore, with \(\phi _1,\dots ,\phi _t,\phi _K\) in hand, \(\phi _\alpha \) can be easily computed. The pair \(\phi _\beta \) is computed symmetrically using \({\mathcal {T}}''\). Finally, taking the shortest one among \(\{\phi ,\phi _\alpha ,\phi _\beta \}\), the query R can be answered.
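A Python sketch of this combining step follows, with the sub-structure answers passed in as inputs. All names are illustrative; the closest pair of K is computed by brute force here (the text uses divide and conquer, which is faster since \(|K| = O(\log n)\)), and the reported point sets are assumed to contain no repeated points.

```python
from math import dist, inf
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, float]
Pair = Tuple[Point, Point]

def combine_phi_alpha(strip_answers: Iterable[Optional[Pair]],
                      extreme_point_sets: Iterable[List[Point]]) -> Optional[Pair]:
    """Combine the per-canonical-node answers into phi_alpha (first structure).

    strip_answers: the pairs phi_1, ..., phi_t returned by the strip RCP
    sub-structures queried with P_x (None where a canonical subset meets
    P_x in at most one point).
    extreme_point_sets: for each canonical node, the at most 2k points
    returned by its TBEP sub-structure; their union is the set K of the text.
    Returns the shortest pair among phi_1, ..., phi_t and the closest pair
    phi_K of K, which by the claim above equals phi_alpha.
    """
    best, best_len = None, inf
    for pr in strip_answers:
        if pr is not None and dist(*pr) < best_len:
            best, best_len = pr, dist(*pr)
    K = [p for pts in extreme_point_sets for p in pts]
    for i in range(len(K)):
        for j in range(i + 1, len(K)):
            d = dist(K[i], K[j])
            if d < best_len:
                best, best_len = (K[i], K[j]), d
    return best
```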

The 2D range tree together with the two 1D range trees \({\mathcal {T}}'\) and \({\mathcal {T}}''\) forms our first \({\mathcal {R}}\)-RCP data structure. A straightforward analysis gives us the worst-case performance of this data structure.

Theorem 4.5

There exists an \({\mathcal {R}}\)-RCP data structure \({\mathcal {D}}_1\) such that for any \(S \subseteq {\mathbb {R}}^2\) of size n, \({ {\mathsf {Space}}}({\mathcal {D}}_1(S)) = O(n \log ^2\! n)\) and \({ {\mathsf {Qtime}}}({\mathcal {D}}_1(S)) = O(\log ^2\! n)\). Furthermore, the above data structure can be constructed in \(O(n \log ^4 \!n)\) worst-case time.

Proof

We first analyze the space cost. Let \({\mathbf {v}}\) be a secondary node of the 2D range tree. By Theorem 2.2, the space cost of the sub-structures stored at \({\mathbf {v}}\) is \(O(|S({\mathbf {v}})|)\). Therefore, for a primary node \({\mathbf {u}} \in {\mathcal {T}}\) of the 2D range tree, the space cost of \({\mathcal {T}}_{\mathbf {u}}\) (with the sub-structures) is \(O(|S({\mathbf {u}})| \log {|S({\mathbf {u}})|})\). As a result, the entire space cost of the 2D range tree is \(O(n \log ^2\! n)\). Let \({\mathbf {u}}' \in {\mathcal {T}}'\) be a node of the 1D range tree \({\mathcal {T}}'\). By Theorem 3.3 and Lemma 4.3, the space cost of the sub-structures stored at \({\mathbf {u}}'\) is \(O(|S({\mathbf {u}}')| \log {|S({\mathbf {u}}')|})\). As such, the entire space cost of \({\mathcal {T}}'\) is \(O(n \log ^2\! n)\). For the same reason, the space cost of \({\mathcal {T}}''\) is \(O(n \log ^2\! n)\), and hence the entire space cost of \({\mathcal {D}}_1\) is \(O(n \log ^2\! n)\).

Next, we analyze the query time. When answering a query, we need to compute the pairs \(\phi ,\phi _\alpha ,\phi _\beta \) in Lemma 4.1. To compute \(\phi \), we first find the splitting nodes \({\mathbf {u}} \in {\mathcal {T}}\) and \({\mathbf {v}} \in {\mathcal {T}}_{\mathbf {u}}\). This is done by a top-down walk in \({\mathcal {T}}\) and \({\mathcal {T}}_{\mathbf {u}}\), which takes \(O(\log n)\) time. Then we query the sub-structures \({\mathcal {A}}(S_1({\mathbf {v}})),\dots ,{\mathcal {A}}(S_4({\mathbf {v}}))\), which can be done in \(O(\log n)\) time by Theorem 2.2. Thus, the time for computing \(\phi \) is \(O(\log n)\). To compute \(\phi _\alpha \), we first find the \(t = O(\log n)\) canonical nodes \({\mathbf {u}}_1',\dots ,{\mathbf {u}}_t' \in {\mathcal {T}}'\), which can be done in \(O(\log n)\) time. Then we query the sub-structures \({\mathcal {B}}(S({\mathbf {u}}_1')),\dots ,{\mathcal {B}}(S({\mathbf {u}}_t'))\) and \({\mathcal {K}}^\text {v}(S({\mathbf {u}}_1')),\dots ,{\mathcal {K}}^v (S({\mathbf {u}}_t'))\) to obtain the pairs \(\phi _1,\dots ,\phi _t\) and the set K of 2kt points. By Theorem 3.3 and Lemma 4.3, this step can be done in \(O(\log ^2\! n)\) time. Finally, we compute the closest-pair \(\phi _K\) in K using the standard divide-and-conquer algorithm, which takes \(O(\log n \log \log n)\) time since \(|K| = O(\log n)\). Thus, the time for computing \(\phi _\alpha \) is \(O(\log ^2\! n)\), so is the time for computing \(\phi _\beta \). As a result, the overall query time is \(O(\log ^2\! n)\).

Finally, we analyze the preprocessing time. By Theorem 2.2, to build the sub-structures stored at each secondary node \({\mathbf {v}}\) of the 2D range tree takes \(O(|S({\mathbf {v}})| \log ^2\! |S({\mathbf {v}})|)\) time. Therefore, the time for building the 2D range tree is \(O(n \log ^4\! n)\). By Theorem 3.3 and Lemma 4.3, to build the sub-structures at each node \({\mathbf {u}}\) of \({\mathcal {T}}'\) or \({\mathcal {T}}''\) takes \(O(|S({\mathbf {u}})| \log ^2\! |S({\mathbf {u}})|)\) time. Therefore, the time for building the two 1D range trees \({\mathcal {T}}'\) and \({\mathcal {T}}''\) is \(O(n \log ^3\! n)\). As a result, the overall preprocessing time is \(O(n \log ^4\! n)\). \(\square \)

Our first data structure itself already achieves the desired worst-case bounds, which simultaneously improves the results given in [6, 9].

4.3 Second Data Structure

We now introduce our second \({\mathcal {R}}\)-RCP data structure, which has the desired average-case space cost and an \(O(\log n)\) query time (even in worst-case). Furthermore, this data structure can be constructed efficiently in average-case. In our second data structure, we only use the 2D range tree presented before, but we need some additional sub-structures stored at each secondary node. Let k be the constant integer in Lemma 4.2. Define \(S_\blacktriangle ({\mathbf {v}}) = S_3({\mathbf {v}}) \cup S_4({\mathbf {v}})\) (resp., \(S_\blacktriangledown ({\mathbf {v}}) = S_1({\mathbf {v}}) \cup S_2({\mathbf {v}})\)) as the subset of \(S({\mathbf {v}})\) consisting of the points above (resp., below) \(l_{\mathbf {v}}\). Similarly, define \(S_\blacktriangleleft ({\mathbf {v}})\) and \(S_\blacktriangleright ({\mathbf {v}})\) as the subsets to the left and right of \(l_{\mathbf {u}}\), respectively. Let \({\mathbf {v}} \in {\mathcal {T}}_{\mathbf {u}}\) be a secondary node. Besides \({\mathcal {A}}(S_1({\mathbf {v}})),\dots ,{\mathcal {A}}(S_4({\mathbf {v}}))\), we store at \({\mathbf {v}}\) two \(({\mathcal {P}}_{l_{\mathbf {u}}},k)\)-TBEP data structures \({\mathcal {K}}_{l_{\mathbf {u}}}(S_\blacktriangle ({\mathbf {v}})),{\mathcal {K}}_{l_{\mathbf {u}}}(S_\blacktriangledown ({\mathbf {v}}))\) (Lemma 4.4) and two \(({\mathcal {P}}_{l_{\mathbf {v}}},k)\)-LREP data structures \({\mathcal {K}}_{l_{\mathbf {v}}}(S_\blacktriangleleft ({\mathbf {v}})),{\mathcal {K}}_{l_{\mathbf {v}}}(S_\blacktriangleright ({\mathbf {v}}))\) (Lemma 4.4).

Furthermore, we need a new kind of sub-structure, called the range shortest-segment (RSS) data structure. For a query space \({\mathcal {X}}\), an \({\mathcal {X}}\)-RSS data structure stores a given set of segments in \({\mathbb {R}}^2\) and can report the shortest segment contained in a query range \(X\in {\mathcal {X}}\). For the case \({\mathcal {X}} = {\mathcal {U}}\), we have the following RSS data structure.

Lemma 4.6

There exists a \({\mathcal {U}}\)-RSS data structure \({\mathcal {C}}\) such that for any set G of m segments, \({ {\mathsf {Space}}}({\mathcal {C}}(G))=O(m^2)\) and \({ {\mathsf {Qtime}}}({\mathcal {C}}(G))=O(\log m)\). Furthermore, the above data structure can be built in \(O(m^2\log m)\) worst-case time.

Proof

It suffices to design the \({\mathcal {U}}^\downarrow \)-RSS data structure. We first notice the existence of a \({\mathcal {P}}^\text {v}\)-RSS data structure using O(m) space with \(O(\log m)\) query time, which can be built in \(O(m \log m)\) time. Indeed, by applying the method in Sect. 3, we immediately obtain this data structure (a segment here corresponds to a candidate pair in Sect. 3).

With this \({\mathcal {P}}^\text {v}\)-RSS data structure in hand, it is quite straightforward to design the desired \({\mathcal {U}}^\downarrow \)-RSS data structure. Let \(G=\{\sigma _1,\dots ,\sigma _m\}\) be a set of m segments where \(\sigma _i=[a_i,b_i]\). Define \(y_i=\max {\{a_i.y,b_i.y\}}\) and assume \(y_1\le \ldots \le y_m\). Now build a (balanced) binary search tree with keys \(y_1,\dots ,y_m\). We denote by \({\mathbf {u}}_i\) the node corresponding to \(y_i\). At \({\mathbf {u}}_i\), we store a \({\mathcal {P}}^\text {v}\)-RSS data structure as described above, built on the subset \(G_i=\{\sigma _1,\dots ,\sigma _i\}\subseteq G\). The overall space cost is clearly \(O(m^2)\), and the time for constructing the data structure is \(O(m^2\log m)\).

To answer a query \(U = [x_1,x_2] \times (-\infty ,y] \in {\mathcal {U}}^\downarrow \), we first use the binary search tree to find the maximum \(y_i\) that is less than or equal to y. This can be done in \(O(\log m)\) time. Let \(P = [x_1,x_2] \times {\mathbb {R}} \in {\mathcal {P}}^v \). Note that a segment \(\sigma \in G\) is contained in U iff \(\sigma \in G_i\) and \(\sigma \) is contained in P. Thus, we can find the desired segment by querying the \({\mathcal {P}}^\text {v}\)-RSS data structure stored at \({\mathbf {u}}_i\) with P, which takes \(O(\log m)\) time. \(\square \)
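The following Python sketch illustrates the prefix idea of this proof; the per-prefix \({\mathcal {P}}^\text {v}\)-RSS structures are replaced by plain lists that are scanned at query time, so the query below is linear rather than logarithmic, and all names are illustrative.

```python
import bisect
from math import dist, inf
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]

class UDownRSSNaive:
    """Sketch of the U^down-RSS structure of Lemma 4.6.

    Stores segments G = {sigma_1, ..., sigma_m} and, for a query
    U = [x1, x2] x (-inf, y], reports the shortest segment contained in U.
    """
    def __init__(self, segments: List[Segment]):
        # Sort by y_i = max y-coordinate of the two endpoints; the prefix
        # G_i then consists of the first i segments in this order.
        self.segs = sorted(segments, key=lambda s: max(s[0][1], s[1][1]))
        self.ys = [max(a[1], b[1]) for a, b in self.segs]

    def query(self, x1: float, x2: float, y: float) -> Optional[Segment]:
        i = bisect.bisect_right(self.ys, y)    # largest i with y_i <= y
        best, best_len = None, inf
        for a, b in self.segs[:i]:             # G_i, restricted to the strip
            if x1 <= min(a[0], b[0]) and max(a[0], b[0]) <= x2:
                d = dist(a, b)
                if d < best_len:
                    best, best_len = (a, b), d
        return best
```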

We now define \(\varPhi _\blacktriangle ({\mathbf {v}}) = \varPhi _{l_{\mathbf {u}}}(S_\blacktriangle ({\mathbf {v}}),{\mathcal {U}}^\downarrow )\), \(\varPhi _\blacktriangledown ({\mathbf {v}}) = \varPhi _{l_{\mathbf {u}}}(S_\blacktriangledown ({\mathbf {v}}),{\mathcal {U}}^\uparrow )\), \(\varPhi _\blacktriangleleft ({\mathbf {v}}) = \varPhi _{l_{\mathbf {v}}}(S_\blacktriangleleft ({\mathbf {v}}),{\mathcal {U}}^\rightarrow )\), and \(\varPhi _\blacktriangleright ({\mathbf {v}}) = \varPhi _{l_{\mathbf {v}}}(S_\blacktriangleright ({\mathbf {v}}),{\mathcal {U}}^\leftarrow )\). We can view \(\varPhi _\blacktriangle ({\mathbf {v}}), \varPhi _\blacktriangledown ({\mathbf {v}})\), \(\varPhi _\blacktriangleleft ({\mathbf {v}}),\varPhi _\blacktriangleright ({\mathbf {v}})\) as four sets of segments by identifying each point-pair (a, b) with the segment [a, b]. Then we apply Lemma 4.6 to build and store at \({\mathbf {v}}\) four \({\mathcal {U}}\)-RSS data structures \({\mathcal {C}}(\varPhi _\blacktriangle ({\mathbf {v}})), {\mathcal {C}}(\varPhi _\blacktriangledown ({\mathbf {v}})), {\mathcal {C}}(\varPhi _\blacktriangleleft ({\mathbf {v}})), {\mathcal {C}}(\varPhi _\blacktriangleright ({\mathbf {v}}))\).

We now explain how to compute \(\phi _\alpha \) and \(\phi _\beta \). Let us consider \(\phi _\alpha \). Recall that \(\phi _\alpha \) is the closest-pair in \(S \cap R_\alpha \), i.e., in \(S({\mathbf {v}}) \cap R_\alpha \). Let P be the \(l_{\mathbf {u}}\)-anchored strip obtained by removing the top/bottom bounding line of \(R_\alpha \). If the two points of \(\phi _\alpha \) are on opposite sides of \(l_{\mathbf {v}}\), then by Lemma 4.2 its two points must be among the k bottommost points in \(S_\blacktriangle ({\mathbf {v}}) \cap P\) and the k topmost points in \(S_\blacktriangledown ({\mathbf {v}}) \cap P\) respectively. Using \({\mathcal {K}}_{l_{\mathbf {u}}}(S_\blacktriangle ({\mathbf {v}}))\) and \({\mathcal {K}}_{l_{\mathbf {u}}}(S_\blacktriangledown ({\mathbf {v}}))\), we report these 2k points, and compute the closest-pair among them by brute-force. If the two points of \(\phi _\alpha \) are on the same side of \(l_{\mathbf {v}}\), then they are both contained in either \(S_\blacktriangle ({\mathbf {v}})\) or \(S_\blacktriangledown ({\mathbf {v}})\). So it suffices to compute the closest-pairs in \(S_\blacktriangle ({\mathbf {v}}) \cap R_\alpha \) and \(S_\blacktriangledown ({\mathbf {v}}) \cap R_\alpha \). Without loss of generality, we only need to consider the closest-pair in \(S_\blacktriangle ({\mathbf {v}}) \cap R_\alpha \). We denote by U the 3-sided rectangle obtained by removing the bottom boundary of \(R_\alpha \), and by \(Q_1\) (resp., \(Q_2\)) the quadrant obtained by removing the right (resp., left) boundary of U. We query \({\mathcal {A}}(S_1({\mathbf {v}}))\) with \(Q_1\), \({\mathcal {A}}(S_2({\mathbf {v}}))\) with \(Q_2\), and \({\mathcal {C}}(\varPhi _\blacktriangle ({\mathbf {v}}))\) with U. Clearly, the shortest one among the three answers is the closest-pair in \(S_\blacktriangle ({\mathbf {v}}) \cap R_\alpha \). Indeed, the three answers are all point-pairs in \(S_\blacktriangle ({\mathbf {v}}) \cap R_\alpha \). If the two points of the closest-pair in \(S_\blacktriangle ({\mathbf {v}}) \cap R_\alpha \) are both to the left (resp., right) of \(l_{\mathbf {u}}\), \({\mathcal {A}}(S_1({\mathbf {v}}))\) (resp., \({\mathcal {A}}(S_2({\mathbf {v}}))\)) reports it; otherwise, the closest-pair crosses \(l_{\mathbf {u}}\), and \({\mathcal {C}}(\varPhi _\blacktriangle ({\mathbf {v}}))\) reports it. Now we see how to compute \(\phi _\alpha \), and \(\phi _\beta \) can be computed symmetrically. Finally, taking the shortest one among \(\{\phi ,\phi _\alpha ,\phi _\beta \}\), the query R can be answered.

A straightforward analysis shows that the overall query time is \(O(\log n)\) even in the worst case. The worst-case space cost, however, is not near-linear, as the \({\mathcal {U}}\)-RSS data structure \({\mathcal {C}}\) may occupy quadratic space by Lemma 4.6. Interestingly, we can show that the average-case space cost is in fact \(O(n \log n)\). The crucial step is to bound the average-case space of the sub-structures stored at the secondary nodes. For the \({\mathcal {Q}}\)-RCP and TBEP/LREP sub-structures, the bound follows directly from the average-case performance of our \({\mathcal {Q}}\)-RCP data structure (Theorem 2.2) and our TBEP/LREP data structure (Lemma 4.4).

However, bounding the average-case space of the \({\mathcal {U}}\)-RSS sub-structures is much more difficult. By our construction, the segments stored in these sub-structures are 3-sided candidate pairs that cross a line. As such, we have to study the expected number of such candidate pairs in a random dataset. To this end, we recall Lemma 3.1. Let l be a vertical line, and \(S \propto \prod _{i=1}^n I_i\) be a random dataset drawn from vertical aligned segments \(I_1,\dots ,I_n\) as in Lemma 3.1. Suppose we build a \({\mathcal {U}}\)-RSS data structure \({\mathcal {C}}(\varPhi )\) on \(\varPhi = \varPhi _l(S,{\mathcal {U}}^\downarrow )\). Using Lemma 3.1, a direct calculation gives \({\mathbb {E}}[|\varPhi _l(S,{\mathcal {U}}^\downarrow )|] = O(\log ^2\! n)\). Unfortunately, this is not sufficient for bounding the average-case space of \({\mathcal {C}}(\varPhi )\), because \({\mathbb {E}}[\mathsf {Space}({\mathcal {C}}(\varPhi ))] = O({\mathbb {E}}[|\varPhi _l(S,{\mathcal {U}}^\downarrow )|^2])\) and in general \({\mathbb {E}}[|\varPhi _l(S,{\mathcal {U}}^\downarrow )|^2] \ne {\mathbb {E}}^2[|\varPhi _l(S,{\mathcal {U}}^\downarrow )|]\). Therefore, we need a bound on \({\mathbb {E}}[|\varPhi _l(S,{\mathcal {U}}^\downarrow )|^2]\), which can also be obtained using Lemma 3.1 but requires nontrivial work.
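To see concretely why the first moment alone does not suffice (this is a standard fact about second moments, not specific to our setting), write \(m = |\varPhi _l(S,{\mathcal {U}}^\downarrow )|\). Then

$$\begin{aligned} {\mathbb {E}}[\mathsf {Space}({\mathcal {C}}(\varPhi ))] = O({\mathbb {E}}[m^2]) \quad \text {and}\quad {\mathbb {E}}^2[m] \le {\mathbb {E}}[m^2], \end{aligned}$$

but the gap between the two moments can be polynomially large; for instance, if \(m = n\) with probability 1/n and \(m = 0\) otherwise, then \({\mathbb {E}}[m] = 1\) while \({\mathbb {E}}[m^2] = n\).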

Lemma 4.7

Let l be a vertical (resp., horizontal) line and \(S \propto \prod _{i=1}^n I_i\) where \(I_1,\dots ,I_n\) are distinct vertical (resp., horizontal) aligned segments. Then for \({\mathcal {X}} \in \{{\mathcal {U}}^\downarrow ,{\mathcal {U}}^\uparrow \}\) (resp., \({\mathcal {X}} \in \{{\mathcal {U}}^\leftarrow ,{\mathcal {U}}^\rightarrow \}\)), \({\mathbb {E}}[|\varPhi _l(S,{\mathcal {X}})|] =O(\log ^2\! n)\) and \({\mathbb {E}}[|\varPhi _l(S,{\mathcal {X}})|^2] = O(\log ^4\! n)\).

With the above lemma in hand, we are ready to bound the average-case space cost of our second data structure. Before that, let us first consider the preprocessing of this data structure. We are not able to bound the worst-case preprocessing time (as the worst-case space cost of the data structure is not well-bounded). Therefore, what we want here is a preprocessing algorithm with near-linear average-case running time. The 2D range tree itself and its \({\mathcal {Q}}\)-RCP sub-structures can be built in the same way as in our first data structure in \(O(n \log ^4\! n)\) worst-case time. The TBEP/LREP sub-structures on each secondary node \({\mathbf {v}}\) can be built in \(O(|S({\mathbf {v}})| \log |S({\mathbf {v}})|)\) time by Lemma 4.4; hence the overall time cost for this part is \(O(n \log ^3\! n)\). It suffices to consider the \({\mathcal {U}}\)-RSS sub-structures. The difficulty here is how to efficiently find the 3-sided candidate pairs that cross a line; once these candidate pairs are in hand, the \({\mathcal {U}}\)-RSS data structure can be built in \(O(m^2 \log m)\) time by Lemma 4.6 (where m is the number of pairs).
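For the TBEP/LREP part, the stated \(O(n \log ^3\! n)\) bound is the usual range-tree accounting (assuming, as is standard, that each point of S lies in \(O(\log ^2\! n)\) secondary nodes, so that \(\sum _{\mathbf {v}} |S({\mathbf {v}})| = O(n \log ^2\! n)\)):

$$\begin{aligned} \sum _{\mathbf {v}} O(|S({\mathbf {v}})| \log |S({\mathbf {v}})|) = O(\log n) \cdot \sum _{\mathbf {v}} |S({\mathbf {v}})| = O(\log n) \cdot O(n \log ^2\! n) = O(n \log ^3\! n). \end{aligned}$$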

Lemma 4.8

Let l be a vertical (resp., horizontal) line and \({\mathcal {X}} \in \{{\mathcal {U}}^\downarrow ,{\mathcal {U}}^\uparrow \}\) (resp., \({\mathcal {X}} \in \{{\mathcal {U}}^\leftarrow ,{\mathcal {U}}^\rightarrow \}\)). Then there exists an algorithm computing \(\varPhi _l(S,{\mathcal {X}})\) for an input dataset \(S \subseteq {\mathbb {R}}^2\) which runs in \(O(n \log ^4\! n)\) average-case time for a random \(S\propto \prod _{i=1}^n I_i\) where \(I_1,\dots ,I_n\) are distinct vertical (resp., horizontal) aligned segments. Therefore, one can construct the \({\mathcal {U}}\)-RSS data structure \({\mathcal {C}}(\varPhi _l(S,{\mathcal {X}}))\) described in Lemma 4.6 in \(O(n\log ^4\!n)\) average-case time for a random \(S \propto \prod _{i=1}^n I_i\).

Proof

It suffices to consider the case in which l is a vertical line. Suppose \({\mathcal {X}} = {\mathcal {U}}^\downarrow \). Let \(S \subseteq {\mathbb {R}}^2\) be a dataset of size n. We compute \(\varPhi _l(S,{\mathcal {U}}^\downarrow )\) as follows. First, we compute the rectangle candidate pairs \(\varPhi (S,{\mathcal {R}})\). As observed in [6], \(\varPhi (S,{\mathcal {R}})\) can be computed in \(O(n \log n + \varDelta )\) time where \(\varDelta = |\varPhi (S,{\mathcal {R}})|\). Note that \(\varPhi _l(S,{\mathcal {U}}^\downarrow ) \subseteq \varPhi (S,{\mathcal {U}}^\downarrow ) \subseteq \varPhi (S,{\mathcal {R}})\).

To further compute \(\varPhi (S,{\mathcal {U}}^\downarrow )\) from \(\varPhi (S,{\mathcal {R}})\), we build an \({\mathcal {R}}\)-RCP data structure \({\mathcal {D}}_1(S)\) as described in Theorem 4.5, which takes \(O(n \log ^4\! n)\) time. For each \(\phi = (a,b) \in \varPhi (S,{\mathcal {R}})\), let \(U_\phi = [\min {\{a.x,b.x\}},\max {\{a.x,b.x\}}]\times (-\infty ,\max {\{a.y,b.y\}}] \in {\mathcal {U}}^\downarrow \) be the smallest bottom-unbounded 3-sided rectangle that contains \(\phi \). Clearly, \(\phi \in \varPhi _l(S,{\mathcal {U}}^\downarrow )\) iff \(\phi \) crosses l and is the closest-pair in \(S \cap U_\phi \). Therefore, for each \(\phi \in \varPhi (S,{\mathcal {R}})\) that crosses l, we can query \({\mathcal {D}}_1(S)\) with \(U_\phi \) (note that an \({\mathcal {R}}\)-RCP data structure can also answer 3-sided rectangle queries) to check if \(\phi \in \varPhi _l(S,{\mathcal {U}}^\downarrow )\). Since the query time of \({\mathcal {D}}_1(S)\) is \(O(\log ^2\! n)\), we can compute \(\varPhi _l(S,{\mathcal {U}}^\downarrow )\) in \(O(\varDelta \log ^2\! n)\) time after \({\mathcal {D}}_1(S)\) is built. Therefore, the total time for computing \(\varPhi _l(S,{\mathcal {U}}^\downarrow )\) is \(O(n \log ^4\! n + \varDelta \log ^2\! n)\).

Gupta et al. [6] proved that \({\mathbb {E}}[\varDelta ] = O(n \log n)\) when S is generated uniformly and independently from the unit square. Although \(S\propto \prod _{i=1}^n I_i\) in our setting, the same proof actually results in the same bound, i.e., \({\mathbb {E}}[\varDelta ]=O(n \log n)\). As such, our algorithm for computing \(\varPhi _l(S,{\mathcal {U}}^\downarrow )\) takes \(O(n \log ^4\! n)\) average-case time when \(S\propto \prod _{i=1}^nI_i\). To further build the data structure \({\mathcal {C}}(\varPhi _l(S,{\mathcal {X}}))\), we apply Lemma 4.6, which takes \(O(m^2 \log m)\) time where \(m = |\varPhi _l(S,{\mathcal {X}})|\). Lemma 4.7 shows that \({\mathbb {E}}[m^2]=O(\log ^4\! n)\), which implies \({\mathbb {E}}[m^2 \log m] = O(\log ^5\! n)\) because the maximum possible value of the random variable m is bounded by \(O(n^2)\). Therefore, \({\mathcal {C}}(\varPhi _l(S,{\mathcal {X}}))\) can be constructed in \(O(n \log ^4\! n)\) average-case time when \(S \propto \prod _{i=1}^n I_i\), including the time for computing \(\varPhi _l(S,{\mathcal {U}}^\downarrow )\). \(\square \)
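For intuition, the characterization used above—\(\phi \in \varPhi _l(S,{\mathcal {U}}^\downarrow )\) iff \(\phi \) crosses l and is the closest-pair in \(S \cap U_\phi \)—can be turned into the following brute-force reference sketch (purely illustrative and far slower than the algorithm of Lemma 4.8; the function and variable names are ours, points are modeled as coordinate tuples, and degenerate ties are handled naively).

```python
import itertools
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def phi_l_down(S, c):
    """Brute-force reference: pairs of S that cross the vertical line x = c
    and are the closest pair inside the smallest bottom-unbounded 3-sided
    rectangle containing them (the characterization in the proof above)."""
    result = []
    for a, b in itertools.combinations(S, 2):
        if not (min(a[0], b[0]) <= c <= max(a[0], b[0])):
            continue  # the pair does not cross l
        lo_x, hi_x = min(a[0], b[0]), max(a[0], b[0])
        top_y = max(a[1], b[1])
        # S ∩ U_phi for the smallest bottom-unbounded 3-sided rectangle U_phi
        inside = [p for p in S if lo_x <= p[0] <= hi_x and p[1] <= top_y]
        d_ab = dist(a, b)
        if all(dist(p, q) >= d_ab for p, q in itertools.combinations(inside, 2)):
            result.append((a, b))
    return result

# Example usage on a tiny dataset:
# phi_l_down([(0.1, 0.9), (0.4, 0.2), (0.6, 0.5), (0.9, 0.8)], c=0.5)
```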

Now we are ready to prove the performance of our second data structure.

Theorem 4.9

There exists an \({\mathcal {R}}\)-RCP data structure \({\mathcal {D}}_2\) such that:

  • For any \(S \subseteq {\mathbb {R}}^2\) of size n, \({ {\mathsf {Qtime}}}({\mathcal {D}}_2(S)) = O(\log n)\).

  • For a random \(S \propto R^n\) where R is the unit square or more generally an arbitrary axes-parallel rectangle, \({\mathbb {E}}[{ {\mathsf {Space}}}({\mathcal {D}}_2(S))] = O(n \log n)\).

Furthermore, the above data structure can be constructed in \(O(n \log ^6\! n)\) average-case time.

4.4 Combining the Two Data Structures

We now combine the two data structures \({\mathcal {D}}_1\) (Theorem 4.5) and \({\mathcal {D}}_2\) (Theorem 4.9) to obtain a single data structure \({\mathcal {D}}\) that achieves the desired worst-case and average-case bounds simultaneously (and can be constructed in near-linear time). For a dataset \(S \subseteq {\mathbb {R}}^2\) of size n, if \(\mathsf {Space}({\mathcal {D}}_2(S)) \ge n \log ^2\! n\) or the time for constructing \({\mathcal {D}}_2(S)\) (using Theorem 4.9) is greater than \(n \log ^7\! n\), we set \({\mathcal {D}}(S) = {\mathcal {D}}_1(S)\); otherwise we set \({\mathcal {D}}(S) = {\mathcal {D}}_2(S)\). The reason for the choice of the thresholds \(n \log ^2\! n\) and \(n \log ^7\! n\) will become clear shortly (in the proof of Theorem 4.10). The worst-case bounds of \({\mathcal {D}}\) follow directly, whereas establishing the average-case bounds of \({\mathcal {D}}\) requires a careful analysis using Markov’s inequality.

Theorem 4.10

There exists an \({\mathcal {R}}\)-RCP data structure \({\mathcal {D}}\) such that:

  • For any \(S \subseteq {\mathbb {R}}^2\) of size n,

    $$\begin{aligned} { {\mathsf {Space}}}({\mathcal {D}}(S)) = O(n \log ^2\! n)\quad \text {and}\quad { {\mathsf {Qtime}}}({\mathcal {D}}(S)) = O(\log ^2\! n). \end{aligned}$$
  • For a random \(S \propto R^n\) where R is the unit square or more generally an arbitrary axes-parallel rectangle,

    $$\begin{aligned} {\mathbb {E}}[{ {\mathsf {Space}}}({\mathcal {D}}(S))]=O(n\log n)\quad \text {and}\quad {\mathbb {E}}[{ {\mathsf {Qtime}}}({\mathcal {D}}(S))] = O(\log n). \end{aligned}$$

Furthermore, the above data structure can be constructed in \(O(n \log ^7\! n)\) worst-case time.

Proof

As mentioned above, our data structure \({\mathcal {D}}\) is obtained by combining \({\mathcal {D}}_1\) (Theorem 4.5) and \({\mathcal {D}}_2\) (Theorem 4.9) as follows. For any \(S \subseteq {\mathbb {R}}^2\) of size n, if \(\mathsf {Space}({\mathcal {D}}_2(S))\ge n \log ^2\! n\) or the time for constructing \({\mathcal {D}}_2(S)\) (using Theorem 4.9) is greater than \(n \log ^7\! n\), we set \({\mathcal {D}}(S) = {\mathcal {D}}_1(S)\); otherwise we set \({\mathcal {D}}(S) = {\mathcal {D}}_2(S)\). (The reason for choosing \(n\log ^7\! n\) will be clear at the end of the proof.) We claim that \({\mathcal {D}}\) has the desired performance.

We first consider the space cost and query time. Let \(S \subseteq {\mathbb {R}}^2\) be a dataset of size n. It is clear from the construction that \(\mathsf {Space}({\mathcal {D}}(S)) = O(n \log ^2\! n)\). Also, \(\mathsf {Qtime}({\mathcal {D}}(S)) = O(\log ^2\! n)\), since \(\mathsf {Qtime}({\mathcal {D}}_1(S)) = O(\log ^2\! n)\) and \(\mathsf {Qtime}({\mathcal {D}}_2(S)) = O(\log n)\). To analyze the average-case space and query time of \({\mathcal {D}}\), let \(S \propto R^n\) for an axes-parallel rectangle R. Define E as the event \(\mathsf {Space}({\mathcal {D}}_2(S)) \ge n \log ^2\! n\) and \(E'\) as the event that the time for constructing \({\mathcal {D}}_2(S)\) (using Theorem 4.9) is greater than \(n \log ^7\! n\). Let \(\lnot E\) and \(\lnot E'\) be the complement events of E and \(E'\), respectively. Since \({\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_2(S))] = O(n \log n)\), we have \({\text {Pr}}[E] = O(1/\log n)\) by Markov’s inequality. Similarly, since the average-case preprocessing time of \({\mathcal {D}}_2\) is \(O(n \log ^6\! n)\), we also have \({\text {Pr}}[E'] = O(1/\log n)\) by Markov’s inequality. Therefore, \({\text {Pr}}[E''] = O(1/\log n)\) for \(E'' = E \vee E'\).
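For concreteness, the Markov step for E reads as follows (the step for \(E'\) is analogous, using the \(O(n \log ^6\! n)\) average-case construction time and the \(n \log ^7\! n\) budget), and \({\text {Pr}}[E''] \le {\text {Pr}}[E] + {\text {Pr}}[E']\) by the union bound:

$$\begin{aligned} {\text {Pr}}[E] = {\text {Pr}}[\mathsf {Space}({\mathcal {D}}_2(S)) \ge n \log ^2\! n] \le \frac{{\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_2(S))]}{n \log ^2\! n} = \frac{O(n \log n)}{n \log ^2\! n} = O\Bigl(\frac{1}{\log n}\Bigr). \end{aligned}$$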

To bound the average-case space cost, we observe

$$\begin{aligned} {\mathbb {E}}[\mathsf {Space}({\mathcal {D}}(S))]&={\text {Pr}}[E''] \cdot {\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_1(S))|E'']\\&\qquad + {\text {Pr}}[\lnot E''] \cdot {\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_2(S))|\lnot E'']. \end{aligned}$$

Note that \({\text {Pr}}[E''] \cdot {\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_1(S))|E''] = O(n \log n)\), since \(\mathsf {Space}({\mathcal {D}}_1(S)) = O(n \log ^2\! n)\) and \({\text {Pr}}[E''] = O(1/\log n)\). Also, \({\text {Pr}}[\lnot E''] \cdot {\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_2(S))|\lnot E''] \le {\mathbb {E}}[\mathsf {Space}({\mathcal {D}}_2(S))] = O(n \log n)\). Thus, \({\mathbb {E}}[\mathsf {Space}({\mathcal {D}}(S))] = O(n \log n)\). To bound the average-case query time, let \(T_i\) be the worst-case query time of \({\mathcal {D}}_i\) built on a dataset of size n, for \(i \in \{1,2\}\). Then \({\mathbb {E}}[\mathsf {Qtime}({\mathcal {D}}(S))] \le {\text {Pr}}[E''] \cdot T_1 +{\text {Pr}}[\lnot E''] \cdot T_2\). Since \(T_1 = O(\log ^2\! n)\), \(T_2 = O(\log n)\), and \({\text {Pr}}[E''] = O(1/\log n)\), we have \({\mathbb {E}}[\mathsf {Qtime}({\mathcal {D}}(S))] = O(\log n)\).

Finally, we consider how to construct \({\mathcal {D}}(S)\) in \(O(n \log ^7\! n)\) time for a given dataset \(S \subseteq {\mathbb {R}}^2\) of size n. We first try to build \({\mathcal {D}}_2(S)\) using the algorithm in Theorem 4.9. We run the algorithm for \(n\log ^7\!n\) time units. If \({\mathcal {D}}_2(S)\) is not successfully constructed within this amount of time, we terminate the algorithm and build \({\mathcal {D}}_1(S)\) in \(O(n\log ^4\!n)\) time using Theorem 4.5, since we have \({\mathcal {D}}(S)={\mathcal {D}}_1(S)\) in this case. Now assume \({\mathcal {D}}_2(S)\) is successfully constructed within \(n\log ^7\!n\) time. We then check whether \(\mathsf {Space}({\mathcal {D}}_2(S))\le n\log ^2\!n\). If so, \({\mathcal {D}}(S)={\mathcal {D}}_2(S)\) and we are done. Otherwise, we have \({\mathcal {D}}(S) = {\mathcal {D}}_1(S)\), and thus we build \({\mathcal {D}}_1(S)\) in \(O(n\log ^4\! n)\) time. By the above argument, we can construct \({\mathcal {D}}(S)\) in \(O(n \log ^7\! n)\) worst-case time. \(\square \)
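The construction just described can be summarized by the following minimal sketch (illustrative only: build_D1, build_D2, and space_of are hypothetical stand-ins for the constructions of Theorems 4.5 and 4.9 and a space-cost accessor, and the time budget is modeled by having build_D2 return None once the budget is exhausted).

```python
import math

def build_combined(S, build_D1, build_D2, space_of):
    """Sketch of the construction of D(S) in the proof of Theorem 4.10."""
    n = len(S)
    space_budget = n * math.log2(n) ** 2   # threshold n log^2 n
    time_budget = n * math.log2(n) ** 7    # budget n log^7 n (time units)

    # Try to build D2(S) within the time budget; build_D2 is assumed to
    # return None if the budget is exhausted before it finishes.
    D2 = build_D2(S, time_budget)
    if D2 is not None and space_of(D2) <= space_budget:
        return D2                          # D(S) = D2(S)
    return build_D1(S)                     # otherwise D(S) = D1(S)
```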

5 Halfplane Query

We consider the RCP problem for halfplane queries, i.e., the \({\mathcal {H}}\)-RCP problem. In order to solve the \({\mathcal {H}}\)-RCP problem, it suffices to consider the \({\mathcal {H}}^\uparrow \)-RCP problem. Let \(S \subseteq {\mathbb {R}}^2\) be the dataset of size n. Before discussing the details, we first give an overview of our method.

Overview. Similar to what we did for the quadrant and strip RCP problems, we consider candidate pairs. It was known [1] that the candidate pairs with respect to the query space \({\mathcal {H}}^\uparrow \) do not cross each other (when regarded as segments), which implies that the candidate pairs are edges of a planar graph (with vertex set S) and hence \(|\varPhi (S,{\mathcal {H}}^\uparrow )| = O(n)\). The main difficulty here is to store these candidate pairs in a linear-space data structure that can answer halfplane queries in logarithmic time. To this end, we use duality [3] to map each candidate pair in the primal space to a wedge in the dual space. We then establish a key combinatorial result which shows that the arrangement obtained by overlaying the wedges in the dual space has linear complexity. This result relies heavily on the fact that the candidate pairs are non-crossing. As a byproduct of this result, we also obtain an optimal halfplane RSS data structure for non-crossing segments (see Sect. 5.2). To achieve the average-case bounds, we prove that the expected number of candidate pairs with respect to halfplane queries is \(O(\log ^2 \!n)\).

We begin by introducing the standard duality technique [3]. A non-vertical line \(l: y=ux+v\) in \({\mathbb {R}}^2\) is dual to the point \(l^*=(u, -v)\) and a point \(p = (s, t) \in {\mathbb {R}}^2\) is dual to the line \(p^*:y=sx-t\). A basic property of duality is that \(p\in l^\uparrow \) (resp., \(p\in l^\downarrow \)) iff \(l^* \in (p^*)^\uparrow \) (resp., \(l^*\in (p^*)^\downarrow \)). To make the exposition cleaner, we distinguish between primal space and dual space, which are two copies of \({\mathbb {R}}^2\). The dataset S and query ranges are assumed to lie in the primal space, while their dual objects are assumed to lie in the dual space.
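For completeness, this property is a one-line check from the definitions above:

$$\begin{aligned} p = (s,t) \in l^\uparrow \iff t \ge us + v \iff -v \ge su - t \iff l^* = (u,-v) \in (p^*)^\uparrow , \end{aligned}$$

and the statement for \(l^\downarrow \) and \((p^*)^\downarrow \) follows by reversing the inequalities.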

Duality allows us to transform the \({\mathcal {H}}^\uparrow \)-RCP problem into a point location problem as follows. Let \(H = l^\uparrow \in {\mathcal {H}}^\uparrow \) be a query range. The line l bounding H is dual to the point \(l^*\) in the dual space; for convenience, we also call \(l^*\) the dual point of H. If we decompose the dual space into “cells” such that the query ranges whose dual points lie in the same cell have the same answer, then point location techniques can be applied to solve the problem directly. Note that this decomposition must be a polygonal subdivision \(\varGamma \) of \({\mathbb {R}}^2\), which consists of vertices, straight-line edges, and polygonal faces (i.e., cells). This is because the cell-boundaries must be defined by the dual lines of the points in S.

In order to analyze the space cost and query time, we need to study the complexity \(|\varGamma |\) of \(\varGamma \). A trivial \(O(n^2)\) upper bound for \(|\varGamma |\) follows from the fact that the subdivision formed by the n dual lines of the points in S has \(O(n^2)\) complexity. Surprisingly, using additional properties of the problem, we can show that \(|\varGamma | = O(n)\), which is a key ingredient of our result in this section.

Fig. 5  Illustrating the upward-open wedge \(W_i\)

Suppose \(\varPhi (S,{\mathcal {H}}^\uparrow ) = \{\phi _1,\dots ,\phi _m\}\) where \(\phi _i = (a_i,b_i)\) and \(\phi _1,\dots ,\phi _m\) are sorted in increasing order of their lengths. It was shown in [1] that \(m = O(n)\), and the candidate pairs do not cross each other (when identified as segments), i.e., the segments \([a_i,b_i]\) and \([a_j,b_j]\) do not cross for any \(i \ne j\). The non-crossing property of the candidate pairs is important and will be used later for proving Lemma 5.1. With this in hand, we now consider the subdivision \(\varGamma \).

Let \(H=l^\uparrow \in {\mathcal {H}}^\uparrow \) be a query range. By the property of duality, \(\phi _i\) is contained in H iff \(l^* \in (a_i^*)^\uparrow \) and \(l^* \in (b_i^*)^\uparrow \), i.e., \(l^*\) is in the upward-open wedge \(W_i\) generated by the lines \(a_i^*\) and \(b_i^*\) (in the dual space); see Fig. 5. As such, the closest-pair in \(S\cap H\) to be reported is \(\phi _\eta \) for \(\eta =\min {\{i: l^* \in W_i\}}\). Therefore, \(\varGamma \) can be constructed by successively overlaying the wedges \(W_1,\dots ,W_m\) (similarly to what we see in Sect. 2).

Formally, we begin with a trivial subdivision \(\varGamma _0\) of \({\mathbb {R}}^2\), which consists of only one face, the entire plane. Suppose \(\varGamma _{i-1}\) is constructed, which has an outer face \(F_{i-1}\) equal to the complement of \(\bigcup _{j=1}^{i-1} W_j\) in \({\mathbb {R}}^2\). Now we construct a new subdivision \(\varGamma _i\) by “inserting” \(W_i\) to \(\varGamma _{i-1}\). Specifically, \(\varGamma _i\) is obtained from \(\varGamma _{i-1}\) by decomposing the outer face \(F_{i-1}\) via the wedge \(W_i\); that is, we decompose \(F_{i-1}\) into several smaller faces: one is \(F_{i-1}\setminus W_i\) and the others are the connected components of \(F_{i-1}\cap W_i\). Note that \(F_{i-1}\setminus W_i\) is the complement of \(\bigcup _{j=1}^{i} W_j\), which is connected (as one can easily verify) and becomes the outer face \(F_i\) of \(\varGamma _i\). In this way, we construct \(\varGamma _1,\dots ,\varGamma _m\) in order, and it is clear that \(\varGamma _m=\varGamma \). The linear upper bound for \(|\varGamma |\) follows from the following technical result.

Lemma 5.1

\(|\varGamma _i| - |\varGamma _{i-1}| = O(1)\) for \(i \in \{1,\dots ,m\}\). In particular, \(|\varGamma | = O(m)\).

Fig. 6  Illustrating the various cases in Lemma 5.1

Proof

Let \(F_i\) be the outer face of \(\varGamma _i\), and \(\partial W_i\) be the boundary of the wedge \(W_i\) (which consists of two rays emanating from the intersection point of \(a_i^*\) and \(b_i^*\)). We first note that, to deduce that \(|\varGamma _i|-|\varGamma _{i-1}|=O(1)\), it suffices to show that the number of the connected components of \(\partial W_i\cap F_{i-1}\) is constant. This is because every connected component of \(\partial W_i \cap F_{i-1}\) contributes to \(\varGamma _i\) exactly one new face, a constant number of new vertices, and a constant number of new edges. Indeed, we only need to check one branch of \(\partial W_i\) (i.e., one of the two rays of \(\partial W_i\)), say the ray contained in \(a_i^*\) (we denote it by r). We will show that \(r \cap F_{i-1}\) has O(1) connected components.

Without loss of generality, we may assume that \(a_i\) is to the left of \(b_i\). Then each point on r is dual to a line in the primal space, which goes through the point \(a_i\) with the segment \([a_i, b_i]\) above it. Note that \(r\cap F_{i-1}=r\setminus \bigcup _{j=1}^{i-1}W_j=r\setminus \bigcup _{j=1}^{i-1}(r \cap W_j)\), and each \(r \cap W_j\) is a connected portion of r. We consider each \(j \in \{1,\dots ,i-1\}\) and analyze the intersection \(r\cap W_j\). Let \(l_i\) be the line through \(a_i\) and \(b_i\). There are three cases to be considered separately: (1) \(a_j,b_j \in l_i^\uparrow \), (2) \(a_j,b_j \in l_i^\downarrow \), or (3) one of \(a_j,b_j\) is in \(l_i^\uparrow \setminus l_i\) (i.e., strictly above \(l_i\)) while the other is in \(l_i^\downarrow \setminus l_i\) (i.e., strictly below \(l_i\)).

[Case 1] In this case, \(a_j,b_j \in l_i^\uparrow \). The wedge \(W_j\) must contain the initial point \(r_0\) of r (i.e., the intersection point of \(a_i^*\) and \(b_i^*\), which is the dual of the line \(l_i\)), because \(r_0\in (a_j^*)^\uparrow \) and \(r_0\in (b_j^*)^\uparrow \). (See Fig. 6a.)

[Case 2] In this case, \(a_j,b_j \in l_i^\downarrow \). We claim that either \(r\cap W_j\) is empty or it contains the infinite end of r (i.e., the point at infinity along r). Imagine that we have a point p moving along r from \(r_0\) to the infinite end of r. Then p is dual to a line in the primal space rotating clockwise around \(a_i\) from the line \(l_i\) to the vertical line through \(a_i\); see Fig. 6b. Note that \(p\in r\cap W_j\) (in the dual space) only when \(a_j,b_j\in (p^*)^\uparrow \) (in the primal space). But \(a_j,b_j\in l_i^\downarrow \) in this case. When p is moving, the region \(l_i^\downarrow \cap (p^*)^\uparrow \) expands. As such, one can easily see that \(r \cap W_j\) must contain the infinite end of r if it is nonempty. (See Fig. 6, c and d.)

[Case 3] In this case, one point of \(a_j,b_j\) is in \(l_i^\uparrow \setminus l_i\) while the other is in \(l_i^\downarrow \setminus l_i\). Thus, the segment \([a_j, b_j]\) must intersect the line \(l_i\). However, as argued before, the segments \([a_j, b_j]\) and \([a_i, b_i]\) do not cross. So the intersection point c of \([a_j, b_j]\) and \(l_i\) is either to the left of \(a_i\) or to the right of \(b_i\) (recall that \(a_i\) is assumed to be to the left of \(b_i\)).

If c is to the left of \(a_i\), we claim that \(r \cap W_j\) is empty. Observe that the dual line of any point on r is through \(a_i\) and below \(b_i\), meaning that it must be above c (as c is to the left of \(a_i\)). In other words, the dual line of any point on r is above at least one of \(a_j,b_j\), and thus any point on r is not contained in the wedge \(W_j\), i.e., \(r \cap W_j\) is empty. (See Fig. 6e.)

The most subtle case occurs when c is to the right of \(b_i\). In such a case, we consider the line through \(a_i\) perpendicular to \(l_i\), which we denote by \(l_i'\). We first argue that both \(a_j\) and \(b_j\) must be on the same side of \(l_i'\) as \(b_i\). Since c is to the right of \(b_i\), at least one of \(a_j,b_j\) is on the same side of \(l_i'\) as \(b_i\). However, we notice that \([a_j,b_j]\) cannot intersect \(l_i'\), otherwise the length of \(\phi _j\) is (strictly) greater than that of \(\phi _i\), contradicting the fact that \(j<i\) (recall that \(\phi _1,\dots ,\phi _m\) is sorted in increasing order of their lengths). So the only possibility is that \(a_j,b_j,b_i\) are on the same side of \(l_i'\). Now we further have two sub-cases.

  • \(l_i'\) has no dual point (i.e., \(l_i'\) is vertical) or its dual point \((l_i')^*\) is not on the ray r. In this case, consider a point p moving along r from \(r_0\) to the infinite end of r. Clearly, when p moves, the region \((l_i')^\rightarrow \cap (p^*)^\uparrow \) expands. Thus, either \(r \cap W_j\) is empty or it contains the infinite end of r. (See Fig. 6, f and g.)

  • \((l_i')^*\) is on r. Then \(r \cap W_j\) may be a connected portion of r containing neither \(r_0\) nor the infinite end of r. However, as \(b_i \in (l_i')^\uparrow \) in this case, we have \(a_j,b_j \in (l_i')^\uparrow \) (recall that \(a_j,b_j,b_i\) are on the same side of \(l_i'\)). This implies that \(r \cap W_j\) contains \((l_i')^*\). (See Fig. 6h.)

In sum, we conclude that for any \(j\in \{1,\dots ,i-1\}\), the intersection \(r \cap W_j\) might be (i) empty, or (ii) a connected portion of r containing \(r_0\), or (iii) a connected portion of r containing the infinite end of r, or (iv) a connected portion of r containing \((l_i')^*\) (if \((l_i')^*\) is on r). As such, the union \(\bigcup _{j=1}^{i-1}(r\cap W_j)\) can have at most three connected components, among which one contains \(r_0\), one contains the infinite end of r, and one contains \((l_i')^*\). Therefore, the complement of \(\bigcup _{j=1}^{i-1}(r\cap W_j)\) in r, i.e., \(r \cap F_{i-1}\), has at most two connected components. This in turn implies that \(\partial W_i\cap F_{i-1}\) has only a constant number of connected components, and hence \(|\varGamma _i|-|\varGamma _{i-1}|=O(1)\). Finally, since \(|\varGamma _0|=O(1)\) and \(m=O(n)\), we immediately have \(|\varGamma |=|\varGamma _m|=O(m)\). \(\square \)

With the above result in hand, we can build an optimal point-location data structure for \(\varGamma \) using O(m) space with \(O(\log m)\) query time to solve the RCP problem. Since \(m = O(n)\), we obtain an \({\mathcal {H}}\)-RCP data structure using O(n) space and \(O(\log n)\) query time in worst-case.

Next, we analyze the average-case bounds of the above data structure. In fact, it suffices to bound the expected number of the candidate pairs. Surprisingly, we have the following poly-logarithmic bound.

Lemma 5.2

For a random dataset \(S \propto R^n\) where R is an axes-parallel rectangle, \({\mathbb {E}}[|\varPhi (S,{\mathcal {H}})|] = O(\log ^2\! n)\).

5.1 Preprocessing

In this section, we consider how to construct the above data structure efficiently. The key step is to construct the subdivision \(\varGamma \) of the dual \({\mathbb {R}}^2\). Since \(|\varGamma | = O(n)\), once \(\varGamma \) is constructed, one can build in \(O(n \log n)\) time the point-location data structure for \(\varGamma \), and hence our \({\mathcal {H}}^\uparrow \)-RCP data structure.

Let us first consider an easier task, in which \(\varPhi (S,{\mathcal {H}}^\uparrow )\) is already given beforehand. In this case, we show that \(\varGamma \) can be constructed in \(O(n \log n)\) time. As in Sect. 5, suppose \(\varPhi (S,{\mathcal {H}}^\uparrow ) = \{\phi _1,\dots ,\phi _m\}\) where \(\phi _1,\dots ,\phi _m\) are sorted in increasing order of their lengths. Recall that in Sect. 5 we defined the m subdivisions \(\varGamma _0,\dots ,\varGamma _m\). Our basic idea for constructing \(\varGamma \) is to begin with \(\varGamma _0\) and iteratively construct \(\varGamma _i\) from \(\varGamma _{i-1}\) by inserting the wedge \(W_i\) dual to \(\phi _i\). In this process, a crucial thing is to maintain the outer face \(F_i\) (or its boundary). Note that the boundary \(\partial F_i\) of \(F_i\) (i.e., the upper envelope of \(F_i\)) is an x-monotone polygonal chain consisting of segments and two infinite rays; we call these kinds of chains left-right polylines and call their pieces fractions. Naturally, a binary search tree can be used to store a left-right polyline; the keys are its fractions in the left-right order. Therefore, we shall use a (balanced) BST \({\mathcal {T}}\) to maintain \(\partial F_i\). That is, at the end of the i-th iteration, we guarantee the left-right polyline stored in \({\mathcal {T}}\) is \(\partial F_i\). At each node of \({\mathcal {T}}\), besides storing the corresponding fraction, we also store the wedge \(W_j\) which contributes this fraction.

Suppose we are now at the beginning of the i-th iteration. We have \(\varGamma _{i-1}\) in hand and \({\mathcal {T}}\) stores \(\partial F_{i-1}\). We need to “insert” the wedge \(W_i\) to generate \(\varGamma _i\) from \(\varGamma _{i-1}\), and update \({\mathcal {T}}\). To this end, the first step is to compute \(\partial W_i \cap F_{i-1}\). For now, let us assume that \(\partial W_i \cap F_{i-1}\) is already computed in \(O(\log {|{\mathcal {T}}|})\) time; later we will explain how to achieve this. With \(\partial W_i \cap F_{i-1}\) in hand, constructing \(\varGamma _i\) is fairly easy. By the proof of Lemma 5.1, \(\partial W_i \cap F_{i-1}\) has O(1) connected components. We consider these components one-by-one. Let \(\xi \) be a component, which is an x-monotone polygonal chain with endpoints (if any) on \(\partial F_{i-1}\) (indeed, \(\xi \) consists of at most two pieces as it is a portion of \(\partial W_i\)). For convenience, assume \(\xi \) has a left endpoint u and a right endpoint v. Then \(\xi \) contributes a new (inner) face to \(\varGamma _i\), which is the region bounded by \(\xi \) and the portion \(\sigma \) of \(\partial F_{i-1}\) between u and v. We then use \({\mathcal {T}}\) to report all the fractions of \(\partial F_{i-1}\) that intersect \(\sigma \) in left-right order, using which the corresponding new face can be directly constructed. The time cost for reporting the fractions is \(O(\log {|{\mathcal {T}}|}+ k)\) where k is the number of the reported fractions, since they are consecutive in \({\mathcal {T}}\).

After all the components are considered, we can construct \(\varGamma _i\) by adding the new faces to \(\varGamma _{i-1}\) (and adjusting the involved edges/vertices if needed). As there are O(1) components, the total time cost for constructing \(\varGamma _i\) from \(\varGamma _{i-1}\) is \(O(\log |{\mathcal {T}}| + K_i)\), where \(K_i\) is the total number of the fractions reported from \({\mathcal {T}}\). But we can charge the reported fractions to the corresponding new faces, and the fractions charged to each face are at most as many as its edges. Therefore, \(\sum _{i=1}^m K_i = O(m)\), and this part of the time cost is amortized \(O(\log m)\) for each iteration.

The remaining task is to update the left-right polyline \({\mathcal {T}}\). At the beginning of the i-th iteration, \({\mathcal {T}}\) stores \(\partial F_{i-1}\), and we need to modify it to \(\partial F_i\). Clearly, \(\partial F_i\) is obtained by using the connected components of \(\partial W_i \cap F_{i-1}\) to replace the corresponding portions of \(\partial F_{i-1}\). We consider the components of \(\partial W_i \cap F_{i-1}\) one-by-one (there is only a constant number of components to be considered, by the proof of Lemma 5.1). Let \(\xi \) be a component, which must be an x-monotone polygonal chain consisting of at most two pieces. For convenience, assume \(\xi \) has a left endpoint u and a right endpoint v. It is clear that \(u,v \in \partial F_{i-1}\). We need to replace the portion \(\sigma \) of \(\partial F_{i-1}\) between u and v with \(\xi \); we call this a Replace operation. To achieve this, we first report the fractions of \(\partial F_{i-1}\) intersecting \(\sigma \), using the approach described above. Suppose the reported fractions are \(\gamma _1,\dots ,\gamma _k\), sorted in left-right order. Then \(u \in \gamma _1\) and \(v \in \gamma _k\). Clearly, the fractions \(\gamma _2,\dots ,\gamma _{k-1}\) should be removed, as they disappear after replacing \(\sigma \) with \(\xi \). This can be done by deleting the corresponding nodes from \({\mathcal {T}}\) via \(k-2\) BST-deletion operations. Also, we need to modify \(\gamma _1\) and \(\gamma _k\): the portion of \(\gamma _1\) (resp., \(\gamma _k\)) to the right (resp., left) of u (resp., v) should be “truncated”. This can be done by directly updating the information stored in the two corresponding nodes. Finally, \(\xi \) should be inserted. Each piece of \(\xi \) becomes a new fraction, for which we create a new node storing the information of the fraction and insert it into \({\mathcal {T}}\) via a BST-insertion operation.

Now we analyze the time cost of this Replace operation. Let \(|{\mathcal {T}}|\) be the size of \({\mathcal {T}}\) before the operation. The time cost for reporting is \(O(\log {|{\mathcal {T}}|}+k)\). Removing \(\gamma _2,\dots ,\gamma _{k-1}\) takes \(O(k\log {|{\mathcal {T}}|})\) time. Modifying \(\gamma _1,\gamma _k\) and inserting \(\xi \) takes \(O(\log {|{\mathcal {T}}|})\) time (note that \(\xi \) has at most two pieces). So the total time of this Replace operation is \(O(k \log {|{\mathcal {T}}|})\). If \(k \le 2\), then the time cost is just \(O(\log {|{\mathcal {T}}|})\). If \(k>2\), we observe that there are \(\varOmega (k)\) nodes deleted from \({\mathcal {T}}\) in this Replace operation. Note that the total number of the nodes deleted from \({\mathcal {T}}\) cannot exceed the total number of the nodes inserted. Over the m iterations, we have in total O(m) Replace operations, each of which inserts O(1) nodes into \({\mathcal {T}}\). Therefore, one can delete at most O(m) nodes from \({\mathcal {T}}\) in total. It follows that the total time cost for all Replace operations is \(O(m \log m)\), which is also the total time cost for updating \({\mathcal {T}}\). In other words, \({\mathcal {T}}\) can be updated in amortized \(O(\log m)\) time for each iteration. As such, the overall time cost for constructing \(\varGamma \) is \(O(m\log m)\), and thus \(O(n\log n)\).
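The following minimal sketch illustrates the Replace operation on a left-right polyline (illustrative only: it keeps the fractions in a plain Python list sorted by left x-coordinate rather than in a balanced BST, so it does not achieve the logarithmic bounds above; fractions are modeled as dictionaries with an x-range and the index of the contributing wedge, and all names are ours).

```python
import bisect

class LeftRightPolyline:
    """Fractions of the chain, kept sorted by their left x-coordinate."""

    def __init__(self, fractions):
        # each fraction: {'xl': float, 'xr': float, 'wedge': int}
        self.fracs = sorted(fractions, key=lambda f: f['xl'])

    def _locate(self, x):
        """Index of the fraction whose x-range contains x (the chain is
        assumed to cover the whole x-axis, as a left-right polyline does)."""
        keys = [f['xl'] for f in self.fracs]
        return max(0, bisect.bisect_right(keys, x) - 1)

    def replace(self, xu, xv, new_fracs):
        """Replace the portion of the chain between x = xu and x = xv
        (the endpoints u, v of a component xi) by the fractions of xi:
        truncate the two boundary fractions and drop the interior ones,
        as in the Replace operation described above."""
        i, j = self._locate(xu), self._locate(xv)
        left_part = dict(self.fracs[i], xr=xu)    # gamma_1 truncated at u
        right_part = dict(self.fracs[j], xl=xv)   # gamma_k truncated at v
        self.fracs[i:j + 1] = [left_part] + list(new_fracs) + [right_part]
```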

We now explain the missing part of the above algorithm: computing \(\partial W_i \cap F_{i-1}\) in \(O(\log {|{\mathcal {T}}|})\) time. Let r be the left ray of \(\partial W_i\) and \(r_0\) be the initial point of r (i.e., the vertex of \(W_i\)). It suffices to compute \(r \cap F_{i-1}\). Recall that \(l_i\) is the line through \(a_i,b_i\) and \(l_i'\) is the line through \(a_i\) perpendicular to \(l_i\). Assume \((l_i')^*\in r\) (the case that \((l_i')^* \not \in r\) is in fact easier). The point \((l_i')^*\) partitions r into a segment \(s = [r_0,(l_i')^*]\) and a ray \(r'\) emanating from \((l_i')^*\), where \(r'\) is to the left of s. By the proof of Lemma 5.1, each wedge \(W_j\) for \(j \in \{1,\dots ,i-1\}\) with \(W_j\cap r\ne \emptyset \) satisfies at least one of the following: (1) \(r_0 \in W_j\), (2) \((l_i')^* \in W_j\), (3) \(W_j\) contains the infinite end of r. Therefore, \(r\cap F_{i-1}\) can have one or two connected components; if it has two components, one should be contained in \(r'\) and the other should be contained in s. As such, \(r'\) contains at most one left endpoint and one right endpoint of (some component of) \(r \cap F_{i-1}\), and so does s. We show that one can find these endpoints by searching in \({\mathcal {T}}\).

Suppose we want to find the left endpoint z contained in \(r'\) (assume it truly exists). Let \(\gamma \) be a fraction of \(\partial F_{i-1}\) which is contributed by \(W_j\) for \(j\in \{1,\dots ,i-1\}\). It is easy to verify that \(\gamma \) contains z iff \(\gamma \) intersects \(r'\) and \(W_j\) contains the infinite end of r. Also, \(\gamma \) is to the left of z iff \(\gamma \subseteq R\) and \(W_j\) contains the infinite end of r, where R is the region to the left of \((l_i')^*\) and above \(r'\). As such, one can simply search in \({\mathcal {T}}\) to find the fraction \(\gamma \) containing z in \(O(\log {|{\mathcal {T}}|})\) time, if z truly exists. (If z does not exist, by searching in \({\mathcal {T}}\) we can verify its non-existence, as we can never find the desired fraction \(\gamma \).) The right endpoint contained in \(r'\) and the left/right endpoints contained in s can be computed in a similar fashion. With these endpoints in hand, one can compute \(r \cap F_{i-1}\) straightforwardly. The other case that \((l_i')^* \notin r\) is handled similarly and more easily, as in this case \(r\cap F_{i-1}\) has at most one connected component. Therefore, \(r \cap F_{i-1}\) (and thus \(\partial W_i\cap F_{i-1}\)) can be computed in \(O(\log {|{\mathcal {T}}|})\) time.

Next, we consider how to construct \(\varGamma \) if we are only given the dataset S. Abam et al. [1] showed that one can compute in \(O(n \log ^2\! n)\) time an \({\mathcal {H}}^\uparrow \)-local t-spanner \(\varPsi \) of S for some \(t < 2\) such that \(|\varPsi | = O(n \log n)\); see Sect. 2 for the definition of local spanners. As argued in Sect. 2, we have \(\varPhi (S,{\mathcal {H}}^\uparrow ) \subseteq \varPsi \). Suppose \(\varPsi = \{\psi _1,\dots ,\psi _M\}\) where \(\psi _1,\dots ,\psi _M\) are sorted in increasing order of their lengths. The m candidate pairs \(\phi _1,\dots ,\phi _m \in \varPhi (S,{\mathcal {H}}^\uparrow )\) are among \(\psi _1,\dots ,\psi _M\). Let \(i_1< \ldots < i_m\) be indices such that \(\phi _1 = \psi _{i_1},\dots ,\phi _m = \psi _{i_m}\) (note that at this point we do not know what \(i_1,\dots ,i_m\) are). We shall consider \(\psi _1,\dots ,\psi _M\) in order.

When considering \(\psi _i\), we want to verify whether \(\psi _i\) is a candidate pair or not. If this can be done, the candidate pairs \(\phi _1,\dots ,\phi _m\) will be found in order. Whenever a new candidate pair \(\phi _k\) is found, we construct \(\varGamma _k\) from \(\varGamma _{k-1}\) in \(O(\log m)\) time by the approach above. Now assume \(\psi _1,\dots ,\psi _{i-1}\) are already considered, the candidate pairs in \(\{\psi _1,\dots ,\psi _{i-1}\}\) are recognized (say they are \(\phi _1,\dots ,\phi _{k-1}\)), and \(\varGamma _{k-1}\) is constructed. We then consider \(\psi _i\). We need to see whether \(\psi _i\) is a candidate pair, i.e., whether \(\psi _i = \phi _k\). Let W be the corresponding wedge of \(\psi _i\) in the dual \({\mathbb {R}}^2\). Observe that \(\psi _i = \phi _k\) iff \(W \nsubseteq \bigcup _{j=1}^{k-1} W_j\). Indeed, if \(\psi _i = \phi _k\), then \(W=W_k\) and hence \(W \nsubseteq \bigcup _{j=1}^{k-1} W_j\) (for \(\phi _k\) is a candidate pair). Conversely, if \(W \nsubseteq \bigcup _{j=1}^{k-1} W_j\), then there exists some halfplane \(H \in {\mathcal {H}}^\uparrow \) such that H contains \(\psi _i\) and does not contain \(\phi _1,\dots ,\phi _{k-1}\). Then the closest-pair in \(S \cap H\) cannot be in \(\{\psi _1,\dots ,\psi _{i-1}\}\) but must be in \(\varPsi \), hence it is nothing but \(\psi _i\).

Based on this observation, we can verify whether \(\psi _i = \phi _k\) as follows. We assume \(\psi _i = \phi _k\) and try to use it to construct \(\varGamma _k\) from \(\varGamma _{k-1}\) by our above approach. If our assumption is correct, then \(\varGamma _k\) is successfully constructed in \(O(\log m)\) time. Furthermore, in the process of constructing \(\varGamma _k\), our approach allows us to find a point in \(W\setminus \bigcup _{j=1}^{k-1} W_j\), which we call a witness point. This witness point then certifies the correctness of our assumption. On the other hand, if our assumption is wrong, the process still terminates in \(O(\log m)\) time, but we can never find such a witness point because \(W\subseteq \bigcup _{j=1}^{k-1} W_j\). In this case, we simply discard \(\psi _i\) and continue to consider \(\psi _{i+1}\). After considering all pairs in \(\varPsi \), we have recognized all the m candidate pairs and \(\varGamma = \varGamma _m\) is constructed. Since \(m = O(n)\) and \(M = |\varPsi | = O(n \log n)\), the overall process takes \(O(n\log ^2\! n)\) time.
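The recognition loop can be summarized by the following minimal sketch (illustrative only: psi_sorted, dual_wedge, try_insert_wedge, and Gamma0 are hypothetical stand-ins for the sorted spanner \(\varPsi \), the primal-to-dual wedge map, the wedge-insertion routine with witness detection, and the trivial subdivision \(\varGamma _0\)).

```python
def recognize_candidate_pairs(psi_sorted, dual_wedge, try_insert_wedge, Gamma0):
    """Sketch of the verification loop described above. try_insert_wedge
    attempts to insert a wedge W into the current subdivision and returns
    (new_Gamma, witness), with witness None iff W is already covered by the
    previously inserted wedges."""
    Gamma, candidates = Gamma0, []
    for psi in psi_sorted:
        new_Gamma, witness = try_insert_wedge(Gamma, dual_wedge(psi))
        if witness is not None:          # psi is the next candidate pair phi_k
            candidates.append(psi)
            Gamma = new_Gamma            # Gamma_k constructed
        # otherwise discard psi and keep Gamma unchanged
    return candidates, Gamma
```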

Theorem 5.3

There exists an \({\mathcal {H}}\)-RCP data structure \({\mathcal {E}}\) such that

  • For any \(S \subseteq {\mathbb {R}}^2\) of size n,

    $$\begin{aligned} { {\mathsf {Space}}}({\mathcal {E}}(S)) = O(n)\quad \text {and}\quad { {\mathsf {Qtime}}}({\mathcal {E}}(S)) = O(\log n). \end{aligned}$$
  • For a random \(S \propto R^n\) where R is the unit square or more generally an arbitrary axes-parallel rectangle,

    $$\begin{aligned} {\mathbb {E}}[{ {\mathsf {Space}}}({\mathcal {E}}(S))] = O(\log ^2\! n)\qquad \text {and}\qquad {\mathbb {E}}[{ {\mathsf {Qtime}}}({\mathcal {E}}(S))] = O(\log \log n). \end{aligned}$$

Furthermore, the above data structure can be constructed in \(O(n \log ^2\! n)\) worst-case time.

5.2 Application to the \({\mathcal {H}}\)-RSS Problem

Interestingly, our approach for solving the \({\mathcal {H}}\)-RCP problem can also be applied to the \({\mathcal {H}}\)-RSS problem, and leads to an optimal \({\mathcal {H}}\)-RSS data structure for interior-disjoint (i.e., non-crossing) segments. (Recall that the \({\mathcal {H}}\)-RSS problem is to build a data structure for a set of line segments in the plane so that the shortest one contained in a query halfplane can be reported efficiently.) This by-product is of independent interest.

Theorem 5.4

There exists an \({\mathcal {H}}\)-RSS data structure \({\mathcal {F}}\) such that for any set G of n interior-disjoint (i.e., non-crossing) segments in \({\mathbb {R}}^2\), \({ {\mathsf {Space}}}({\mathcal {F}}(G)) = O(n)\), \({ {\mathsf {Qtime}}}({\mathcal {F}}(G)) = O(\log n)\), and \({\mathcal {F}}(G)\) can be constructed in \(O(n \log n)\) time.

Proof

The data structure is basically identical to the \({\mathcal {H}}\)-RCP data structure given in Sect. 5. Let \(\sigma _1,\dots ,\sigma _n\) be the interior-disjoint segments in G sorted in increasing order of their lengths. Suppose \(W_i\) is the wedge dual to \(\sigma _i\). We successively overlay the wedges \(W_1,\dots ,W_n\) to create a subdivision \(\varGamma \) of the dual space, as we did in Sect. 5 for the candidate pairs. A point-location data structure on \(\varGamma \) is then our \({\mathcal {H}}\)-RSS data structure for G. Note that Lemma 5.1 can be applied to show \(|\varGamma | = O(n)\), because in proving Lemma 5.1 we only used the facts that the candidate pairs do not cross each other and that the wedges are inserted in increasing order of the lengths of their corresponding candidate pairs (here the segments \(\sigma _1,\dots ,\sigma _n\) are also non-crossing and sorted in increasing order of their lengths). As such, the space cost of the data structure is O(n) and the query time is \(O(\log n)\). In the previous section, we showed that if the candidate pairs are already given, our \({\mathcal {H}}\)-RCP data structure can be built in \(O(n \log n)\) time. It follows that our \({\mathcal {H}}\)-RSS data structure can be built in \(O(n \log n)\) time, as we are directly given the segments in this case. \(\square \)

6 Conclusion and Future Work

In this paper, we revisited the range closest-pair (RCP) problem, which aims to preprocess a set S of points in \({\mathbb {R}}^2\) into a data structure such that when a query range X is specified, the closest-pair in \(S \cap X\) can be reported efficiently. We proposed new RCP data structures for various query types (including quadrants, strips, rectangles, and halfplanes). Both worst-case and average-case analyses were applied to these data structures, resulting in new bounds for the RCP problem. See Table 1 for a comparison of our new results with the previous work.

We now list some open questions for future study. First, the RCP problem for other query types is an interesting open question. One important example is the disk query, which is usually much harder than the rectangle query and halfplane query in traditional range search. For an easier version, we can focus on the case where the query disks have a fixed radius, or equivalently, the query ranges are translates of a fixed disk.

Along this direction, one can also consider translation queries of some shape other than a disk. For instance, if the query ranges are translates of a fixed rectangle, can we have more efficient data structures than our rectangle RCP data structure in Sect. 4? Recently, we have achieved some results in this direction [14].

Also, the RCP problem in higher dimensions is open. To the best of our knowledge, the only known result in this direction is a simple data structure given in [6], constructed by explicitly storing all the candidate pairs, which only has guaranteed average-case performance.

Finally, one can consider a colored generalization of the RCP search, in which the given points are colored and we want to compute the closest bichromatic pair of points contained in a query range. We have studied the colored RCP problem in a very recent work [11], but only approximate data structures are achieved. It is interesting to see how to solve the colored RCP problem exactly.