1 Introduction

The approximate nearest neighbor problem, \(\epsilon \)-NN for short, can be defined as follows: given a set P of points in a metric space S equipped with a distance function D, and a query point \(q \in S\), find a point \(p\in P\) such that \(D(p, q)\le (1+\epsilon ) D(p^*, q)\), where \(p^*\) has the minimal distance to q in P. \(\epsilon \)-NN is one of the most important proximity problems in computational geometry; many other proximity problems in computational geometry, such as approximate diameter and approximate furthest neighbor, can be reduced to it [11]. \(\epsilon \)-NN is also important in many other areas, such as databases, data mining, information retrieval and machine learning.

Due to its importance, \(\epsilon \)-NN has been the subject of substantial research effort, and many algorithms for solving it have been discovered. These works can be grouped into four classes.

The first class of algorithms builds data structures that support solving \(\epsilon \)-NN efficiently. Arya et al. [4] give one such algorithm with query time \(1/\epsilon ^{O(d)}\cdot \log {n}\), space \(1/\epsilon ^{O(d)}\cdot n\) and preprocessing time \(1/\epsilon ^{O(d)}\cdot n\log {n}\). Another work [5] gives an algorithm requiring O(dn) space and \(O(dn\log {n})\) preprocessing time, but query time as high as \((d/\epsilon )^{O(d)}\cdot \log {n}\). Kleinberg proposes two algorithms in [15]. The first algorithm is deterministic and achieves query time \(O(d\log ^2{d}(d+\log {n}))\), using a data structure that requires \(O((n\log {d})^{2d})\) space and \(O((n\log {d})^{2d})\) preprocessing time. The second algorithm is a randomized version of the first one. By a preprocessing procedure that takes \(O(d^2\log ^2{d}\cdot n\log ^2{n})\) time, it reduces the storage requirement to \(O(dn\cdot \log ^3{n})\), but raises the query time to \(O(n+d\log ^3{n})\).

The second class of algorithms considers the regime \(\epsilon =d^{O(1)}\). One such algorithm is given in [6]. It can answer \(O(\sqrt{d})\)-NN in \(O(2^d\log {n})\) time with \(O(d8^dn\log {n})\) preprocessing time and \(O(d2^dn)\) space. Chan [8] improves this result by giving an algorithm that can answer \(O(d^{3/2})\)-NN in \(O(d^2\log {n})\) query time with \(O(d^2n\log {n})\) preprocessing time and \(O(dn\log {n})\) space.

The third class of work solves \(\epsilon \)-NN by exploiting some intrinsic dimension of the input point set P. An exemplary work is [16], which gives an algorithm whose query time is bounded by \(2^{O(dim(P))}\log {\varDelta }+ (1/\epsilon )^{O(dim(P))}\), where dim(P) is the intrinsic dimension of the input point set P, and \(\varDelta \) is the diameter of P.

Besides the algorithms mentioned above, Indyk et al. [14] initiate the fourth class of algorithms. The key idea is to define an Approximate Near Neighbor problem, denoted (c, r)-NN, and reduce \(\epsilon \)-NN to it. The (c, r)-NN problem can be viewed as a decision version of \(\epsilon \)-NN; its formal definition is given in Definition 2 in the next section.

To use this method to solve \(\epsilon \)-NN, two sub-problems must be considered: how to solve (c, r)-NN, and how to reduce \(\epsilon \)-NN to (c, r)-NN. Work on both sub-problems is discussed below. Our study focuses on the latter.

Algorithms to Solve (\({\varvec{c,r}}\))-NN. The existing algorithms for (c, r)-NN mainly consider the specific setting of d-dimensional Euclidean space with 1-order and 2-order Minkowski distance metrics. Each input point x is given in the form \((x_1,\cdots ,x_d)\), and the q-order Minkowski distance \(L_q\) between points x and y is given by \(D(x,y)=\left( \sum \limits _{i=1}^d{|x_i-y_i|^q}\right) ^{\frac{1}{q}}\). The 1-order and 2-order Minkowski distances are the well-known Manhattan and Euclidean distances, respectively. Another simpler setting, the (c, r)-NN problem on the Hamming cube \(\{0,1\}^d\) equipped with the Hamming distance, is usually considered in theoretical studies.

Table 1 summarizes the complexities of the existing algorithms for (c, r)-NN under Euclidean space and the \(L_1\) distance. These papers also give solutions under the \(L_2\) distance, but we omit those results due to space limitations; the complexities under the \(L_2\) distance are usually higher than those under the \(L_1\) distance. A key characteristic of the existing algorithms for (c, r)-NN is that they usually have different complexities under Minkowski distances of different orders.

Table 1. Solutions to (c, r)-NN under Euclidean space and the \(L_1\) distance.

The solutions listed in Table 1 can be divided into three groups. The first group contains the solution of [14], which is deterministic; the other two groups are randomized. The advantage of randomization is that the exponential dependence on d is removed. The second group contains the solution of [17], which is based on a random projection method proposed in [15]; a distinguishing characteristic of this method is that the data structure building stage is also randomized. The last group contains a long line of research based on Locality Sensitive Hashing (LSH), first proposed in [14]. These works are summarized in three entries in Table 1, which can be viewed as the space-time trade-off under the LSH framework.

Finally, comparing the five results in Table 1, it can be seen that the query time grows and the space requirement drops from the first to the last. Together, the five results form a general space-time trade-off among solutions to (c, r)-NN.

Reducing \(\varvec{\epsilon }\)-NN to (\({\varvec{c,r}}\))-NN. So far there are three different algorithms for such a reduction. Two of them are deterministic [12, 14], and the other one is randomized [13]. The complexities of the three reduction algorithms are summarized in Table 2. Note that the query time in Table 2 is the number of invocations of the (c, r)-NN algorithm. The preprocessing time of [14] is not given because that paper contains no such analysis.

Table 2. Comparison of three reductions.

Among the three reduction algorithms, the one proposed in [13] needs to be explained in detail. First, the algorithm outputs a point \(p'\) such that \(D(q,p')\le c(1+\gamma )^2D(q,p^*)\), where \(c=1+\epsilon \) and \(p^*\) is the exact NN of q. Second, the functions T(n, c, f), Q(n, c, f), D(n, c, f) and S(n, c, f) represent the complexities of the data structure building time, query time, update time and storage usage for (c, r)-NN, respectively. Third, the parameter f is the failure probability of one (c, r)-NN invocation, and is selected so that \(f\log {n}\) is a constant less than 1.

The fourth and most important point about [13] is the \(O(\log ^{O(1)}{n})\) query time. The algorithm given in [13] explicitly invokes (c, r)-NN \(O(\log {n})\) times, and each invocation needs T(n, c, f) time. As explained above, the parameter f, the failure probability of one (c, r)-NN invocation, is set to \(O(\frac{1}{\log {n}})\). Note that the algorithms for (c, r)-NN given in Table 1 all have constant failure probability. To meet the \(O(\frac{1}{\log {n}})\) failure probability required of one (c, r)-NN invocation, each time the algorithm in [13] invokes (c, r)-NN, an algorithm for (c, r)-NN with constant failure probability must be executed multiple times, namely \(O(\log ^{O(1)}{n})\) times in expectation. Multiplying the \(O(\log {n})\) invocations of (c, r)-NN by the \(O(\log ^{O(1)}{n})\) executions per invocation, we obtain that the algorithm in [13] actually invokes the (c, r)-NN algorithm \(O(\log ^{O(1)}{n})\) times. This observation is confirmed in [2].

Our Method. We propose a new algorithm in this paper for reducing \(\epsilon \)-NN to (c, r)-NN. Compared with the former works [12,13,14], our algorithm has the following characteristics:

  (1) It achieves \(O(\log {n})\) query time, counted as the number of invocations of the (c, r)-NN algorithm, which is superior to the other three works. This is the main contribution of this paper.

  (2) Its preprocessing time is \(O((\frac{d}{\epsilon })^d\cdot n\log {n})\), and its space complexity is \(O((\frac{d}{\epsilon })^d\cdot n)\). Our method has better complexity in terms of n than the other three works, so it is well suited to big data of low or fixed dimensionality, a setting that arises in many applications such as road networks.

  (3) In terms of parameterized complexity treating d as a constant, our result is the closest to the widely recognized optimal complexity claimed in [5], namely \(O(n\log {n})\) preprocessing time, O(n) space and \(O(\log {n})\) query time.

Note that there is an \(O((d/\epsilon )^d)\) factor in our preprocessing and space complexities. This factor originates from a lemma of [20] that we use. We point out that the upper bound \(O((d/\epsilon )^d)\) is actually very loose; there is real potential to reduce it and thus bring our result closer to optimal. In this sense, our work is more promising than the other three works. However, reducing the upper bound \(O((d/\epsilon )^d)\) is beyond the scope of this paper and is left as future work.

2 Problem Definitions and Mathematical Preparations

2.1 Problem Definitions

We focus on \(\epsilon \)-NN in Euclidean space \(R^d\). The input is a set P of n points drawn from \(R^d\) and a distance metric \(L_q\). Each point x is given in the form \((x_1,\cdots ,x_d)\), and the \(L_q\) distance between points x and y is given by \(D(x,y)=\left( \sum \limits _{i=1}^d{|x_i-y_i|^q}\right) ^{\frac{1}{q}}\).
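
For concreteness, the \(L_q\) metric can be computed directly from this definition. The following Python sketch is ours, for illustration only; the function name is an assumption, not the paper's, and the later sketches in this paper reuse it as the distance oracle D.

```python
def minkowski_distance(x, y, q=2):
    """L_q (Minkowski) distance between two points of R^d.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    assert len(x) == len(y)
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)


minkowski_distance((0.0, 0.0), (3.0, 4.0))  # 5.0 under L_2
```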

Let B(p, r) denote the d-dimensional ball centered at p with radius r, so that \(p'\in B(p,r)\) is equivalent to \(D(p',p)\le r\). We first give the definitions of the \(\epsilon \)-NN and (c, r)-NN problems.

Definition 1

(\(\epsilon \)-NN). Given a set P of points drawn from \(R^d\), a query point \(q\in R^d\), and an approximation factor \(\epsilon \), find a point \(p'\in P\) such that \(D(p',q)\le (1+\epsilon )D(p^*,q)\) where \(D(p^*,q)=\min \limits _{p\in P}\{D(p,q)\}\).

Remark 1

\(p^*\) is called the nearest neighbor (NN), or exact NN to q, and \(p'\) is called an \(\epsilon \)-NN to q.

Definition 2

((c, r)-NN). Given a set P of points drawn from \(R^d\), a query point \(q\in R^d\), a query range r, and an approximation factor \(c>1\), the (c, r)-NN problem is to design an algorithm satisfying the following:

  1. if there is a point \(p_0\in P\) satisfying \(p_0\in B(q,r)\), then return a point \(p' \in P\) such that \(p'\in B(q,c \cdot r)\);

  2. if \(D(p,q) > c\cdot r\) for all \(p \in P\), then return No.

Remark 2

There are multiple names for the problem defined above. In papers related to LSH, it is referred to as (c, r)-NN. In [14], it is called the approximate Point Location in Equal Balls problem, denoted \(\epsilon \)-PLEB, where \(\epsilon =c-1\). In more recent papers such as [12], it is called the Approximate Near Neighbor problem.
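
To make the contract of Definition 2 concrete, here is a minimal brute-force oracle satisfying it, written against the distance helper sketched above. It runs a linear scan in O(n) time per query, unlike the sublinear data structures of Table 1, and the function name is our own.

```python
def naive_cr_nn(P, q, c, r, dist=minkowski_distance):
    """Brute-force oracle meeting the (c, r)-NN contract of Definition 2.

    If some point of P lies in B(q, r), a point inside B(q, c*r) is
    returned; if every point of P is farther than c*r from q, None ("No")
    is returned.  When the nearest point lies in (r, c*r], Definition 2
    allows either answer; this oracle happens to return a point.
    """
    best = min(P, key=lambda p: dist(p, q))
    return best if dist(best, q) <= c * r else None
```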

Next we give the definition of the reduction problem to be solved in this paper, i.e., the problem of reducing \(\epsilon \)-NN to (c, r)-NN.

Definition 3

(Reduction Problem). Given a set P of points drawn from \(R^d\), a query point \(q\in R^d\), an approximation factor \(\epsilon \), and an algorithm \(\mathcal {A}\) for (c, r)-NN, the reduction problem is to find an \(\epsilon \)-NN to q by invoking the algorithm \(\mathcal {A}\) as an oracle.

Remark 3

To solve the reduction problem, a preprocessing phase is usually needed, which devises a data structure \(\mathcal {D}\) based on P. Thus the problem of reducing \(\epsilon \)-NN to (c, r)-NN is divided into two phases: a data structure building phase, or preprocessing phase, and an \(\epsilon \)-NN searching phase, or query phase. The (c, r)-NN algorithm \(\mathcal {A}\) is invoked in the query phase as an oracle, which characterizes the algorithm as a Turing reduction from \(\epsilon \)-NN to (c, r)-NN.

The time complexity of the algorithm for the reduction problem consists of two parts, namely, the preprocessing time complexity and the query time complexity. An important note is that the query time complexity is measured as the number of invocations of the (c, r)-NN algorithm \(\mathcal {A}\); this is the standard way to analyze the time complexity of a Turing reduction.

2.2 Mathematical Preparations

In this section we introduce some notation and lemmas that support the idea of our algorithm for reducing \(\epsilon \)-NN to (c, r)-NN.

Notation. Define a box \(\mathfrak {b}\) in \(R^d\) to be the product of d intervals \(I_1\times I_2\times \cdots \times I_d\), where each \(I_i\), \(1\le i\le d\), is an open, closed or semi-closed interval. A box is cubical iff all d intervals defining the box are of the same length. The side length of a cubical box, which is the length of any interval defining it, is denoted as \(len(\mathfrak {b})\). A minimal cubical box (MCB) for a point set P, denoted as MCB(P), is a cubical box that contains all the points in P and has the minimal side length. Note that MCB(P) may not be unique.
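
A minimal sketch of computing one MCB, assuming boxes are represented by coordinate intervals: the smallest feasible side length is the largest coordinate extent of P, and each defining interval can be centered on P's extent in its dimension. Since MCB(P) is not unique, this sketch simply picks the centered one; the function name is ours.

```python
def minimal_cubical_box(P):
    """Return (intervals, side) for one MCB of a non-empty point set P.

    The common side length cannot be smaller than the largest coordinate
    extent of P; each interval is padded symmetrically up to that length.
    """
    d = len(P[0])
    lo = [min(p[i] for p in P) for i in range(d)]
    hi = [max(p[i] for p in P) for i in range(d)]
    side = max(h - l for l, h in zip(lo, hi))
    intervals = [(l - (side - (h - l)) / 2.0, h + (side - (h - l)) / 2.0)
                 for l, h in zip(lo, hi)]
    return intervals, side
```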

Given a point set P and a box \(\mathfrak {b}\), let \(p\in \mathfrak {b}\) denote that a point \(p\in P\) falls inside box \(\mathfrak {b}\), and let \(|\mathfrak {b}\cap P|\) denote the number of points of P that fall inside \(\mathfrak {b}\). We write \(|\mathfrak {b}|\) for short when no ambiguity arises.

Given a collection of MCBs \(\mathcal {B}=\{\mathfrak {b}_1,\cdots ,\mathfrak {b}_m\}\), define \(D_{max}(\mathfrak {b})\), \(D_{min}(\mathfrak {b},\mathfrak {b}')\), \(D_{max}(\mathfrak {b}, \mathfrak {b}')\) as follows:

$$D_{max}(\mathfrak {b})=\max \limits _{p_1,p_2\in \mathfrak {b}}{D(p_1,p_2)},\forall \mathfrak {b}\in \mathcal {B}$$
$$D_{min}(\mathfrak {b}, \mathfrak {b}')=\min \limits _{p\in \mathfrak {b},p'\in \mathfrak {b}'}\{D(p,p')\}, \forall \mathfrak {b},\mathfrak {b}'\in \mathcal {B}$$
$$D_{max}(\mathfrak {b}, \mathfrak {b}')=\max \limits _{p\in \mathfrak {b},p'\in \mathfrak {b}'}\{D(p,p')\}, \forall \mathfrak {b},\mathfrak {b}'\in \mathcal {B}$$

With the above denotations, define \(Est(\mathfrak {b})\) as follows:

$$Est(\mathfrak {b})= \begin{cases} D_{max}(\mathfrak {b}) & \text {if } |\mathfrak {b}\cap P|\ge 2\\ \min \limits _{\mathfrak {b}'\in Nbr(\mathfrak {b})} \{D_{max}(\mathfrak {b},\mathfrak {b}')\} & \text {otherwise} \end{cases}$$
(1)

where \(Nbr(\mathfrak {b}) = \left\{ \mathfrak {b}'\mid {D_{min} (\mathfrak {b}, \mathfrak {b}')\le r}\right\} \), and the parameter r should satisfy \(r\ge Est(\mathfrak {b})\).
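
These quantities can be computed literally, in quadratic time, when a box is represented by the subset of P it contains. The sketch below assumes that representation and the distance helper from Sect. 2.1, and takes the neighboring boxes as an argument to sidestep the mutual dependence between Est and Nbr noted above; it is for exposition, not the preprocessing algorithm itself.

```python
from itertools import combinations

def d_max(box_pts, dist=minkowski_distance):
    """D_max(b): the largest pairwise distance among the points in box b."""
    return max(dist(p1, p2) for p1, p2 in combinations(box_pts, 2))

def d_max_between(a_pts, b_pts, dist=minkowski_distance):
    """D_max(b, b'): the largest distance across two boxes."""
    return max(dist(a, b) for a in a_pts for b in b_pts)

def est(box_pts, nbr_boxes, dist=minkowski_distance):
    """Est(b) per Eq. (1): the diameter of b's points if b holds at least
    two of them, otherwise the smallest D_max(b, b') over b's neighbors."""
    if len(box_pts) >= 2:
        return d_max(box_pts, dist)
    return min(d_max_between(box_pts, nb, dist) for nb in nbr_boxes)
```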

For an MCB \(\mathfrak {b}\), we associate a ball with it. Pick an arbitrary point \(c_\mathfrak {b}\in \mathfrak {b}\) and let \(r_\mathfrak {b}=Est(\mathfrak {b})\); this gives a ball \(B(c_\mathfrak {b},r_\mathfrak {b})\). It is easily verified that every point in \(\mathfrak {b}\) is within distance \(Est(\mathfrak {b})\) of \(c_\mathfrak {b}\); in other words, the ball \(B(c_\mathfrak {b}, r_\mathfrak {b})\) encloses every point in \(\mathfrak {b}\). We call \(B(c_\mathfrak {b}, r_\mathfrak {b})\) the enclosing ball of box \(\mathfrak {b}\).

Next we start to introduce the lemmas while discussing different situations of \(\epsilon \)-NN search. In the following discussion, we will assume that we have an MCB \(\mathfrak {b}\) of the input point set P, an enclosing ball \(B(c_\mathfrak {b}, r_\mathfrak {b})\) of the MCB \(\mathfrak {b}\), and a query point q.

Situation 1. The first and easiest situation is that if q is far enough from \(c_{\mathfrak {b}}\), then every point in \(\mathfrak {b}\) is an \(\epsilon \)-NN to q. The value \(T_1(\mathfrak {b})\) below specifies the threshold for being far enough, and Lemma 1 formalizes the situation just described.

Definition 4

For an MCB \(\mathfrak {b}\) of a point set P, define \(T_1(\mathfrak {b}) =(1+2/\epsilon )r_\mathfrak {b}\).

Lemma 1

If \(D(q,c_\mathfrak {b})\ge T_1(\mathfrak {b})\), then every point in \(\mathfrak {b}\) is an \(\epsilon \)-NN to q.

Proof

See the full paper [18] for the proof.    \(\square \)

Situation 2. If q is not as far from \(c_\mathfrak {b}\) as \(T_1(\mathfrak {b})\), i.e., \(D(q,c_{\mathfrak {b}}) < T_1(\mathfrak {b})\), then we split \(\mathfrak {b}\) into a set of sub-boxes \(\{\mathfrak {b}_1,\cdots , \mathfrak {b}_m\}\) and calculate the enclosing ball \(B(c_{\mathfrak {b}_i},r_{\mathfrak {b}_i})\) of each box \(\mathfrak {b}_i\), \(1\le i\le m\). The next situation is that if q is still far enough from each of the centers \(\{c_{\mathfrak {b}_1},\cdots , c_{\mathfrak {b}_m}\}\) of the enclosing balls, then we can still conclude that every point in \(\mathfrak {b}\) is an \(\epsilon \)-NN to q. We give another threshold \(T_2(\mathfrak {b})\) based on this idea and formalize it as Lemma 2, which also gives the quantitative relationship between \(T_2(\mathfrak {b})\) and \(T_1(\mathfrak {b})\).

Definition 5

For an MCB \(\mathfrak {b}\) of a point set P, split \(\mathfrak {b}\) into a set of sub-boxes \(\{\mathfrak {b}_1,\cdots , \mathfrak {b}_m\}\), each of which is an MCB of a point set \(P'\subset P\). Let \(B(c_{\mathfrak {b}_i},r_{\mathfrak {b}_i})\) be the enclosing ball of sub-box \(\mathfrak {b}_i\), \(1\le i \le m\), and define \(rmax_\mathfrak {b} = \max \limits _{i}{\{r_{\mathfrak {b}_i}\}}\). In case \(|\mathfrak {b}|=1\), let \(rmax_\mathfrak {b}=0\).

Definition 6

Define \(T_2(\mathfrak {b})=r_\mathfrak {b} +(1+2/\epsilon )rmax_\mathfrak {b}\).

Lemma 2

We have the following statements:

  1. if \(D(q,c_\mathfrak {b})\ge T_2(\mathfrak {b})\), then every point in \(\mathfrak {b}\) is an \(\epsilon \)-NN to q;

  2. if \(rmax_\mathfrak {b} < \frac{2}{2+\epsilon }r_\mathfrak {b}\), then \(T_2 (\mathfrak {b}) < T_1(\mathfrak {b})\).

Proof

See the full paper [18] for the proof.    \(\square \)
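
As a worked check of the second statement: if \(rmax_\mathfrak {b} < \frac{2}{2+\epsilon }r_\mathfrak {b}\), then \(T_2(\mathfrak {b}) = r_\mathfrak {b} +(1+2/\epsilon )rmax_\mathfrak {b} < r_\mathfrak {b} + (1+2/\epsilon )\cdot \frac{2}{2+\epsilon }r_\mathfrak {b} = (1+2/\epsilon )r_\mathfrak {b} = T_1(\mathfrak {b})\). The two thresholds are trivial to compute; the helpers below (our names, reused by the query sketch in Sect. 3.2) record them together with a numeric spot-check.

```python
def t1(r_b, eps):
    """T1(b) = (1 + 2/eps) * r_b  (Definition 4)."""
    return (1.0 + 2.0 / eps) * r_b

def t2(r_b, rmax_b, eps):
    """T2(b) = r_b + (1 + 2/eps) * rmax_b  (Definition 6)."""
    return r_b + (1.0 + 2.0 / eps) * rmax_b

# Spot-check of Lemma 2's second statement at eps = 0.5:
eps, r_b = 0.5, 1.0
rmax_b = 0.99 * (2.0 / (2.0 + eps)) * r_b   # just below the bound
assert t2(r_b, rmax_b, eps) < t1(r_b, eps)  # T2 = 4.96 < T1 = 5.0
```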

Situation 3. If q is still not as far from \(c_\mathfrak {b}\) as \(T_2(\mathfrak {b})\), it is time to ask the algorithm for (c, r)-NN for help. Let \(\mathcal {A}(Q,\mathfrak {q},c,r)\) be any algorithm for solving (c, r)-NN, where Q is the input point set, \(\mathfrak {q}\) is the query point, r is the query range, and c is the approximation factor; the meanings of these four parameters are given in Definition 2. The goal of invoking \(\mathcal {A}\) is that if \(\mathcal {A}\) answers No, then still every point in \(\mathfrak {b}\) is an \(\epsilon \)-NN to q. The following lemma shows how to set the four input parameters to achieve this goal.

Lemma 3

Let \(\mathcal {A}(Q,\mathfrak {q},c,r)\) be any algorithm for (c, r)-NN. We have the following statements:

  1. if we set \(Q=\{c_{\mathfrak {b}_1},\cdots ,c_{\mathfrak {b}_m}\}\), \(\mathfrak {q}=q\), \(r=\max \limits _{i} \{T_2 (\mathfrak {b}_i)\}\), let c satisfy \(c\cdot r= \max \limits _{i}\{T_1 (\mathfrak {b}_i)\}\), and invoke \(\mathcal {A}(Q,\mathfrak {q},c,r)\), then if \(\mathcal {A}\) returns No, we can pick any point in \(\mathfrak {b}\) as the answer of \(\epsilon \)-NN to q;

  2. if \(rmax_{\mathfrak {b}_i} <\frac{2}{2+\epsilon }r_{\mathfrak {b}_i}\) holds for each \(\mathfrak {b}_i\), \(1\le i\le m\), then our settings for c and r satisfy the requirement of the (c, r)-NN problem definition, i.e., \(c>1\).

Proof

See the full paper [18] for the proof.    \(\square \)

Situation 4. As stated in Lemma 3, if the algorithm \(\mathcal {A}\) returns No, then the search for an \(\epsilon \)-NN terminates, returning an arbitrary point in \(\mathfrak {b}\). According to Definition 2, \(\mathcal {A}\) may also return some point \(c_{\mathfrak {b}_i}\in Q\) instead of No, in which case the search must continue. At first glance, the same procedure should be carried out recursively, applying Lemmas 1, 2 and 3 one by one on box \(\mathfrak {b}_i\), where the point \(c_{\mathfrak {b}_i}\) returned by \(\mathcal {A}\) is the center of the enclosing ball of box \(\mathfrak {b}_i\). However, to guarantee that the algorithm returns a correct \(\epsilon \)-NN, the box considered by the algorithm must enclose the exact NN \(p^*\), and the box \(\mathfrak {b}_i\) may not enclose \(p^*\), which would ruin the correctness of the algorithm. Thus we need to expand the search range to the boxes near \(\mathfrak {b}_i\). The following Lemma 4 bounds the search range and ensures that \(p^*\) lies within it.

Definition 7

For a collection of MCBs \(\mathcal {B}=\{\mathfrak {b}_1,\cdots , \mathfrak {b}_m\}\), let \(rmax_{\mathfrak {b}_i}\) be defined as in Definition 5 for each \(\mathfrak {b}_i\), \(1\le i\le m\). Then define \(rmax_{\mathcal {B}}=\max \limits _{\mathfrak {b}_i\in \mathcal {B}}\{rmax_{\mathfrak {b}_i}\}\).

Definition 8

Define \(Nbr(\mathfrak {b})\) as

$$\left\{ \mathfrak {b}'\in \mathcal {B} \mid {D(c_{\mathfrak {b}'},c_{\mathfrak {b}}) \le (3+4\epsilon )rmax_{\mathcal {B}_{\mathfrak {b}_\mathrm {s}}}} \right\} $$

where \(\mathcal {B}_{\mathfrak {b}_\mathrm {s}}= Nbr(\mathfrak {b}_\mathrm {s})\) and \(\mathfrak {b}_\mathrm {s}\) is the super-box of \(\mathfrak {b}\).

Remark 4

The definition of the Nbr sets is recursive: for a box \(\mathfrak {b}\), the set \(Nbr(\mathfrak {b})\) is defined based on the set \(Nbr(\mathfrak {b}_s)\) of its super-box \(\mathfrak {b}_s\). This requires that the boxes be split recursively, which can be represented as a tree structure. The formal description of the tree structure is given in Sect. 3.1.
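
Since the pseudocode of Algorithm 2 is given only as a figure, the following sketch records one reading of how \(Nbr(\mathfrak {b})\) can be filled per Definition 8, drawing candidates from the sub-boxes of the boxes in \(Nbr(\mathfrak {b}_\mathrm {s})\), consistent with Lemma 7 in Sect. 3.1. The BoxNode record and all names are our assumptions, not the paper's data structure.

```python
from dataclasses import dataclass, field

@dataclass
class BoxNode:
    """Hypothetical node of the box split tree T (see Sect. 3.1)."""
    center: tuple                   # c_b, center of the enclosing ball
    r_b: float = 0.0                # Est(b), the enclosing-ball radius
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)
    nbr: list = field(default_factory=list)

def fill_nbr(b, b_s, eps, dist=minkowski_distance):
    """Fill Nbr(b) per Definition 8, b_s being the super-box of b.

    rmax over Nbr(b_s) is recomputed here for clarity; Algorithm 2 keeps
    it in a heap so it is available in constant time (see Sect. 3.1).
    """
    rmax = max((c.r_b for nb in b_s.nbr for c in nb.children),
               default=0.0)                       # Definition 7
    radius = (3.0 + 4.0 * eps) * rmax
    b.nbr = [cand for nb in b_s.nbr for cand in nb.children
             if dist(cand.center, b.center) <= radius]
```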

Lemma 4

Given the query point q and a collection of boxes \(\{\mathfrak {b}_1,\cdots ,\mathfrak {b}_m\}\), if we find a box \(\mathfrak {b}_i\) satisfying \(D(q, c_{\mathfrak {b}_i})\le \max \limits _{i} \{T_1 (\mathfrak {b}_i)\}\), then the nearest neighbor of q lies in, and can only lie in, \(Nbr(\mathfrak {b}_i)\), i.e., \(p^*\in Nbr(\mathfrak {b}_i)\).

Proof

See the full paper [18] for the proof.    \(\square \)

This completes the mathematical preparations. In the next section we propose our algorithm based on the lemmas given above.

3 Algorithms

In this section we propose our algorithm for reducing \(\epsilon \)-NN to (c, r)-NN, including the preprocessing and query algorithms.

3.1 Preprocessing

Our preprocessing algorithm consists mainly of two sub-procedures: one builds the box split tree T, and the other constructs the Nbr sets.

Building the Box Split Tree. We first give the definition of the box split tree.

Definition 9

(Box split tree). Given a point set P and its MCB \(\mathfrak {b}_P\), a tree T is a box split tree iff:

  1. the root of T is \(\mathfrak {b}_P\);

  2. each non-root node of T is an MCB of a point set \(P'\subset P\);

  3. if box \(\mathfrak {b}'\) is a sub-box of \(\mathfrak {b}\), then there is an edge between the node for \(\mathfrak {b}\) and the node for \(\mathfrak {b}'\) in T;

  4. each node has at least 2 child nodes, and at most |P| child nodes;

  5. \(rmax_{\mathfrak {b}}< \frac{2}{2+\epsilon } r_{\mathfrak {b}}\) holds for each box \(\mathfrak {b}\) in T.

Further, T is fully built iff each box at the leaf nodes of T contains only one point.

Remark 5

The fifth term is required by the second statement of Lemma 3.

Algorithm 1. Building the box split tree (pseudocode given as a figure in the original).

We use a box split method to build the box split tree. This method was originally proposed in [20] and has also been used in several other papers [7, 10]. It starts from the MCB \(\mathfrak {b}_P\) of the point set P, then repeatedly splits \(\mathfrak {b}_P\) into a collection \(\mathcal {B}\) of cubical boxes until each box in \(\mathcal {B}\) contains only one point. The method proceeds in a series of split steps. In each split step, the box \(\mathfrak {b}_L\) with the largest side length in the current collection \(\mathcal {B}\) is chosen and split. Define \(h_i(\mathfrak {b})\) to be the hyperplane orthogonal to the i-th coordinate axis and passing through the center of \(\mathfrak {b}\). One split step splits \(\mathfrak {b}_L\) into at most \(2^d\) sub-boxes using all the hyperplanes \(h_i(\mathfrak {b}_L)\); each resulting sub-box is an MCB. The set of non-empty sub-boxes generated by conducting one split step on \(\mathfrak {b}\) is denoted as \(Succ(\mathfrak {b})\). The details of the box split method can be found in [20].
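
Under the point-set representation used in our earlier sketches, one split step can be sketched as follows: cut through the center of the current MCB with the d hyperplanes \(h_i\), group the points by the resulting orthant, and shrink each non-empty group back to an MCB. This reuses minimal_cubical_box from Sect. 2.2 and assumes pairwise distinct points; the actual implementation details are in [20].

```python
def split_step(box_pts):
    """One split step on the MCB of box_pts: returns Succ(b) as a list of
    (points, (intervals, side)) pairs, at most 2^d of them, each group
    shrunk back to an MCB of the points it contains."""
    d = len(box_pts[0])
    intervals, _ = minimal_cubical_box(box_pts)
    center = [(lo + hi) / 2.0 for lo, hi in intervals]
    groups = {}
    for p in box_pts:
        orthant = tuple(p[i] >= center[i] for i in range(d))  # side of h_i
        groups.setdefault(orthant, []).append(p)
    return [(pts, minimal_cubical_box(pts)) for pts in groups.values()]
```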

Next we describe how to use this method to build the box split tree T. The main obstacle is to satisfy the fifth term in Definition 9, i.e., \(rmax_\mathfrak {b}< \frac{2}{2+\epsilon }r_\mathfrak {b}\) for each box \(\mathfrak {b}\) in T. We use the following techniques to solve this problem.

When a split step is executed and a box \(\mathfrak {b}\) is split, we temporarily store the sub-boxes of \(\mathfrak {b}\) in a max-heap \(H_\mathfrak {b}\), ordered on the side lengths of the boxes in the heap. Recall from Sect. 2.2 that the side length of a box \(\mathfrak {b}\) is denoted as \(len(\mathfrak {b})\). When box \(\mathfrak {b}\) has been split finely enough that \(rmax_\mathfrak {b}< \frac{2}{2+\epsilon }r_\mathfrak {b}\) is satisfied, the algorithm creates a node for each \(\mathfrak {b}'\in H_\mathfrak {b}\) and hangs it under the node for box \(\mathfrak {b}\) in the box split tree T. Then for each box \(\mathfrak {b}'\) at these newly created leaf nodes, a max-heap is created to store its sub-boxes. In overview, a max-heap is maintained for each box at a leaf node of the box split tree.

In each split step, the box with the largest side length is split. To pick out this box efficiently, a secondary heap \(H_2\) is maintained; the heaps for the leaf nodes are called the primary heaps in contrast. The elements of \(H_2\) are the top elements of the primary heaps, each with a pointer to its corresponding primary heap. Apparently the top element \(\mathfrak {b}_{top}\) of \(H_2\) is the box with the largest side length overall. When \(\mathfrak {b}_{top}\) is picked, the primary and secondary heaps pop it simultaneously. Then \(\mathfrak {b}_{top}\) is split by conducting one split step on it, generating \(Succ(\mathfrak {b}_{top})\), and these sub-boxes are added into the primary heap where \(\mathfrak {b}_{top}\) formerly resided. Once this primary heap has been updated, its top element is inserted into the secondary heap, and the iteration continues.

We point out the last problem that must be solved to satisfy the fifth term in Definition 9. The heaps, both primary and secondary, are organized according to the len values of the boxes, in order to retrieve the box with the largest side length. On the other hand, the condition \(rmax_\mathfrak {b}<\frac{2}{2+\epsilon }r_\mathfrak {b}\) is based on the Est values of the boxes, because here we have \(rmax_\mathfrak {b}=\max \limits _{\mathfrak {b}'\in H_\mathfrak {b}}{\{Est(\mathfrak {b}')\}}\). Notice that the top element \(\mathfrak {b}_{top}\) of a primary heap has the largest len value, but may not have the largest Est value, so we cannot directly check \(r_{\mathfrak {b}_{top}}< \frac{2}{2+\epsilon }r_\mathfrak {b}\) to decide whether \(\mathfrak {b}\) has been split finely enough. Fortunately, the len and Est values of a box obey a definite quantitative relationship, which is formalized in the following lemma.

Lemma 5

For the MCB \(\mathfrak {b}\) of any point set P with \(|\mathfrak {b}|\ge 2\), we have \(len(\mathfrak {b})\le Est(\mathfrak {b})\le d\cdot len(\mathfrak {b})\). In the case \(|\mathfrak {b}|=1\), we redefine \(len(\mathfrak {b})=Est(\mathfrak {b})\) to keep this inequality consistent.

Proof

See the full paper [18] for the proof.    \(\square \)

With the help of Lemma 5, we have the following Lemma 6 about the criteria for deciding whether a box is split fine enough.

Lemma 6

For box \(\mathfrak {b}\) and its primary heap \(H_\mathfrak {b}\), if the top element \(\mathfrak {b}_{top}\) satisfies \(len(\mathfrak {b}_{top})< \frac{2}{(2+\epsilon )d}len(\mathfrak {b})\), then \(rmax_\mathfrak {b}<\frac{2}{2+\epsilon } r_\mathfrak {b}\).

Proof

See the full paper [18] for the proof.    \(\square \)
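
Putting Lemma 6 to work, the following simplified sketch refines a single box until the stopping criterion holds; the paper's Algorithm 1 instead drives all leaves at once through the secondary heap \(H_2\). It reuses split_step and minimal_cubical_box from the earlier sketches, assumes pairwise distinct points (so singleton boxes, whose side length is 0, never reach the top of the heap), and all names are ours.

```python
import heapq
from itertools import count

def refine_children(box_pts, eps):
    """Split a box b (given by its >= 2 points) until every sub-box in its
    primary heap has side length below 2/((2+eps)*d) * len(b), which by
    Lemma 6 guarantees rmax_b < 2/(2+eps) * r_b (term 5 of Definition 9).
    Returns the point sets of the boxes that become b's children in T."""
    d = len(box_pts[0])
    _, len_b = minimal_cubical_box(box_pts)
    threshold = 2.0 / ((2.0 + eps) * d) * len_b
    tie = count()               # tiebreaker so point lists are never compared
    heap = []                   # max-heap on side length via negated keys
    for pts, (_, side) in split_step(box_pts):
        heapq.heappush(heap, (-side, next(tie), pts))
    while -heap[0][0] >= threshold:       # top element has the largest side
        _, _, pts = heapq.heappop(heap)
        for sub, (_, side) in split_step(pts):
            heapq.heappush(heap, (-side, next(tie), sub))
    return [pts for _, _, pts in heap]
```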

The pseudocode for building the box split tree is given in Algorithm 1. The algorithm also invokes Algorithm 2, which maintains the Nbr sets and will be introduced in the next section.

\({\varvec{Nbr}}\) Sets Maintenance. Algorithm 2 for maintaining \(Nbr(\mathfrak {b})\) is given below. It is invoked in each iteration of the main loop of Algorithm 1, as shown above.

Algorithm 2. Maintaining the Nbr sets (pseudocode given as a figure in the original).

There are two parts of Algorithm 2 that need to be explained in detail.

The first is Line 4. From Definition 8 for \(Nbr(\mathfrak {b})\), we can see that the maintaining of \(Nbr(\mathfrak {b})\) is based on the value \(rmax_{\mathcal {B}_{\mathfrak {b}_\mathrm {s}}}\) passed down by its super-box \(\mathfrak {b}_\mathrm {s}\). In Algorithm 2, Line 4 is aimed for updating \(rmax_{\mathfrak {b}_\mathrm {s}}\) when the set of sub-boxes of \(\mathfrak {b}_\mathrm {s}\) is changed. If \(Nbr(\mathfrak {b})\) is implemented as a heap, then whenever any sub-box of \(\mathfrak {b}_\mathrm {s}\) needs \(rmax_{\mathcal {B}_{\mathfrak {b}_\mathrm {s}}}\), this value can be retrieved from the heap in constant time.

The other part is the second foreach loop in Algorithm 2. The functionality of the loop is explained in the following Lemma 7.

Lemma 7

The second foreach loop ensures that for every box \(\mathfrak {b}\) in the box split tree T, each \(\mathfrak {b}'\in Nbr(\mathfrak {b})\) is either at the same level as \(\mathfrak {b}\), or a degenerate box containing only one point.

Proof

See the full paper [18] for the proof.    \(\square \)

3.2 Query

The query algorithm goes down the tree T returned by Algorithm 1 level by level. At each level of T, the algorithm \(\mathcal {A}\) for (c, r)-NN is invoked, with its input parameters set according to Lemma 3. The pseudocode is given in Algorithm 3.

Algorithm 3. Query processing (pseudocode given as a figure in the original).

Some effort is needed to explain the termination condition in Algorithm 3. We first introduce a lemma about \(Nbr(\mathfrak {b})\) when \(|\mathfrak {b}|=1\).

Lemma 8

For a box \(\mathfrak {b}\) satisfying \(|\mathfrak {b}|=1\), every box \(\mathfrak {b}'\in Nbr(\mathfrak {b})\) contains only one point, i.e., \(|\mathfrak {b}'|=1\).

Proof

See the full paper [18] for the proof.    \(\square \)

According to the above lemma, when the WHILE loop breaks, all boxes in \(Nbr(\mathfrak {b}_c)\) contain only one point. Thus the brute-force search takes \(O(|Nbr(\mathfrak {b}_c)|)\) time; we bound this quantity in the next section.
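
Since Algorithm 3's pseudocode is likewise given as a figure, the sketch below records one reading of the query loop, reusing BoxNode, t1, t2 and the distance helper from the earlier sketches: descend T level by level, call \(\mathcal {A}\) with the parameter settings of Lemma 3, jump to \(Nbr(\mathfrak {b}_i)\) per Lemma 4 on a hit, and finish by brute force once Lemma 8 applies. The helper names and loop details are assumptions, not the paper's code.

```python
def rmax_of(b):
    """rmax_b per Definition 5: max enclosing-ball radius of b's sub-boxes."""
    return max((c.r_b for c in b.children), default=0.0)

def query(root, q, eps, A, dist=minkowski_distance):
    """Find an eps-NN of q by descending the box split tree rooted at root.

    A(Q, q, c, r) is any algorithm meeting Definition 2, e.g. naive_cr_nn
    from Sect. 2.1.  root.nbr is assumed to be initialized to [root].
    """
    boxes = [root]                                     # Nbr of the current box
    while any(b.children for b in boxes):
        subs = [c for b in boxes for c in b.children]  # next level's boxes
        r = max(t2(s.r_b, rmax_of(s), eps) for s in subs)
        cr = max(t1(s.r_b, eps) for s in subs)         # Lemma 3: c * r = max T1
        hit = A([s.center for s in subs], q, cr / r, r)
        if hit is None:              # Lemma 3: any point of the box will do
            return boxes[0].points[0]
        b_i = next(s for s in subs if s.center == hit)
        boxes = b_i.nbr                                # Lemma 4: p* in Nbr(b_i)
    # Lemma 8: every remaining box is a singleton; brute-force over them
    return min((b.points[0] for b in boxes), key=lambda p: dist(p, q))
```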

4 Analysis

4.1 Correctness

We prove the correctness of our algorithm by introducing the following lemma.

Lemma 9

In every execution of the loop body, Algorithm 3 ensures that the exact nearest neighbor \(p^*\in P_c\) after the assignment of \(P_c\) (Line 8).

Proof

See the full paper [18] for the proof.    \(\square \)

Theorem 1

(Correctness). The point \(p'\) returned by Algorithm 3 is an \(\epsilon \)-NN to q in P, i.e., if \(p^*\) is the exact NN to q in P, then \(D(q,p')\le (1+\epsilon ) D(q,p^*)\).

Proof

See the full paper [18] for the proof.    \(\square \)

4.2 Complexities

Before we bound the complexity of our algorithm, we should first bound the size of \(Nbr(\mathfrak {b})\) for any box \(\mathfrak {b}\) by introducing a lemma from [20].

Lemma 10

([20]). Let r be a positive number. During the execution of the split method described in Sect. 3.1, each time before a box is split, let \(\mathcal {B}\) be the current box collection, and let \(\mathfrak {b}_L\) be the box with the largest volume in \(\mathcal {B}\). For any box \(\mathfrak {b}\in \mathcal {B}\), the size of the set \(\{\mathfrak {b}'\in \mathcal {B} \mid {D_{min}(\mathfrak {b},\mathfrak {b}')\le r\cdot Est(\mathfrak {b}_L) }\}\) is at most \(2^d (2d\lceil r\rceil +3)^d\).

Based on the lemma above, we can bound the size of \(Nbr(\mathfrak {b})\) for any box \(\mathfrak {b}\) in the box split tree T constructed in Algorithm 1.

Lemma 11

The size of \(Nbr(\mathfrak {b})\) defined in Definition 8 and constructed in Algorithm 2 is \(O((\frac{d}{\epsilon })^d)\).

Proof

See the full paper [18] for the proof.    \(\square \)

We introduce another lemma, about the properties of the box split tree T constructed in the preprocessing phase.

Lemma 12

For a point set P with \(|P|=n\), the fully built box split tree T constructed from P has the following properties:

  1. There are at most 2n nodes in T.

  2. The total time to build T is \(O(dn\log {n})\).

Proof

See the full paper [18] for the proof.    \(\square \)

Now we start to prove the complexities of our algorithm, including preprocessing time, space and query time complexities.

Theorem 2

(Preprocessing Time Complexity). The complexity of Algorithm 1 for preprocessing is \(O((\frac{d}{\epsilon })^d\cdot n\log {n})\).

Proof

See the full paper [18] for the proof.    \(\square \)

Theorem 3

(Space Complexity). The space complexity of Algorithm 1 is \(O((\frac{d}{\epsilon })^d\cdot n)\).

Proof

See the full paper [18] for the proof.    \(\square \)

Theorem 4

(Query Time Complexity). Algorithm 3 invokes the algorithm \(\mathcal {A}\) for the (c, r)-NN problem \(O(\log {n})\) times.

Proof

See the full paper [18] for the proof.    \(\square \)

5 Conclusion

In this paper we proposed a new algorithm for reducing the \(\epsilon \)-NN problem to the (c, r)-NN problem. Compared with the former works on the same reduction problem, our algorithm achieves the lowest query time complexity, namely \(O(\log {n})\) invocations of the algorithm for the (c, r)-NN problem. We carefully designed the input parameters of each invocation and built a dedicated data structure in the preprocessing phase to support the query procedure. A box split method proposed in [20] is used as a building block of the preprocessing algorithm. Our paper also raises the problem of reducing the exponential complexity in d introduced by the box split method, which is left as our future work.