1 Introduction

Given two object sets \(P\) and \(Q\), a k-closest pair \((k\hbox {CP})\) query finds \(k\) closest object pairs from \(P\times Q\) according to a certain similarity metric (e.g., \(L_1\)-norm, \(L_2\)-norm, \(L_{\infty }\)-norm, edit distance). Consider, for example, Fig. 1, in which \(P=\{p_1, p_2, p_3\}\) and \(Q=\{q_1,q_2,q_3,q_4\}\). Assuming that the similarity metric between two objects is the Euclidean distance (i.e., \(L_2\)-norm), the final result of a 2CP \((k=2)\) query is \(\{\langle p_2, q_2\rangle ,\,\langle p_2, q_1 \rangle \}\). The \(k\hbox {CP}\) query has received considerable attention from the database community, due to its importance in a wide spectrum of applications, such as GIS [17, 18, 24, 39, 40], data mining [6, 21, 30], and so forth. We give two representative examples here.

Fig. 1 Example of a \(k\hbox {CP}\) query

Application 1 \((GIS)\) \(\hbox {M}k\hbox {CP}\) queries are helpful in many decision support and demanding data-handling problems. Consider the efficient scheduling of tourist facilities: \(\hbox {M}k\hbox {CP}\) retrieval can be employed to retrieve the \(k\) closest pairs of cultural landmarks and populated places. In this query, the traveling distance should take the underlying road network into account, i.e., it equals the length of the shortest path connecting the cultural landmark and the populated place.

Application 2 (Data Mining) \(\hbox {M}k\hbox {CP}\) search can be used as a fundamental building block for a large number of data mining tasks, such as clustering [21, 30], outlier detection [6], and so on. For instance, many clustering algorithms (e.g., Chameleon, C2P) can be improved by including \(\hbox {M}k\hbox {CP}\) retrieval as a primitive operation. It is worth mentioning that, in some clustering applications [2, 34], the similarity of clusters may be measured by the edit (a.k.a. Levenshtein) distance.

Given the flexible distance metrics in the above scenarios (e.g., network distance, edit distance), a model more generic than one tailored to a specific distance measure is required. Consequently, in this paper, we aim at efficient methods for \(k\hbox {CP}\) retrieval in metric spaces, which require no detailed representations of objects and can support any similarity metric satisfying the triangle inequality.

However, as discussed in Sect. 2, the existing solutions for \(k\hbox {CP}\) queries are insufficient, for the three reasons below. First, most approaches [17, 18, 25, 35, 40, 45, 52] are applicable only to the Euclidean space, where \(k\hbox {CP}\) search can be accelerated by exploiting various geometric properties (such as MBRs [17, 18, 40] and plane-sweep [18, 35, 40]) to effectively prune the search space. Unfortunately, the geometric characteristics employed in the Euclidean space cannot be applied to generic metric spaces, since complex data (e.g., strings, documents) have no obvious Euclidean modeling. Second, existing similarity join algorithms in general metric spaces [22, 27, 32, 33, 42], which find object pairs whose distances are within a distance threshold \(\epsilon \), cannot handle \(k\hbox {CP}\) search efficiently, because it is difficult to choose a proper value of \(\epsilon \). For a smaller \(\epsilon \) value, similarity joins cannot return \(k\) closest object pairs, resulting in false misses; for a larger \(\epsilon \) value, similarity joins return more than \(k\) closest object pairs, incurring significant query cost. Third, although a little work [29, 32] on \(k\hbox {CP}\) queries in metric spaces exists, it is insufficient. Specifically, [32] only supports the join of two fixed datasets and achieves its efficiency only for datasets with a large overlap percentage, which limits its applicability. [29] utilizes the divide-and-conquer technique, which requires high computational cost for dividing, and its efficiency degrades rapidly as the scale of the input datasets increases. Moreover, both methods only focus on main-memory techniques, which cannot be efficiently applied to datasets that do not fit in memory. Hence, to achieve the scalability required by database applications, we assume that the datasets are indexed by external-memory indexes (e.g., the M-tree and the PM-tree), as such indexes can further speed up search.

Motivated by these observations, in this paper, we study the problem of efficient \(k\hbox {CP}\) query processing in general metric spaces, namely Metric kCP \((\hbox {M}k\hbox {CP})\) search. Intuitively, a straightforward solution is to compare every object pair from the two datasets. This approach, nevertheless, is very inefficient because, for every object in an object set \(P\), it needs to traverse the other object set \(Q\) once, incurring huge I/O and CPU costs. In order to answer \(\hbox {M}k\hbox {CP}\) retrieval efficiently, two challenging issues have to be addressed.

Challenge I: How to minimize I/O overhead in terms of the number of node/page accesses? The I/O cost is an important metric in database search algorithms. We try to handle this issue from two aspects: avoiding unnecessary node accesses and reducing duplicated node accesses. We attempt to develop several effective pruning rules to avoid unqualified node accesses. In addition, we utilize LRU buffers and combine the best-first and depth-first traversal paradigms, in order to reduce duplicated node accesses.

Challenge II: How to minimize CPU cost in terms of the number of distance computations? In most applications, the distance is the dominant complexity measurement in metric spaces, and it is customary to just count the number of computed distances for comparing algorithms. Toward this, we exploit the properties of the metric space (e.g., the triangle inequality) and employ the aggressive pruning and compensation technique, in order to eliminate unqualified distance computations.

Based on these, we present, using the dynamic metric index M-tree [14], three algorithms for \(\hbox {M}k\hbox {CP}\) search together with a high-accuracy cost model. Note that the proposed algorithms can also be extended to any other tree-structured metric index (e.g., the PM-tree [44]). Specifically, the first two algorithms follow the depth-first and best-first traversal paradigms, respectively, and utilize pruning rules to efficiently handle \(\hbox {M}k\hbox {CP}\) queries. To further improve query efficiency, the third algorithm follows a hybrid traversal paradigm, which combines the depth-first and best-first strategies, and uses the aggressive pruning and compensation technique based on an estimated \(k\hbox {CP}\) distance value. In addition, we extend our techniques to tackle two natural variants of \(\hbox {M}k\hbox {CP}\) queries, i.e., (1) Self \(\hbox {M}k\hbox {CP}\,(\hbox {SM}k\hbox {CP})\) search, which performs \(\hbox {M}k\hbox {CP}\) retrieval on a single dataset, and (2) Approximate \(\hbox {M}k\hbox {CP}\,(\hbox {AM}k\hbox {CP})\) search, which trades the quality of the \(\hbox {M}k\hbox {CP}\) query result for search time.

In brief, the key contributions of this paper are summarized as follows:

  • We propose several pruning rules, use the aggressive pruning and compensation technique, and combine best-first and depth-first traversal paradigms, in order to reduce I/O and CPU costs.

  • We develop three efficient algorithms for processing the \(\hbox {M}k\hbox {CP}\) query and then analyze their corresponding correctness and I/O overhead.

  • We derive a node-based cost model for \(\hbox {M}k\hbox {CP}\) retrieval using M-trees.

  • We extend our techniques to handle two interesting variants of \(\hbox {M}k\hbox {CP}\) queries efficiently.

  • Extensive experiments with both real and synthetic data sets demonstrate the performance of our proposed algorithms, the effectiveness of our presented pruning rules, and the accuracy of our derived cost model.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 formalizes the problem and reveals its characteristics, presents the Count M-tree (COM-tree) used to speed up \(\hbox {M}k\hbox {CP}\) search, and proposes a series of pruning rules. Section 4 elaborates \(\hbox {M}k\hbox {CP}\) query processing algorithms based on COM-trees and devises a cost model for \(\hbox {M}k\hbox {CP}\) retrieval. Section 5 extends our techniques to solve two interesting \(\hbox {M}k\hbox {CP}\) query variations. Extensive experimental results and our findings are reported in Sect. 6. Finally, Sect. 7 concludes the paper with some directions for future work.

2 Related work

In this section, we overview the existing work related to \(\hbox {M}k\hbox {CP}\) queries, including \(k\hbox {CP}\) queries in Euclidean spaces, the M-tree, and querying metric spaces.

2.1 Euclidean \(k\hbox {CP}\) queries

The existing algorithms for \(k\hbox {CP}\) retrieval mainly target the Euclidean space and can be classified into two categories, namely incremental algorithms and non-incremental algorithms.

The first category [25, 39, 40] is the incremental alternative, which returns the result in ascending order of distances in a pipelined fashion. Hjaltason and Samet [25] present incremental algorithms that can be used with a large class of hierarchical spatial data structures. Shin et al. [39, 40] improve the incremental \(k\)-distance join algorithm by utilizing bidirectional node expansion and plane-sweep techniques for fast pruning of distant pairs; the plane-sweep is further optimized by novel strategies for selecting a good sweeping axis and direction. In addition, an adaptive multistage algorithm is developed for the incremental distance join when \(k\) is not known in advance.

The second category [17, 18, 52] is the non-incremental alternative, which assumes that \(k\) is known in advance and reports the result all together at the end. Corral et al. [17, 18] propose one pruning heuristic and two updating strategies for minimizing the pruning distance. Based on the pruning heuristic, three non-incremental branch-and-bound external-memory algorithms are developed, with two of them following the depth-first search strategy and one obeying the best-first policy. The plane-sweep technique and access ordering are used to further improve query efficiency. Yang and Lin [52] present a new index structure called the bichromatic-Rdnn (b-Rdnn) tree, which utilizes the information about nearest neighbors to process \(k\hbox {CP}\) queries efficiently. In particular, the b-Rdnn tree built on an object set \(P\) preserves, for each object in \(P\), its nearest neighbor in an answer object set \(Q\). However, to build the b-Rdnn tree on \(P\), we need to know \(Q\) in advance, which limits its applicability.

Recently, Cheema et al. [9] propose a unified method for top-\(k\) pairs, which allows users to define a local scoring function for each attribute and a global scoring function that computes the final score of each pair by combining its scores on different attributes. Nonetheless, the framework relies on local scoring functions defined on individual attributes, which do not exist in generic metric spaces. Roumelis et al. [35] propose a new plane-sweep \(k\hbox {CP}\) search algorithm, termed Reverse Run Plane-Sweep, which minimizes the Euclidean and sweeping-axis distance computations.

In addition, many variants of \(k\hbox {CP}\) queries have been investigated in the literature. In order to further reduce \(k\hbox {CP}\) query cost, some work [3, 20, 45] focuses on approximate \(k\hbox {CP}\) search, which forsakes some precision of the query result in exchange for improved efficiency. Angiulli and Pizzuti [3] study an approximate algorithm for the top-\(k\) closest pairs join query, which employs the space-filling curve to establish an order between the points in the space, and performs at most \((d+1)\) sorts and scans of the two data sets, where \(d\) denotes the dimensionality. Corral and Vassilakopoulos [20] apply a combination of approximation techniques, e.g., \(\alpha \)-approximate, N-consider, or Time-consider, in the same query algorithm, i.e., a hybrid approximation scheme. Tao et al. [45] utilize the LSB technique to solve closest pair search in high-dimensional spaces, which can be accomplished in (worst-case) time significantly lower than quadratic complexity, while still ensuring very good quality. Other interesting \(k\hbox {CP}\) query variants have also been well studied, such as self \(k\hbox {CP}\) queries [18], constrained \(k\hbox {CP}\) search [31, 38], exclusive \(k\hbox {CP}\) retrieval [48], non-index-based \(k\hbox {CP}\) search [24], top-\(k\) set similarity join [51], \(k\hbox {CP}\) queries on moving objects [4, 54] and spatial networks [10], respectively, cost models for \(k\hbox {CP}\) queries [19], performance comparisons of \(k\hbox {CP}\) queries [16, 28], to name but a few.

Nonetheless, all the above approaches are unsuitable for \(\hbox {M}k\hbox {CP}\) search because, for boosting the query, they make use of some geometric properties (e.g., MBR [17, 18, 39, 40], plane-sweep [18, 35, 39, 40], and space-filling curve [3, 45]) that are not available in generic metric spaces.

2.2 The M-tree

Many indexing techniques for metric spaces have been proposed [8, 26, 36, 44]. However, following most approaches in the relevant literature [11, 46, 49, 53], we assume in this paper that the M-tree [14], an external-memory metric index, is used as the underlying index structure. The M-tree is a balanced tree that can handle dynamic operations with reasonable costs, without requiring periodic restructuring.

Figure 2 depicts an example of an M-tree for an object set \(O=\{o_1,o_2,\ldots ,o_9\}\). An intermediate (i.e., a non-leaf) entry \(e\) in a root node (e.g., \(N_0\)) or an intermediate node (e.g., \(N_1,\,N_2\)) records: (1) A routing object \(e.RO\), which is a selected object in the subtree \(ST_e\) of \(e\). (2) A covering radius \(e.r\), which is the maximal distance between \(e.RO\) and the objects in \(ST_e\). (3) A parent distance \(e.\textit{PD}\), which equals the distance from \(e.RO\) to the routing object of the parent entry \(e_p\). Since a root entry \(e\) (e.g., \(e_6\)) has no parent entry, \(e.\textit{PD}=\infty \). (4) An identifier \(e.ptr\), which corresponds to the address of the root node of \(ST_e\). For example, in Fig. 2, the non-leaf entry \(e_4\) is associated with the routing object \(o_8\,(=e_4.RO)\), its covering radius \(e_4.r\) is \(r_4\), its parent distance \(e_4.\textit{PD}\) corresponds to the distance \(d(o_8,\,o_6)\,(=d(e_4.RO,\,e_6.RO))\), and its identifier \(e_4.ptr=N_6\). On the other hand, a leaf entry \(o\) in a leaf node (e.g., \(N_3,\,N_4,\,N_5,\,N_6\)) records: (1) The object \(o\) itself, which stores its detailed information. (2) An identifier \(oid\), which represents \(o\)'s identifier. (3) A parent distance \(o.\textit{PD}\), which equals the distance between \(o\) and the routing object in \(o\)'s parent entry. For instance, as shown in Fig. 2, the parent distance \(o_9.\textit{PD}\) of the object \(o_9\) is \(r_4\,(=d(o_8,\,o_9))\), while \(o_8.\textit{PD}=0\) because \(o_8\) is the routing object of its parent entry \(e_4\).
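To make this layout concrete, below is a minimal Python sketch of the two entry types just described. The field names mirror the paper's notation, while the `Node` container and its fields are hypothetical placeholders rather than part of the original M-tree specification.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:
    """A hypothetical M-tree node: a disk page holding leaf or non-leaf entries."""
    is_leaf: bool
    entries: List[Any] = field(default_factory=list)

@dataclass
class NonLeafEntry:
    """Intermediate entry e (stored in the root or an intermediate node)."""
    ro: Any       # routing object e.RO: a selected object in the subtree ST_e
    r: float      # covering radius e.r: max distance from e.RO to objects in ST_e
    pd: float     # parent distance e.PD: d(e.RO, parent entry's RO); infinity at the root
    ptr: Node     # identifier e.ptr: address of the root node of ST_e

@dataclass
class LeafEntry:
    """Leaf entry o (stored in a leaf node)."""
    obj: Any      # the object itself, with its detailed information
    oid: int      # the object's identifier
    pd: float     # distance from the object to its parent entry's routing object
```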

Fig. 2 Example of an M-tree. a The dataset placement. b The M-tree

2.3 Querying metric spaces

A metric space is a space with an associated distance metric, which obeys certain simple properties. Several spatial queries in metric spaces have been investigated.

Except for Euclidean \(k\hbox {CP}\) queries, two alternatives for \(\hbox {M}k\hbox {CP}\) search are explored in [29, 32]. Paredes and Reyes [32] answer \(\hbox {M}k\hbox {CP}\) retrieval based on a new metric index, termed the List of Twin Clusters (LTC), which maintains two lists of overlapping clusters to index both object sets jointly. Thus, it cannot handle the case where the two object sets come from different datasets, each with its own independent metric index. In addition, it is only efficient for object sets with a large overlap percentage, as verified in Sect. 6.2. Kurasawa et al. [29] propose a divide-and-conquer-based \(k\hbox {CP}\) query method in metric spaces, called Adaptive Multi-Partitioning (AMP), which repeatedly divides and conquers the objects from the sparser distance distribution space, and speeds up the convergence of the upper bound distance before partitioning the denser space. Each time, AMP randomly chooses a pivot \(o\) to divide the object set \(P\) into regions: \(P_0=\{p\in P \mid 0 \le d(o, p) < t_0\},\,P_1 = \{p \in P \mid t_0 \le d(o, p) < t_1\}, \ldots , P_i =\{p \in P \mid t_{i-1} \le d(o, p) < t_i\}, \ldots , P_n = \{p \in P \mid t_{n-1} \le d(o, p) < t_n\}\), where \(|t_i - t_{i-1}| > u\) and \(u\) is the upper bound distance of the \(k\hbox {CP}\). Similarly, \(Q\) can also be partitioned by the pivot \(o\), as shown in Fig. 3. Although AMP can avoid distance computations between \(\langle P_i, Q_j \rangle \) with \(|j-i|>1\), the CPU cost required for dividing is still high and increases rapidly with the scale of the input datasets, as verified in our experiments. Moreover, the aforementioned two approaches only aim at in-memory techniques and hence cannot support external-memory \(\hbox {M}k\hbox {CP}\) query processing efficiently.
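For illustration, the following is a rough Python sketch of AMP's pivot-based ring partitioning as described above; `dist` and the threshold sequence are assumed inputs, and the real AMP additionally adapts the thresholds and recurses on the denser regions.

```python
import random

def amp_partition(P, dist, thresholds):
    """Rough sketch of AMP's ring partitioning around a random pivot.

    thresholds is the ascending sequence t_0 < t_1 < ... < t_n with
    |t_i - t_{i-1}| > u (u: current kCP upper bound distance); t_n is
    assumed to cover the maximum distance from the pivot.
    """
    pivot = random.choice(P)
    regions = [[] for _ in thresholds]      # regions[i] holds P_i
    for p in P:
        d = dist(pivot, p)
        for i, t in enumerate(thresholds):  # first ring whose upper bound exceeds d
            if d < t:
                regions[i].append(p)
                break
    return pivot, regions
```

Partitioning \(Q\) with the same pivot yields rings \(Q_j\); since the rings are wider than \(u\), the triangle inequality gives \(d(p, q) \ge \vert d(o, p) - d(o, q)\vert > u\) for any \(p \in P_i,\,q \in Q_j\) with \(|j-i|>1\), so such cross pairs can be skipped.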

Fig. 3 Illustration of Adaptive Multi-Partitioning (AMP)

Similarity search in metric spaces, including range and nearest neighbor (NN) queries, has been well studied and summarized in [8, 26, 36]. Ciaccia and Patella [13] propose an approximation algorithm for \(k\hbox {NN}\) search, in which the error bound \(\varepsilon \) can be exceeded within a certain probability \(\delta \) using the distance distribution of the query point. Zezula et al. [53] introduce three different approximation methods over M-trees, where the relative error is not bounded by any function of the input parameters. In addition, cost models [5, 12, 15, 47] are also derived for metric similarity queries. More recently, Vlachou et al. [49] present a framework for distributed metric similarity search, where each participating peer preserves its own data using an M-tree.

A similarity join retrieves the object pairs with their distances bounded by a distance threshold \(\epsilon \). This operation has also been well studied in metric spaces, and there are many efficient solutions, as overviewed in [27]. Recently, Paredes and Reyes [32] handle similarity joins using LTC, which indexes both sets jointly. Silva and Pearson [33, 41, 42] develop a non-blocking similarity join database operator DBSimJoin, and study index-based similarity joins. Fredriksson and Braithwaite [22] improve the Quicksort algorithm [27] for similarity joins. In addition, similarity joins using Map-Reduce have also been studied [37, 50]. However, none of the above solutions can handle \(\hbox {M}k\hbox {CP}\) search efficiently, because it is difficult to choose a proper value of \(\epsilon \). For a smaller \(\epsilon \) value, a similarity join cannot return \(k\) closest object pairs, resulting in false misses. For a larger \(\epsilon \) value, a similarity join returns more than \(k\) closest object pairs, incurring significant query cost. As an example, we would have to run the similarity join algorithm twice to obtain the accurate result set [22], which is costly.

In addition, two other spatial queries, i.e., reverse \(k\hbox {NN}\) and skyline queries, are also studied in metric spaces. In particular, metric reverse \(k\hbox {NN}\) search [1, 46] finds the objects in a given object set that have a specified query object \(q\) as one of their \(k\) NNs, and metric skyline query [11, 23, 43] retrieves the objects not dominated by any other object with respect to all query objects in a generic metric space.

3 Preliminaries

In this section, we first formally define metric \(k\hbox {CP}\,(\hbox {M}k\hbox {CP})\) search, then introduce the count M-tree (COM-tree), and finally present several pruning rules and lemmas that facilitate the development of efficient \(\hbox {M}k\hbox {CP}\) search algorithms. Table 1 summarizes the notations used frequently throughout this paper.

Table 1 Symbols and description

3.1 Problem formulation

A metric space is a tuple \((D,\,d)\), where \(D\) is the domain of feature values, and \(d\) is a distance function used to compare objects in \(D\). The distance function must satisfy the four properties below: (1) symmetry: \(d(p,\,q)=d(q,\,p)\); (2) non-negativity: \(d(p, q)\ge 0\); (3) identity: \(d(p,q)=0\) iff \(p=q\); and (4) triangle inequality: \(d(p,q)\le d(p,o)+d(o,q)\). Based on these properties, we formalize the \(\hbox {M}k\hbox {CP}\) query.

Definition 1

(\({ M}k{ CP}\) Search). Given two object sets \(P\) and \(Q\) in a generic metric space, and an integer \(k\,(1 \le k \le |P| \times |Q|)\), a metric k-closest pair \((\hbox {M}k\hbox {CP})\,query\) finds \(k\) ordered different closest object pairs from \(P\times Q\), i.e., \(MkCP(P,\,Q)=\{\langle p_1, q_1 \rangle , \langle p_2, q_2 \rangle , \ldots , \langle p_k, q_k \rangle \vert p_1\), \(p_2,\ldots , p_k \in P, q_1, q_2,\ldots ,q_k \in Q\), \(\langle p_i, q_i \rangle \ne \langle p_j, q_j \rangle , i \ne j, 1 \le i, j \le k\), and \(\forall (p', q') \in P \times Q - \{\langle p_1, q_1 \rangle , \langle p_2, q_2 \rangle , \ldots , \langle p_k, q_k \rangle {\}}\) such that \(d(p', q') \ge d(p_k, q_k) \ge \cdots \ge d(p_1, q_1)\}\).

Consider two English word sets \(P\) = {“till,” “thrill”} and \(Q\) = {“ill,” “doll,” “nila”}, and suppose the edit distance is used to measure the similarity between two words. If \(k=2\), \(M2CP(P, Q)\) = {\(\langle \)“till,” “ill”\(\rangle \), \(\langle \)“till,” “doll”\(\rangle \)}.

Based on Definition 1, \(MkCP(P,\,Q)\) may not be unique due to distance ties. However, the goal of our proposed algorithms is to find one possible instance. Thus, we choose object pairs randomly when distance ties occur.
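As a sanity check, the example above can be reproduced with the brute-force baseline (the quadratic approach our algorithms are designed to avoid). This is our own sketch: the Levenshtein helper is the standard dynamic program, and the tie at distance 2 between \(\langle \)“till,” “doll”\(\rangle \) and \(\langle \)“till,” “nila”\(\rangle \) is broken arbitrarily by the sort, consistent with the random tie-breaking above.

```python
from itertools import product

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def brute_force_mkcp(P, Q, k, d):
    """Baseline MkCP: rank the whole Cartesian product (Definition 1)."""
    return sorted(product(P, Q), key=lambda pq: d(*pq))[:k]

print(brute_force_mkcp(["till", "thrill"], ["ill", "doll", "nila"], 2, edit_distance))
# [('till', 'ill'), ('till', 'doll')] -- ('till', 'nila') is tied at distance 2
```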

In this paper, we study the problem of \(\hbox {M}k\hbox {CP}\) retrieval. For simplicity, in all the illustrative examples in the rest of this paper, we assume that the metric distance function is the \(L_2\)-norm (i.e., Euclidean distance). In order to minimize the query cost, we introduce the COM-tree, based on which we develop two pruning heuristics, as described in Sects. 3.2 and 3.3, respectively.

3.2 The count M-tree

To enable search space pruning, we propose a variant of the M-tree, termed the COUNT M-tree (COM-tree), which includes in each intermediate entry \(e\) a counter \(e.num\) representing the number of objects contained in \(ST_e\). For ease of understanding, the COM-tree on the object set depicted in Fig. 2a is illustrated in Fig. 4, where the number associated with every non-leaf entry denotes \(e.num\). For instance, \(e_6.num=4\) as \(ST_{e_6}\) contains four objects \(o_6,\,o_7,\,o_8,\,o_9\), and \(e_4.num=2\) since \(ST_{e_4}\) includes two objects \(o_8\) and \(o_9\). Note that these counts are computed and stored during the construction of the COM-tree.
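The counts can be filled in with one bottom-up pass at construction time; below is a minimal sketch assuming the node and entry classes sketched in Sect. 2.2, where `num` is the COM-tree counter attached to each non-leaf entry.

```python
def fill_counts(node):
    """One bottom-up pass filling e.num for every intermediate entry.

    Returns the number of objects in the subtree rooted at node.
    """
    if node.is_leaf:
        return len(node.entries)
    total = 0
    for e in node.entries:           # e is a NonLeafEntry
        e.num = fill_counts(e.ptr)   # count objects in ST_e
        total += e.num
    return total
```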

Fig. 4 Example of a COM-tree

As \(\hbox {M}k\hbox {CP}\) search acts on the entry pairs from the COM-trees over \(P\) and \(Q\) (i.e., \(M_P\) and \(M_Q\)), the minimal distance and the maximal distance between entry pairs, which can accelerate the search space pruning, are defined as follows.

Definition 2

(Mindist). Given two intermediate entries \(E_P\) and \(E_Q\), \(mindist(E_P,\,E_Q)=\hbox {max}\{d(E_P.RO,\,E_Q.RO)-E_P.r-E_Q.r,0\}\).

Definition 3

(Maxdist). Given two intermediate entries \(E_P\) and \(E_Q\), \(maxdist(E_P,\,E_Q)=d(E_P.RO,\,E_Q.RO)+E_P.r+ E_Q.r\).

Consider, for example, Fig. 5, in which the red solid line represents \(mindist(E_P, E_Q)\), and the red dotted line denotes \(maxdist(E_P, E_Q)\). Based on Definitions 2 and 3, \(mindist(E_P,E_Q)\) and \(maxdist(E_P, E_Q)\) offer the lower and upper bounds of the distances between object pairs (from \(E_P\) and \(E_Q\)), respectively.
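Both bounds translate directly into code; the sketch below assumes the entry fields from Sect. 2.2 and a metric `d`. The two bounds share the single routing-object distance \(d(E_P.RO, E_Q.RO)\), so computing them together costs one distance computation.

```python
def mindist(ep, eq, d):
    """Lower bound on d(p, q) for p in ST_ep, q in ST_eq (Definition 2)."""
    return max(d(ep.ro, eq.ro) - ep.r - eq.r, 0.0)

def maxdist(ep, eq, d):
    """Upper bound on d(p, q) for p in ST_ep, q in ST_eq (Definition 3)."""
    return d(ep.ro, eq.ro) + ep.r + eq.r
```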

Fig. 5 Example for \(mindist\) and \(maxdist\)

3.3 Pruning heuristics

Querying metric spaces is inherently more difficult than querying Euclidean spaces, due to the lack of geometric properties. In the sequel, we devise alternative pruning rules based on an intuitive observation: an entry pair can be pruned if it cannot contain any qualified object pair.

Rule 1

Given two leaf or non-leaf entries \(E_P\) and \(E_Q\), if \(mindist(E_P,E_Q)>maxC\textit{PD}_k\), with \(maxC\textit{PD}_k\) denoting the upper bound distance of the \(k\hbox {th}\) closest pair, \(\langle E_P, E_Q\rangle \) can be discarded safely.

Proof

The proof is straightforward according to the definitions of \(mindist(E_P, E_Q)\) and \(maxC\textit{PD}_k\). \(\square \)

Consider the example shown in Fig. 6, and assume that the current \(maxC\textit{PD}_k = 20\). Then \(\langle E_{P3}, E_{Q3} \rangle \) can be discarded as \(mindist(E_{P3}, E_{Q3}) > maxC\textit{PD}_k\). However, every application of Rule 1 needs one distance computation to calculate \(mindist\). To avoid this, we give the definition of \(emindist\) in Definition 4, which utilizes the triangle inequality to avoid unnecessary distance computations.

Fig. 6 Example for Lemmas 1 and 2

Definition 4

(Emindist). Given two intermediate entries \(E_P\) and \(E_Q\), and suppose \(PE_P\,(PE_Q)\) is the parent entry of \(E_P\,(E_Q)\), then \(emindist(E_P, E_Q)\) is defined as

$$\begin{aligned} emindist(E_P, E_Q) = \left\{ \begin{array}{ll} d_{PE_{Q}} &{} \hbox {only}\,d(E_P.RO, PE_Q.RO)\,\hbox {is known} \\ d_{PE_{P}} &{} \hbox {only}\,d(E_Q.RO, PE_P.RO)\,\hbox {is known} \\ \max \{{{d}_{PE_Q}},{{d}_{PE_P}}\} &{} \hbox {both are known} \\ \end{array} \right. \end{aligned}$$

where \(d_{PE_Q}=|d(E_P.RO,PE_Q.RO)-E_Q.\textit{PD}|-E_P.r-E_Q.r\) and \(d_{PE_P}=|d(E_Q.RO, PE_P.RO) - E_P.\textit{PD}|-E_P.r-E_Q.r\), respectively.

Due to the triangle inequality, \(|d(E_P.RO,\,PE_Q.RO)-E_Q.\textit{PD}|\) and \(|d(E_Q.RO,\,PE_P.RO)-E_P.\textit{PD}|\) are no larger than \(d(E_P.RO,E_Q.RO)\), and thus, \(emindist(E_P,\,E_Q) \le mindist(E_P,\,E_Q)\), i.e., \(emindist\) provides a lower bound of \(mindist\). Take Fig. 6 as an example. If only \(d(E_Q.RO, PE_P.RO)\,(= d(q_3, p_1))\) is known, \(emindist\,(E_{P3},E_{Q3})=d(q_3,p_1)-d(p_1, p_6) - E_{P3}.r - E_{Q3}.r = 29.8\), which is smaller than \(mindist(E_{P3},\,E_{Q3})\). Based on Definition 4, we present a new pruning rule as follows.
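A sketch of the bound in Python, assuming the entry fields from Sect. 2.2: whichever of the two routing-object distances are already known are passed in, so the function itself computes no new distances, which is exactly what enables Rule 2 below.

```python
def emindist(ep, eq, d_ep_peq=None, d_eq_pep=None):
    """Distance-free lower bound on mindist(ep, eq) (Definition 4).

    d_ep_peq = d(ep.RO, PE_Q.RO) and d_eq_pep = d(eq.RO, PE_P.RO) are
    supplied only if already known from earlier computations.
    """
    bounds = []
    if d_ep_peq is not None:
        bounds.append(abs(d_ep_peq - eq.pd) - ep.r - eq.r)
    if d_eq_pep is not None:
        bounds.append(abs(d_eq_pep - ep.pd) - ep.r - eq.r)
    # with neither distance known the bound is vacuous and Rule 2 cannot fire
    return max(bounds) if bounds else float("-inf")
```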

Rule 2

Given two leaf or non-leaf entries \(E_P\) and \(E_Q\), if \(emindist(E_P,\,E_Q) > maxC\textit{PD}_k\), then \(\langle E_P,E_Q\rangle \) can be pruned away safely.

Proof

According to the definition of \(emindist\), if \(emindist\,(E_P,\,E_Q)>maxC\textit{PD}_k\), \(mindist(E_P,\,E_Q)\ge emindist\,(E_P,\,E_Q) > maxC\textit{PD}_k\) holds. Therefore, \(\langle E_P, E_Q \rangle \) can be discarded safely due to Rule 1. \(\square \)

An example of Rule 2 is depicted in Fig. 6. Assuming that the current \(maxC\textit{PD}_k=20\), \(\langle E_{P3},E_{Q3}\rangle \) can be pruned as \(emindist(E_{P3}, E_{Q3}) > maxC\textit{PD}_k\). Note that, although Rule 2 has weaker pruning power than Rule 1, it does not need any distance computation, as all the values required to compute \(emindist\) are directly available.

To deploy Rules 1 and 2, we must derive values of \(maxC\textit{PD}_k\). Although its value is not unique, \(maxC\textit{PD}_k\) should be as small as possible to achieve strong pruning power. A simple way to obtain \(maxC\textit{PD}_k\) is to use the distance of the \(k\)th closest object pair retrieved so far during \(\hbox {M}k\hbox {CP}\) query processing. However, we can obtain a tighter \(maxC\textit{PD}_k\) using intermediate entry pairs, in order to shrink the search space as early as possible.

Lemma 1

Given two intermediate entries \(E_P\) and \(E_Q\), if \(k=1\), \(maxC\textit{PD}_1\) can be set to \(d(E_P.RO, E_Q.RO)\); and if \(k > 1\), \(maxC\textit{PD}_k\) can be set to

$$\begin{aligned} \min \left\{ \begin{array}{ll} d({{E}_{P}}.RO,{{E}_{Q}}.RO)+{{E}_{P}}.r &{} E_{P}.num\ge k \\ d({{E}_{P}}.RO,{{E}_{Q}}.RO)+{{E}_{Q}}.r &{} E_{Q}.num\ge k \\ maxdist({{E }_{P}},{{E}_{Q}}) &{} E_{P}.num\times {{E}_{Q}}.num\ge k \\ \end{array} \right. \end{aligned}$$

Proof

To prove this lemma, it suffices to show that \(maxC\textit{PD}_{k}\) can be updated with a value \(r\) whenever there are \(k\) different object pairs whose distances are bounded by \(r\).

For \(k=1\), since the routing object of an intermediate entry is a real object, there exists an object pair \(\langle E_{P}.RO,\,E_{Q}.RO\rangle \) with its distance bounded by \(d(E_{P}.RO,\,E_{Q}.RO)\). Thus, \(maxC\textit{PD}_{1}\) can be set to \(d(E_{P}.RO,\,E_{Q}.RO)\).

For \(k>1\), as shown in Fig. 5, there are (1) \(E_{P}\).num object pairs \(\langle p_{i},\,E_{Q}.RO\rangle \,(p_{i}\in E_{P})\) with their distances \(d(p_{i},\,E_{Q}.RO)\le d(E_{P}.RO,E_{Q}.RO)+d(p_{i},\,E_{P}.RO)\le d(E_{P}.\,RO,\,E_{Q}.RO)+E_{P}.r\) (according to the triangle inequality), (2) \(E_{Q}.num\) object pairs \(\langle q_{j},\,E_{P}.RO\rangle \) (\(q_{j} \in E_{Q}\)) with their distances bounded by \(d(E_{P}.RO,\,E_{Q}.RO)+E_{Q}.r\), and (3) \(E_{P}.num\times E_{Q}.num\) object pairs \(\langle p_{i},\,q_{j}\rangle \) with their distances bounded by \(maxdist(E_{P},\,E_{Q})\). Consequently, \(maxC\textit{PD}_{k}\) can be set as the corresponding minimum value according to \(k\), and the proof completes. \(\square \)

Consider the example illustrated in Fig. 6, where the two intermediate entries \(E_{P2}\) and \(E_{Q2}\) have \(E_{P2}.num=2\) and \(E_{Q2}.num=2\). We can set \(maxC\textit{PD}_{4}\,(k=4)\) to \({ maxdist}(E_{P2},\,E_{Q2})\) based on Lemma 1. Hence, \(\langle E_{P3},\,E_{Q3}\rangle \) can be pruned by Rule 1, as \(mindist(E_{P3},\,E_{Q3})> maxC\textit{PD}_{4}\); whereas \(\langle E_{P2}\), \(E_{Q3}\rangle \) cannot be discarded since \(mindist(E_{P2},\,E_{Q3})<maxC\textit{PD}_{4}\). Note that Lemma 1 considers only one intermediate entry pair. If we take into account all subentry pairs in the same parent entry pair, the value of \(maxC\textit{PD}_{k}\) can be tighter, as stated in Lemma 2.
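A minimal sketch of the Lemma 1 update, assuming `d_ro` \(= d(E_P.RO, E_Q.RO)\) has already been computed (e.g., for \(mindist\)) and the COM-tree `num` counters are available:

```python
def update_maxcpd_lemma1(maxcpd, ep, eq, d_ro, k):
    """Tighten maxCPD_k from one intermediate entry pair (Lemma 1).

    d_ro = d(ep.RO, eq.RO) is reused from the mindist computation,
    so no extra distance computation is incurred.
    """
    if k == 1:
        return min(maxcpd, d_ro)
    if ep.num >= k:
        maxcpd = min(maxcpd, d_ro + ep.r)
    if eq.num >= k:
        maxcpd = min(maxcpd, d_ro + eq.r)
    if ep.num * eq.num >= k:
        maxcpd = min(maxcpd, d_ro + ep.r + eq.r)   # maxdist(ep, eq)
    return maxcpd
```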

Lemma 2

Given two sets of intermediate entries \(E_{Pi}\,(1\le i \le m)\) and \(E_{Qj}\,(1 \le j \le n)\), which are subentries of the parent entries \(PE_P\) and \(PE_Q\), respectively, sort \(d(E_{Pi}.RO,E_{Qj}.RO)\) and \(maxdist(E_{Pi}, E_{Qj})\) in ascending order to obtain two ordered sequences \(d(E_{Pt}.RO,\,E_{Qt}.RO)\) and \(maxdist(E_{Pt}\), \(E_{Qt})\) (\(1 \le t \le mn\)). If \(k>1\), \(maxC\textit{PD}_k\) can be set to

$$\begin{aligned} \min \left\{ \begin{array}{ll} d(E_{Pk}.RO,E_{Qk}.RO) &{} mn\ge k \\ maxdist(E_{Px},E_{Qx}) &{} \sum \limits _{1 \le t \le x} E_{Pt}.num \times E_{Qt}.num\ge k,\ x \le mn \end{array}\right. \end{aligned}$$

Proof

Let \(r\) be the minimum of \(d(E_{Pk}.RO,E_{Qk}.RO)\) and \(maxdist(E_{Px},\,E_{Qx})\). As in Lemma 1, to prove this lemma, we need to find \(k\) different object pairs whose distances are bounded by \(r\).

Since the routing object of an intermediate entry is a real object and \(d(E_{Pt}.RO,\,E_{Qt}.RO)\,(1\le t\le mn)\) is sorted in ascending order, there are \(k\) object pairs \(\langle E_{Pt}.RO, E_{Qt}.RO \rangle \,(1\le t \le k)\) with their distances bounded by \(d(E_{Pk}.RO,E_{Qk}.RO)\) if \(mn \ge k\); and if the \(x\)th \((x\le mn)\) value \(maxdist(E_{Px},\,E_{Qx})\) satisfies the condition that the total number of object pairs contained in \(\langle E_{Pt}\), \(E_{Qt}\rangle \,(1 \le t \le x)\) is at least \(k\), there exist at least \(k\) object pairs with their distances bounded by \(maxdist(E_{Px},\,E_{Qx})\). Therefore, \(maxC\textit{PD}_{k}\) can be set to the corresponding minimum value according to \(k\), and the proof completes. \(\square \)

Returning to the example shown in Fig. 6, \(E_{Pi}\,(1\le i\le 3)\) and \(E_{Qj}\,(1\le j\le 3)\) are subentries of \(PE_{P}\) and \(PE_{Q}\), respectively. To utilize Lemma 2, we sort \(d(E_{Pi}.RO,\,E_{Qj}.RO)\) and \(maxdist(E_{Pi},\,E_{Qj})\) in ascending order, and then obtain two ordered sequences \(d(p_{4},\,q_{5})\), \(d(p_{1},\,q_{5})\), \(d(p_{4},\,q_{1})\), \(d(p_{1},\,q_{1}), \ldots , d(p_{6},\,q_{3})\); \({ maxdist}(E_{P2},\,E_{Q2})\), \({ maxdist}(E_{P2},\,E_{Q1})\), \(maxdist(E_{P1},\,E_{Q2}),\ldots ,maxdist(E_{P3},\,E_{Q3})\). Then, \(maxC\textit{PD}_{4}\) can be set to the fourth distance \(d(p_{1}\), \(q_{1})\), which is smaller than \(maxdist(E_{P2},\,E_{Q2})\). Thus, \(\langle E_{P2},\,E_{Q3}\rangle \) can be pruned by Rule 1, as \(mindist(E_{P2},\,E_{Q3})>d(p_{1},\,q_{1})\) holds, whereas it cannot be discarded if we only employ Lemma 1 to update \(maxC\textit{PD}_{4}\).

When expanding \(\langle PE_{P}\), \(PE_{Q}\rangle \), its subentry pairs \(\langle E_{Pi}\), \(E_{Qj}\rangle \) are processed one by one. In order to utilize Lemma 2 without incurring additional distance computations, instead of calculating \(maxC\textit{PD}_{k}\) using all subentry pairs at once, we update \(maxC\textit{PD}_{k}\) gradually with the processed subentry pairs whose \(mindist\hbox {s}\) are already computed. This is because both \(d(E_{Pi}.RO,\,E_{Qj}.RO)\) and \(maxdist(E_{Pi},\,E_{Qj})\) can be easily obtained when computing \(mindist(E_{Pi},\,E_{Qj})\).
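The incremental update can be sketched as follows. The class name and interface are ours; `feed` is meant to be called once per processed subentry pair with values that are by-products of its \(mindist\) computation.

```python
import bisect

class Lemma2Updater:
    """Incremental maxCPD_k tightening over subentry pairs of one parent pair."""
    def __init__(self, k):
        self.k = k
        self.ro_dists = []    # sorted routing-object distances seen so far
        self.maxdists = []    # sorted (maxdist, #object pairs covered)

    def feed(self, d_ro, maxd, num_pairs):
        """Register one processed subentry pair <E_Pi, E_Qj>.

        d_ro = d(E_Pi.RO, E_Qj.RO), maxd = maxdist(E_Pi, E_Qj),
        num_pairs = E_Pi.num * E_Qj.num.
        """
        bisect.insort(self.ro_dists, d_ro)
        bisect.insort(self.maxdists, (maxd, num_pairs))
        return self.bound()

    def bound(self):
        cands = []
        if len(self.ro_dists) >= self.k:        # k-th smallest routing distance
            cands.append(self.ro_dists[self.k - 1])
        covered = 0
        for maxd, n in self.maxdists:           # smallest maxdist covering >= k pairs
            covered += n
            if covered >= self.k:
                cands.append(maxd)
                break
        return min(cands) if cands else float("inf")
```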

4 \(\hbox {M}k\hbox {CP}\) query processing

In this section, we elaborate three efficient algorithms for \(\hbox {M}k\hbox {CP}\) search, assuming that \(P\) and \(Q\) are indexed by COM-trees, and then present a cost model for \(\hbox {M}k\hbox {CP}\) retrieval. In the sequel, a running example (shown in Fig. 7) is employed to facilitate the understanding of different \(\hbox {M}k\hbox {CP}\) search algorithms. Specifically, Fig. 7 shows the COM-trees \(M_{P}\) and \(M_{Q}\) for the object sets \(P=\{p_{1},p_{2},\ldots ,p_{9}\)} and \(Q=\{q_{1},q_{2},\ldots ,q_{9}\}\), respectively, and the final result set \(S_{R}=\{\langle p_{5},q_{9}\rangle ,\langle p_{4}, q_{9}\rangle \}\) for an M2CP \((k=2)\) query.

Fig. 7 A running example for an M2CP \((k=2)\) query. a The placement of \(P\) and \(Q\). b The COM-trees

4.1 Recursive \(\hbox {M}k\hbox {CP}\) algorithm

Based on the pruning rules and lemmas presented in Sect. 3, we propose, as depicted in Algorithm 1, the Recursive MkCP Algorithm (RMA) for \(\hbox {M}k\hbox {CP}\) retrieval, which follows a depth-first paradigm. First of all, \(maxC\textit{PD}_{k}\) is set to infinity (line 1). Then, RMA updates \(maxC\textit{PD}_{k}\) using Lemmas 1–2, and prunes the root entry pairs (line 2). Next, it calls the function RMA-PEP for each root entry pair \(\langle E_{P}, E_{Q}\rangle \) not pruned, in ascending order of \(mindist\), until \(mindist(E_{P},\,E_{Q})>maxC\textit{PD}_{k}\) (line 3). Note that, if there is a tie on \(mindist\), the intersection area between two entries serves as a tie-breaker, and larger is better; however, RMA can easily use other tie-breakers (e.g., \(maxdist\)). Finally, the algorithm returns the result set \(S_{R}\) (line 4). Note that \(S_R\) is maintained using a max-heap, which keeps the \(k\) closest pairs found so far in descending order of their distances during \(\hbox {M}k\hbox {CP}\) search.

Algorithm 1 Recursive \(\hbox {M}k\hbox {CP}\) Algorithm (RMA)

For each currently visited intermediate entry pair \(\langle E_{P},\,E_{Q}\rangle \) pointing to non-leaf nodes, RMA-PEP uses the bidirectional node expansion technique to expand \(\langle E_{P},\,E_{Q}\rangle \) (lines 6–11). Initially, it initializes a local min-heap \(H\) storing the subentry pairs of \(\langle E_{P},\,E_{Q}\rangle \) in ascending order of their \(mindist\hbox {s}\). In order to reduce the quadratic cost of checking the Cartesian product of \(\langle e_{Pi},\,e_{Qj}\rangle \,(e_{Pi} \in E_{P},\,e_{Qj} \in E_{Q})\), it invokes a function PRUNE, which removes from consideration (1) all \(e_{Pi} \in E_{P}\) for which \(emindist(e_{Pi},\,E_{Q})\) or \(mindist(e_{Pi},\,E_{Q})\) is larger than \(maxC\textit{PD}_{k}\) (lines 15–16), and (2) all \(e_{Qj} \in E_{Q}\) for which \(emindist(e_{Qj},\,E_{P})\) or \(mindist(e_{Qj},\,E_{P})\) is larger than \(maxC\textit{PD}_{k}\) (lines 18–19). Note that \(emindist(e_{Pi},\,E_{Q})= \vert d(E_{P}.RO,\,E_{Q}.RO)-e_{Pi}.\textit{PD}\vert -e_{Pi}.r-E_{Q}.r\) and \(emindist(E_{P},e_{Qj})=\vert d(E_{P}.RO,E_{Q}.RO)-e_{Qj}.\textit{PD}\vert -E_{P}.r-e_{Qj}.r\), as \(d(E_{P}.RO,E_{Q}.RO)\) is already known. For each remaining subentry pair \(\langle e_{Pi},\,e_{Qj}\rangle \), PRUNE adds it to \(H\) and updates \(maxC\textit{PD}_{k}\) using Lemmas 1–2, if \(\langle e_{Pi},\,e_{Qj}\rangle \) cannot be pruned by Rules 1–2 (lines 20–23), where \(emindist(e_{Pi},\,e_{Qj})=\hbox {max}\{\vert d(e_{Pi}.RO,\,E_{Q}.RO)-e_{Qj}.\textit{PD}\vert ,\vert d(e_{Qj}.RO,\, E_{P}.RO)- e_{Pi}.\textit{PD}\vert \}- e_{Pi}.r-e_{Qj}.r\). Next, RMA-PEP is called recursively for every entry pair \(\langle e_{Pi},\,e_{Qj}\rangle \) in \(H\) until \(mindist(e_{Pi},\,e_{Qj})> maxC\textit{PD}_{k}\) or \(H=\emptyset \) (lines 8–11). At each newly visited intermediate entry pair \(\langle E_{P}\), \(E_{Q}\rangle \) pointing to leaf nodes, it calls PRUNE to update the result set \(S_{R}\) and \(maxC\textit{PD}_{k}\), respectively, using each qualified object pair (line 13).

We illustrate RMA using the running example depicted in Fig. 7; please refer to Appendix A for details.

Lemma 3

The proposed algorithm RMA returns exactly the actual \(\hbox {M}k\hbox {CP}\) query result, i.e., the algorithm has no false negatives, no false positives, and the returned result set contains no duplicate object pairs.

Proof

First, no real answer object pairs are missed (i.e., no false negatives), as only unqualified (leaf and non-leaf) entry pairs are pruned by Rules 1–2. Second, assume, to the contrary, that a false answer object pair \(\langle p_i', q_j' \rangle \) is returned; then the returned \(C\textit{PD}_k'\) must be larger than the actual \(C\textit{PD}_k\). Any real answer object pair \(\langle p_i, q_j \rangle \) cannot be pruned by Rules 1–2 due to \(d(p_i, q_j)<C\textit{PD}_k<C\textit{PD}_k' \le maxC\textit{PD}_k\), and hence can be used to update the result set. Thus, all the actual answer object pairs are returned, which contradicts our assumption, ensuring no false positives. Third, no duplicate object pairs are guaranteed because all the qualified (leaf and non-leaf) entry pairs are pushed into each local min-heap in ascending order of their \(mindist\hbox {s}\), and every entry pair is evaluated at most once and is popped right after evaluation. \(\square \)

Treatment for different tree heights If the COM-trees \(M_{P}\) and \(M_{Q}\) have different heights, RMA needs to process entry pairs at different levels. In general, there are two approaches for treating different heights, i.e., “fix-at-root” and “fix-at-leaf.” Following [18], in this paper, we take the “fix-at-leaf” strategy, which processes intermediate entry pairs as usual. However, for a pair consisting of a leaf entry and an intermediate entry, it stops propagating down the leaf entry, while still propagating down the intermediate entry.

Algorithm 2 Iterative \(\hbox {M}k\hbox {CP}\) Algorithm (IMA)

4.2 Iterative \(\hbox {M}k\hbox {CP}\) algorithm

In order to avoid recursion and visiting unnecessary entry pairs (e.g., \(\langle E_{P6},\,E_{Q3}\rangle \) in Example 1, as depicted in Appendix A), we develop the Iterative MkCP Algorithm (IMA), as presented in Algorithm 2, which follows the best-first paradigm. First, IMA initializes \(maxC\textit{PD}_{k}\) to infinity and an empty global min-heap \(H\), which stores entry pairs in ascending order of \(mindist\) (line 1). Then, it updates \(maxC\textit{PD}_{k}\) using Lemmas 1–2, and meanwhile inserts the qualified root entry pairs that cannot be pruned by Rule 1 into \(H\) (line 2). Thereafter, IMA iteratively evaluates the head entry pair \(\langle E_{P},\,E_{Q}\rangle \) of \(H\) having the smallest \(mindist\) until \(mindist(E_{P},\,E_{Q})>maxC\textit{PD}_{k}\) or \(H=\emptyset \) (lines 3–7). Specifically, for each top entry pair \(\langle E_{P},\,E_{Q}\rangle \) of \(H\), it determines whether \(mindist(E_{P},\,E_{Q})\) is larger than \(maxC\textit{PD}_{k}\). If yes, the algorithm stops and returns the result set \(S_{R}\). Otherwise, IMA invokes the function IMA-PEP (line 7) to insert qualified subentry pairs into \(H\) (lines 9–10) or update the result set \(S_{R}\) (lines 10–11). Finally, IMA returns the result set \(S_{R}\) (line 8).
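The best-first skeleton of IMA can be condensed as below. This is a simplified sketch: it assumes both COM-trees have equal height, uses hypothetical helpers `children` and `is_object`, relies on the `mindist` function sketched in Sect. 3.2, and applies only Rule 1 (Rule 2 and the Lemma 1–2 updates are omitted).

```python
import heapq
from itertools import count

def ima_sketch(root_p, root_q, k, d):
    tie = count()                       # breaks ties so heap tuples stay comparable
    results = []                        # max-heap via negated distances, size <= k
    frontier = [(0.0, next(tie), root_p, root_q)]

    def max_cpd():                      # current maxCPD_k
        return -results[0][0] if len(results) == k else float("inf")

    while frontier:
        md, _, ep, eq = heapq.heappop(frontier)
        if md > max_cpd():              # Rule 1: no better pair can remain
            break
        for cp in children(ep):
            for cq in children(eq):
                if is_object(cp) and is_object(cq):
                    dist = d(cp, cq)
                    if dist < max_cpd():
                        item = (-dist, next(tie), (cp, cq))
                        if len(results) < k:
                            heapq.heappush(results, item)
                        else:
                            heapq.heapreplace(results, item)
                else:
                    lb = mindist(cp, cq, d)
                    if lb <= max_cpd():
                        heapq.heappush(frontier, (lb, next(tie), cp, cq))
    return sorted((-nd, pair) for nd, _, pair in results)
```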

We illustrate IMA using the running example depicted in Fig. 7; please refer to Appendix B for details.

The correctness proof for IMA is similar to that for RMA and thus omitted to save space. Next, we analyze its I/O efficiency as follows.

Lemma 4

IMA visits only the intermediate entry pairs \(\langle E_{P},\,E_{Q}\rangle \) with \(mindist(E_{P},\,E_{Q})\le C\textit{PD}_{k}\), and visits each at most once.

Proof

Assume, to the contrary, that IMA visits an intermediate entry pair \(\langle E_{P},\,E_{Q}\rangle \) with \(mindist(E_{P},\,E_{Q})> C\textit{PD}_{k}\). Consider all the intermediate entry pairs \(\langle E_{P}',\,E_{Q}' \rangle \) that contain answer object pairs \(\langle p,\,q\rangle \); we have \(mindist(E_{P}',\,E_{Q}')\le d(p,\,q)\le C\textit{PD}_{k}\). Since IMA visits entry pairs in ascending order of \(mindist\hbox {s}\), all \(\langle E_{P}',\,E_{Q}' \rangle \) are visited before \(\langle E_{P},\,E_{Q}\rangle \), and hence, \(maxC\textit{PD}_{k}\) is updated to \(\textit{CPD}_{k}\) before \(\langle E_{P},\,E_{Q}\rangle \) is accessed. Therefore, \(\langle E_{P},\,E_{Q}\rangle \) is pruned by Rule 1, which contradicts our assumption. To complete the proof, we need to show that an entry pair is not visited multiple times. This is straightforward because each entry pair is evaluated only once and popped right after evaluation. \(\square \)

4.3 Estimation-based hybrid \(\hbox {M}k\hbox {CP}\) algorithm

As pointed out in [18], recursion favors LRU buffer(s) in the backtracking phase. Thus, although RMA may visit unnecessary entry pairs, it might achieve better I/O performance than IMA in the presence of an LRU buffer. To further reduce the I/O cost under LRU, we present a hybrid traversal paradigm, which combines the best-first (BF) and depth-first (DF) strategies. In particular, it utilizes a stack \(S\) to access the intermediate entry pairs \(\langle E_{P},\,E_{Q}\rangle \) with \(mindist(E_{P},\,E_{Q})=0\) in the DF fashion, and maintains min-heaps to visit the other intermediate entry pairs in the BF manner. Take Fig. 8 as an example. Figure 8b, c depict the contents of the LRU buffers when visiting the entry pairs (shown in Fig. 8a) with \(mindist = 0\) in the DF and BF paradigms, respectively. In this case, we use two LRU buffers, for \(M_{P}\) and \(M_{Q}\), respectively, and red stars in each LRU buffer represent page faults. We observe that the DF paradigm needs only 6 page faults, fewer than the 9 needed by the BF paradigm. Note that, in real-life applications, the object sets \(P\) and \(Q\) are likely to overlap, resulting in many entry pairs having \(mindist = 0\). Hence, the hybrid traversal paradigm tends to achieve the best I/O performance in most cases, which is also confirmed by our experiments (see Sect. 6.2).

Fig. 8 Example of the I/O cost under LRU buffers. a \(E_P\) and \(E_Q\). b Depth-first. c Best-first

Algorithm 3 Estimation-based Hybrid \(\hbox {M}k\hbox {CP}\) Algorithm (EHM)

In addition, in RMA and IMA, \(max\textit{CPD}_{k}\) is initially set to infinity and becomes smaller as the algorithms proceed. The adaptation of the \(max\textit{CPD}_{k}\) value has a crucial impact on the performance of \(\hbox {M}k\hbox {CP}\) retrieval, since \(max\textit{CPD}_{k}\) is employed by Rules 1–2 to prune the search space. If \(max\textit{CPD}_{k}\) approaches the real \(\textit{CPD}_{k}\) slowly, which is true especially for larger \(k\), the early stage of the algorithms cannot efficiently shrink the search space. To address this, we develop our third algorithm, namely the Estimation-based Hybrid MkCP Algorithm (EHM), whose pseudo-code is shown in Algorithm 3. EHM introduces a new pruning metric \(e\textit{CPD}_{k}\), an estimate of \(\textit{CPD}_{k}\), for aggressive pruning; we defer the detailed discussion of the \(e\textit{CPD}_{k}\) computation to the end of this section. Specifically, in the aggressive pruning step, (1) \(max\textit{CPD}_{k}\), as in RMA and IMA, is utilized to prune unqualified entry pairs, and (2) \(e\textit{CPD}_{k}\) is used to further prune the search space. However, pruning becomes too aggressive if an underestimated value is chosen, i.e., if \(e\textit{CPD}_{k}\) is smaller than \(\textit{CPD}_{k}\). To avoid any false dismissal, two min-heaps EH and CH are employed to store the entry pairs pruned by \(e\textit{CPD}_{k}\). After the aggressive pruning stage, EHM searches EH and CH for the final result set \(S_{R}\) as compensation.

Algorithm 4 EHM-AP
Algorithm 5 EHM-C

Algorithm 4 depicts the pseudo-code of EHM-AP. Initially, EHM-AP updates \(max\textit{CPD}_{k}\) using Lemmas 1–2, and adds the qualified root entry pairs \(\langle E_{P},\,E_{Q}\rangle \) that are not discarded by Rule 1 to \(S\), \(H\), or CH, according to \(mindist(E_{P},\,E_{Q})\) (line 1). Then, it performs a while-loop (lines 2–4) to visit entry pairs in \(S\) until \(S\) is empty. Each time, the algorithm pops the head entry pair \(\langle E_{P},\,E_{Q}\rangle \) of \(S\) and calls the function EHM-PEP to expand \(\langle E_{P},\,E_{Q}\rangle \). Next, EHM-AP runs another while-loop (lines 5–9) to visit entry pairs in \(H\) until \(H\) is empty. Specifically, it deheaps the top entry from \(H\) and determines whether the early termination condition is satisfied. If yes, the algorithm stops; otherwise, it invokes EHM-PEP to expand \(\langle E_{P},\,E_{Q}\rangle \). In EHM-PEP, no matter whether the currently visited entry pair \(\langle E_{P},\,E_{Q}\rangle \) points to non-leaf nodes or leaf nodes, it first prunes \(e_{Pi} \in E_{P}\) and \(e_{Qj} \in E_{Q}\), respectively, similarly to the function PRUNE (depicted in Algorithm 1), and then, for all the remaining entries \(e_{Pi}\) and \(e_{Qj}\) not pruned, it inserts all the entry pairs \(\langle e_{Pi},\,e_{Qj}\rangle \) with \(e\textit{CPD}_{k}<emindist(e_{Pi},\,e_{Qj})\le max\textit{CPD}_{k}\) into EH (lines 11–21 and 31–42). Thereafter, if \(mindist(e_{Pi},\,e_{Qj})\le max\textit{CPD}_{k}\), (1) for non-leaf entry pairs \(\langle e_{Pi},\,e_{Qj}\rangle \), EHM-PEP updates \(max\textit{CPD}_{k}\) using Lemmas 1–2, and adds \(\langle e_{Pi},\,e_{Qj}\rangle \) to \(S\), CH, or \(H\) based on \(mindist(e_{Pi},\,e_{Qj})\) (lines 22–29), and (2) for leaf entry pairs, it updates \(S_{R}\) and \(max\textit{CPD}_{k}\) (line 43).

The pseudo-code of EHM-C is presented in Algorithm 5. It performs a while-loop until both EH and CH are empty. In every loop, EHM-C gets the top entry pairs \(\langle E_{P},\,E_{Q}\rangle \) and \(\langle E_{P}',\,E_{Q}' \rangle \) from EH and CH, respectively. If the minimum of \(emindist(E_{P},\,E_{Q})\) and \(mindist(E_{P}',\,E_{Q}')\) is larger than \(max\textit{CPD}_{k}\), the algorithm terminates, since EH and CH cannot contain any actual answer object pair (lines 3–4). Otherwise, EHM-C compares \(emindist(E_{P},\,E_{Q})\) with \(mindist(E_{P}',\,E_{Q}')\). If \(emindist(E_{P},\,E_{Q})< mindist(E_{P}',\,E_{Q}')\), EHM-C pops the head entry pair \(\langle E_{P},\,E_{Q}\rangle \) from EH, and then checks whether \(\langle E_{P},\,E_{Q}\rangle \) can be pruned by Rule 1 (line 7). If not, (1) for a leaf entry pair, EHM-C updates \(S_{R}\) and \(max\textit{CPD}_{k}\) (lines 8–9); and (2) for a non-leaf entry pair, it inserts \(\langle E_{P},\,E_{Q}\rangle \) into CH for later evaluation (lines 10–11). If \(emindist(E_{P},\,E_{Q})\ge mindist(E_{P}',\,E_{Q}')\), EHM-C pops the top entry pair \(\langle E_{P}',\,E_{Q}'\rangle \) from CH and calls IMA-PEP (depicted in Algorithm 2) to expand \(\langle E_{P}',\,E_{Q}' \rangle \).

We illustrate EHM using the running example shown in Fig. 7; please refer to Appendix C for details.

Lemma 5

EHM returns exactly the actual \(\hbox {M}k\hbox {CP}\) query result, i.e., the algorithm has no false negatives, no false positives, and the result set contains no duplicate object pairs.

Proof

First, no real answer object pairs are missed (i.e., no false negatives), as only unqualified (leaf and non-leaf) entry pairs are discarded by Rules 1–2. In particular, for EHM, the entry pairs pruned using the aggressive pruning technique are preserved (not discarded), to be verified in the compensation phase. Second, all object entry pairs that cannot be discarded by Rules 1–2 are verified against other qualified entry pairs, ensuring no false positives. Third, no duplicate object pairs are guaranteed because every qualified entry pair is pushed into only one corresponding min-heap according to its \(mindist\) and is popped right after evaluation. \(\square \)

Lemma 6

EHM visits only the intermediate entry pairs \(\langle E_{P},\,E_{Q}\rangle \) with \(mindist(E_{P},\,E_{Q})\le \textit{CPD}_{k}\), and visits each at most once.

Proof

Assume, to the contrary, that EHM visits an intermediate entry pair \(\langle E_{P},\,E_{Q}\rangle \) with \(mindist(E_{P},\,E_{Q})>\textit{CPD}_{k}\). For all the intermediate entry pairs \(\langle E_{P}',\,E_{Q}' \rangle \) that contain the answer object pairs \(\langle p,\,q\rangle \), \(mindist(E_{P}',\,E_{Q}')\le d(p,\,q)\le \textit{CPD}_{k}\). EHM accesses the intermediate entry pairs whose \(mindist\) equals 0 in the DF paradigm, and the other intermediate entry pairs not pruned in the BF paradigm; thus, in general, EHM visits entry pairs in ascending order of \(mindist\hbox {s}\). Hence, all \(\langle E_{P}',\,E_{Q}' \rangle \) are visited before \(\langle E_{P},\,E_{Q}\rangle \), and \(max\textit{CPD}_{k}\) is updated to \(\textit{CPD}_{k}\) before \(\langle E_{P},\,E_{Q}\rangle \) is accessed. Thus, \(\langle E_{P},\,E_{Q}\rangle \) can be pruned by Rule 1, which contradicts our assumption. To complete the proof, we need to show that an entry pair is not visited multiple times. This is straightforward as each entry pair is evaluated only once and is popped right after evaluation. \(\square \)

\({ {eCPD}}_{k}\) computation A challenging issue for EHM is to obtain an appropriate value of \(e\textit{CPD}_{k}\). If the value of \(e\textit{CPD}_{k}\) is too big, it cannot alleviate the slow start; if it is too small, additional cost is needed to perform compensation. To estimate \(\textit{CPD}_{k}\) accurately, we can utilize the distance distribution to obtain the \(e\textit{CPD}_{k}\) value. The overall distribution of distances over two metric datasets \(P\) and \(Q\) is defined as:

$$\begin{aligned} F(r)=\mathbf{Pr} \{d(p,q)\le r\} \end{aligned}$$
(1)

where \(p\) is a random object in \(P\), and \(q\) is a random object in \(Q\). The distance distribution is the correct counterpart of data distribution used for Euclidean spaces, since it is the natural way to characterize metric datasets.

Based on the above definition of the distance distribution \(F\), \(\textit{CPD}_{k}\) can be estimated as the minimal \(r\) such that at least \(k\) object pairs have their distances bounded by \(r\):

$$\begin{aligned} e\textit{CPD}_{k} = min\{r \vert \vert P\vert \times \vert Q\vert \times F(r) \ge k\} \end{aligned}$$
(2)

As an example, assume that \(F(r)\) follows Uniform distribution in the range [0, 1] (i.e., \(0 \le r \le 1\)), then \(e\textit{CPD}_k\) can be set to \(k/(|P| \times |Q|)\). Note that, \(e\textit{CPD}_{k}\) calculated by formula (2) can also be used in our cost model (see Sect. 4.4) to estimate \(\textit{CPD}_{k}\).
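A simple way to realize Eq. (2) is to approximate \(F\) with sampled cross-pair distances. The sketch below is ours, and the sample size is a tuning knob rather than a value from the paper.

```python
import random

def estimate_ecpd(P, Q, k, d, samples=1000):
    """Estimate CPD_k via an empirical distance distribution F (Eqs. 1-2).

    Samples random cross pairs and returns the smallest sampled r with
    |P| * |Q| * F(r) >= k.
    """
    dists = sorted(d(random.choice(P), random.choice(Q)) for _ in range(samples))
    total = len(P) * len(Q)
    for i, r in enumerate(dists, 1):
        if total * i / samples >= k:    # empirical F(r) = i / samples
            return r
    return dists[-1]
```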

4.4 Cost model

Next, we derive a node-based cost model for \(\hbox {M}k\hbox {CP}\) retrieval because (1) it can be utilized by databases to choose promising execution plans, and (2) it identifies the factors that may affect the cost of an \(\hbox {M}k\hbox {CP}\) query and thus facilitates better algorithm development.

In order to take a more general view and be able to deal with generic metric spaces, our cost model relies on distance (rather than data) distribution \(F\), as \(F\) is the only information derivable from the analysis of metric datasets.

For an I/O optimal \(\hbox {M}k\hbox {CP}\) search algorithm, an entry pair \(\langle E_{P},\,E_{Q}\rangle \) has to be accessed iff \(mindist(E_{P},\,E_{Q})\le \textit{CPD}_{k}\) holds. Then, the probability of visiting \(\langle E_{P},\,E_{Q}\rangle \) can be denoted as:

$$\begin{aligned}&\mathbf Pr (\langle E_{P}, E_{Q}\rangle \text { is accessed}) \nonumber \\&\quad = \mathbf Pr (mindist(E_P, E_Q) \le \textit{CPD}_k) \nonumber \\&\quad = \mathbf Pr (d(E_P.RO, E_Q.RO) \le E_P.r + E_Q.r + \textit{CPD}_k) \nonumber \\&\quad = F(E_P.r + E_Q.r + \textit{CPD}_k) \end{aligned}$$
(3)

To determine the expected I/O cost (EIC) in terms of node (i.e., intermediate entry) accesses, it is sufficient to sum the above probabilities over all intermediate entry pairs between \(M_{P}\) and \(M_{Q}\):

$$\begin{aligned} EIC=2\times \sum _{i=1}^{\vert M_P \vert \times \vert M_Q \vert }{F({{E}_{P}}.r+{{E}_{Q}}.r+CP{{D}_{k}})} \end{aligned}$$
(4)

where \(\textit{CPD}_{k}\) is set to the value calculated by Eq. (2).
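Eq. (4) reduces to a double sum once \(F\) and the intermediate entries are available; a direct (if quadratic) sketch, where \(F\) is the estimated distance distribution and `cpd_k` the Eq. (2) estimate:

```python
def expected_io(entries_p, entries_q, F, cpd_k):
    """Expected node accesses for an I/O-optimal algorithm (Eq. 4).

    entries_p / entries_q enumerate all intermediate entries of M_P / M_Q.
    """
    return 2 * sum(F(ep.r + eq.r + cpd_k)
                   for ep in entries_p for eq in entries_q)
```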

Next, we estimate the CPU cost for \(\hbox {M}k\hbox {CP}\) retrieval in terms of selectivity [14], which is the ratio of distance computations to the total number of object pairs, i.e., # of distance computations \(/(\vert P\vert \times \vert Q\vert )\). As \(\vert P\vert \times \vert Q\vert \) can be easily computed, we only need to obtain the number of distance computations during \(\hbox {M}k\hbox {CP}\) search.

When expanding an intermediate entry pair \(\langle E_{P},\,E_{Q}\rangle \), our proposed algorithms first prune away \(e_{Pi} \in E_{P}\) and \(e_{Qj} \in E_{Q}\), respectively, for minimizing the cost of checking the Cartesian product of \(\langle e_{Pi},\,e_{Qj}\rangle \), and then evaluate every qualified entry pair \(\langle e_{Pi},\,e_{Qj}\rangle \). Therefore, the number of distance computations (NDC) for processing every \(\langle E_{P},\,E_{Q}\rangle \) can be calculated as:

$$\begin{aligned}&NDC(E_{P}, E_{Q}) = E_{P}.num \times \mathbf Pr (mindist(e_{Pi}, E_{Q})\text { is computed})\nonumber \\&\quad + E_{Q}.num \times \mathbf Pr (mindist(E_{P}, e_{Qj})\text { is computed}) + E_{P}.num \nonumber \\&\quad \times E_{Q}.num \times \mathbf Pr (mindist(e_{Pi}, e_{Qj}) \text { is computed}) \end{aligned}$$
(5)

For ease of analysis, we approximate \(max\textit{CPD}_{k}\) with \(\textit{CPD}_{k}\). Hence, a distance (i.e., \(mindist\)) needs to be calculated for a leaf or non-leaf entry pair \(\langle E_{P},\,E_{Q}\rangle \) iff \(emindist(E_{P},\,E_{Q})\le \textit{CPD}_{k}\). As \(emindist(e_{Pi},\,E_{Q})=\vert d(E_{P}.RO,\,E_{Q}.RO)-e_{Pi}.PD\vert -e_{Pi}.r-E_{Q}.r\), the probability that \(mindist(e_{Pi},\,E_{Q})\) needs to be computed can be denoted as:

$$\begin{aligned}&\mathbf Pr (mindist(e_{Pi}, E_{Q}) \text { is computed}) \nonumber \\&\quad = \mathbf Pr (emindist(e_{Pi}, E_{Q}) \le \textit{CPD}_{k}) \nonumber \\&\quad = \mathbf Pr (\vert d(E_{P}.RO, E_{Q}.RO) - e_{Pi}.PD \vert \nonumber \\&\qquad \le e_{Pi}.r + E_{Q}.r + \textit{CPD}_{k}) \nonumber \\&\quad = F(e_{Pi}.PD + e_{Pi}.r + E_{Q}.r + \textit{CPD}_{k}) \nonumber \\&\qquad -F(e_{Pi}.PD - e_{Pi}.r - E_{Q}.r - \textit{CPD}_{k} - \delta ) \end{aligned}$$
(6)

In Eq. (6), \(\delta =0\) when \(F\) is continuous (e.g., \(L_p\)-norm), and \(\delta =1\) when \(F\) is discrete (e.g., edit distance). Similarly, since \(emindist(E_{P},\,e_{Qj})=\vert d(E_{P}.RO,\,E_{Q}.RO)-e_{Qj}.PD\vert -E_{P}.r-e_{Qj}.r\), the probability that \(mindist(E_{P},\,e_{Qj})\) needs to be calculated can be denoted as:

$$\begin{aligned}&\mathbf Pr (mindist(E_{P}, e_{Qj}) \text { is computed}) \nonumber \\&\quad = F(e_{Qj}.PD + E_{P}.r + e_{Qj}.r + \textit{CPD}_{k}) \nonumber \\&\qquad - F(e_{Qj}.PD - E_{P}.r - e_{Qj}.r - \textit{CPD}_{k} - \delta ) \end{aligned}$$
(7)

For every entry pair \(\langle e_{Pi},\,e_{Qj}\rangle \), \(emindist(e_{Pi},\,e_{Qj})=\hbox {max}\{\vert d(e_{Pi}.RO,\,E_{Q}.RO)-e_{Qj}.PD\vert ,\,\vert d(E_{P}.RO,\,e_{Qj}.RO)-e_{Pi}.PD\vert \}- e_{Pi}.r - e_{Qj}.r\). Let \(\lambda =e_{Pi}.r+e_{Qj}.r+\textit{CPD}_{k}\). Since \(\langle e_{Pi}.RO,\,E_{Q}.RO\rangle \) and \(\langle E_{P}.RO,\,e_{Qj}.RO\rangle \) are two independent object pairs, the probability of computing \(mindist(e_{Pi},\,e_{Qj})\) is expressed as:

$$\begin{aligned}&\mathbf{Pr}(mindist(e_{Pi}, e_{Qj}) \text { is computed}) \nonumber \\&\quad = \mathbf{Pr}(\vert d(e_{Pi}.RO, E_{Q}.RO) - e_{Qj}.PD \vert \le \lambda ) \times \mathbf{Pr}(\vert d(E_{P}.RO, e_{Qj}.RO) - e_{Pi}.PD \vert \le \lambda ) \nonumber \\&\quad = (F(e_{Qj}.PD + \lambda ) - F(e_{Qj}.PD - \lambda - \delta )) \times (F(e_{Pi}.PD + \lambda ) - F(e_{Pi}.PD - \lambda - \delta )) \end{aligned}$$

(8)

To determine the expected selectivity cost (ESC), we sum the above NDC over all intermediate entry pairs between \(M_{P}\) and \(M_{Q}\) and normalize by \(\vert P\vert \times \vert Q\vert \):

$$\begin{aligned} ESC = \frac{\sum \nolimits _{\langle E_P, E_Q\rangle } {NDC(E_P, E_Q)} }{\vert P\vert \times \vert Q\vert } \end{aligned}$$

(9)

In summary, Eqs. (4) and (9) indicate that the query cost of \(\hbox {M}k\hbox {CP}\) search depends on several factors: (1) the cardinalities of the object sets, (2) the M-tree/COM-tree structure, (3) the value of \(k\), and (4) the distance distribution \(F(r)\), which are verified one by one in Sect. 6.2. Note that, although the data distribution is an important factor that may affect query cost in the Euclidean space, the distance distribution is its counterpart in the metric space, and it can be obtained from a sampled dataset.
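To make the cost model concrete, the following Python sketch (ours, for illustration only; the class and function names are hypothetical) evaluates the probabilities of Eqs. (6)–(8) against a sampled distance distribution \(F\), approximated by an equi-width histogram as in Sect. 6.1:

```python
class DistanceDistribution:
    """Equi-width histogram approximation of F(r) = Pr{d(x, y) <= r},
    built from sampled pairwise distances (hypothetical helper)."""

    def __init__(self, sampled_distances, num_bins=20000):
        self.width = max(sampled_distances) / num_bins
        counts = [0] * (num_bins + 1)
        for d in sampled_distances:
            counts[min(int(d / self.width), num_bins)] += 1
        total, cum, self.cdf = len(sampled_distances), 0, []
        for c in counts:                     # cumulative counts -> empirical CDF
            cum += c
            self.cdf.append(cum / total)

    def __call__(self, r):                   # F(r)
        if r < 0:
            return 0.0
        return self.cdf[min(int(r / self.width), len(self.cdf) - 1)]


def pr_mindist_computed(F, pd, r_sub, r_other, cpd_k, delta=0.0):
    """Eqs. (6)/(7): probability that mindist between a sub-entry (parent
    distance pd, radius r_sub) and an entry of radius r_other must be
    computed, given the current k-closest-pair distance bound cpd_k."""
    return F(pd + r_sub + r_other + cpd_k) - F(pd - r_sub - r_other - cpd_k - delta)


def pr_pair_computed(F, pd_i, r_i, pd_j, r_j, cpd_k, delta=0.0):
    """Eq. (8): both independent lower bounds must fall within lambda."""
    lam = r_i + r_j + cpd_k
    return ((F(pd_j + lam) - F(pd_j - lam - delta)) *
            (F(pd_i + lam) - F(pd_i - lam - delta)))
```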

5 Extensions

In this section, we study two interesting variants of \(\hbox {M}k\hbox {CP}\) queries, i.e., Self Metric kCP \((\hbox {SM}k\hbox {CP})\) search and Approximate Metric kCP \((\hbox {AM}k\hbox {CP})\) retrieval, and show how our proposed algorithms and pruning rules can be adapted to solve these variants.

5.1 Self \(\hbox {M}k\hbox {CP}\) search

The \(\hbox {M}k\hbox {CP}\) query focuses on two different object sets, i.e., \(P\ne Q\). However, in some real-life applications [6, 18, 30], the two object sets may be identical (i.e., \(P=Q\)); for example, clustering [30] and outlier detection [6] algorithms operate on a single object set. Motivated by this, we introduce a natural variant of \(\hbox {M}k\hbox {CP}\) queries, namely Self Metric \(k\hbox {CP}\,(\hbox {SM}k\hbox {CP})\) retrieval.

Definition 5

(\({ SM}k{ CP}\) Search). Given an object set \(P\) in a metric space and an integer \(k\,(1 \le k \le (\vert P\vert ^{2}-\vert P\vert )/2)\), a Self Metric \(k\hbox {CP}\,(\hbox {SM}k\hbox {CP})\) query finds \(k\) ordered different closest object pairs \(\langle p_{i},\,p_{j}\rangle \) with \(p_{i},\,p_{j}\in P\) and \(p_{i} \ne p_{j}\).

An example of \(\hbox {SM}k\hbox {CP}\) search is depicted in Fig. 9, where \(P=\{p_{1},\,p_{2},\,p_{3},\,p_{4}\}\) and \(k=2\). The result set of the SM2CP query is \(\{\langle p_{1},\,p_{2}\rangle ,\,\langle p_{2},\,p_{3}\rangle \}\).

Fig. 9
figure 9

Example of \(\hbox {SM}k\hbox {CP}\) search. a Illustration of a SM2CP query. b Illustration of Lemma 7

5.1.1 Estimation-based hybrid \(\hbox {SM}k\hbox {CP}\) algorithm

Our algorithms designed for \(\hbox {M}k\hbox {CP}\) queries (discussed in Sect. 4) can be flexibly adapted to support \(\hbox {SM}k\hbox {CP}\) retrieval. Since \(\langle p_{i},\,p_{j}\rangle \) and \(\langle p_{j},\,p_{i}\rangle \) are treated as the same object pair, and \(\langle p_{i},\,p_{i}\rangle \) cannot be contained in the final result set of \(\hbox {SM}k\hbox {CP}\) search, a straightforward solution is to perform an \(\hbox {M}(\vert P\vert +2k)\hbox {CP}\) query and filter out unqualified and duplicated object pairs: the \(\vert P\vert \) self-pairs \(\langle p_{i},\,p_{i}\rangle \) have distance zero, and each remaining unordered pair appears twice, so \(\vert P\vert +2k\) ordered pairs are needed to guarantee \(k\) valid results. Nevertheless, this is very inefficient, as \(\vert P\vert + 2k\) is much larger than \(k\), especially for a large object set \(P\). In order to further improve \(\hbox {SM}k\hbox {CP}\) query performance, we develop the Estimation-based Hybrid SMkCP Algorithm (EHS), which takes advantage of the EHM algorithm (the best performer, as verified by our experiments in Sect. 6.2), and meanwhile integrates the characteristics of \(\hbox {SM}k\hbox {CP}\) retrieval.

Since \(\hbox {SM}k\hbox {CP}\) search operates on a single object set, i.e., \(P=Q\), the intermediate entries of an entry pair \(\langle E_{P},\,E_{Q}\rangle \) to be processed can be identical. However, Lemmas 1 and 2, used to derive \(max\textit{CPD}_{k}\) values, are only applicable when the two given intermediate entries are different. Thus, we present a new lemma to cover the situation where \(E_{P}\) and \(E_{Q}\) point to the same entry.

Lemma 7

Given an intermediate entry \(E_{P}\),

$$\begin{aligned} max\textit{CPD}_{k} = \left\{ \begin{array}{ll} E_{P}.r &{} \quad E_{P}.num>k \\ 2\times E_{P}.r &{} \quad E_{P}.num\le k\le (E_{P}.num^{2}-E_{P}.num)/2 \\ \end{array}\right. \end{aligned}$$

Proof

To prove this lemma, recall that \(max\textit{CPD}_{k}\) can be updated to \(r\) whenever there exist \(k\) different object pairs with their distances bounded by \(r\).

If \(E_{P}.num> k\), there exist \(k\) different object pairs \(\langle p_i,\,E_{P}.RO\rangle \,(p_i \ne E_{P}.RO,\,p_i \in E_{P})\) with their distances \(d(p_i,\,E_{P}.RO)\) bounded by \(E_{P}.r\), as \(E_{P}.RO\) is a real object. Hence, \(max\textit{CPD}_{k}\) can be set to \(E_{P}.r\) when \(E_{P}.num>k\).

Otherwise, i.e., \(E_{P}.num \le k\), as depicted in Fig. 9b, there exist \((E_{P}.num^{2}-E_{P}.num)/2\) different object pairs \(\langle p_{i},\,p_{j}\rangle \) in \(ST_{E_P}\) with their distances \(d(p_{i},\,p_{j})\le p_{i}.PD+p_{j}.PD \le 2 \times E_{P}.r\), due to the triangle inequality. Consequently, \(max\textit{CPD}_{k}\) can be set to \(2 \times E_{P}.r\) if \(E_{P}.num \le k \le (E_{P}.num^{2}-E_{P}.num)/2\). \(\square \)

The framework of EHS is similar to that of EHM; the only difference is that EHM calls EHM-AP for aggressive pruning, whereas EHS invokes EHS-AP, presented in Algorithm 6. Initially, EHS-AP updates \(max\textit{CPD}_{k}\) using Lemmas 1, 2, and 7, and inserts the root entry pairs \(\langle E_{P},\,E_{Q}\rangle \,(E_{P} \in M_{P},\,E_{Q}\in M_{P})\) not pruned by Rule 1 into \(S\) (for the DF traversal), \(H\) (for the BF traversal), or CH (for compensation), according to \(mindist(E_{P},\,E_{Q})\) (line 1). Then, EHS-AP visits each entry pair \(\langle E_{P},\,E_{Q}\rangle \) of \(S\) until \(S\) is empty, and calls the EHS-PEP function to expand \(\langle E_{P},\,E_{Q}\rangle \) (lines 2–4). Next, EHS-AP visits every entry pair \(\langle E_{P},\,E_{Q}\rangle \) of \(H\) until \(H\) is empty or \(mindist(E_{P},\,E_{Q})> max\textit{CPD}_{k}\), and invokes EHS-PEP to expand \(\langle E_{P},\,E_{Q}\rangle \) (lines 5–9).

In EHS-PEP, if \(E_{P}\ne E_{Q}\), it calls EHM-PEP directly (line 33). Otherwise, i.e., \(E_{P}=E_{Q}\), if \(E_{P}\) points to a non-leaf node, then for each sub-entry \(e_{Pi}\) of \(E_{P}\), it pushes \(\langle e_{Pi},\,e_{Pi}\rangle \) into \(S\) and updates \(max\textit{CPD}_{k}\) using Lemma 7 (line 13). To avoid duplicated entry pair accesses, for every sub-entry \(e_{Pj}\) of \(E_{P}\) stored after \(e_{Pi}\), it inserts \(\langle e_{Pi},\,e_{Pj}\rangle \) into EH if \(e\textit{CPD}_{k}<emindist(e_{Pi},\,e_{Pj})\le max\textit{CPD}_{k}\), or inserts \(\langle e_{Pi},\,e_{Pj}\rangle \) into \(S\), \(H\), or CH based on \(mindist(e_{Pi},\,e_{Pj})\). If \(E_{P}\) points to a leaf node, then, to avoid duplicated object pair accesses, for each sub-entry \(e_{Pi}\) of \(E_{P}\) and each \(e_{Pj}\) of \(E_{P}\) stored after \(e_{Pi}\), it adds \(\langle e_{Pi},\,e_{Pj}\rangle \) to EH if \(e\textit{CPD}_{k}<emindist(e_{Pi},\,e_{Pj})\le max\textit{CPD}_{k}\) (line 29), or updates \(S_{R}\) and \(max\textit{CPD}_{k}\) if \(mindist(e_{Pi},\,e_{Pj})\le max\textit{CPD}_{k}\) (line 31).

figure f
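To make the duplicate-avoidance strategy of EHS-PEP concrete, here is a minimal Python sketch of its loop structure when \(E_P = E_Q\) (the children field and the emit callback are our assumptions, not the paper's pseudo-code):

```python
def expand_self_pair(EP, emit):
    """Duplicate-avoidance loop of EHS-PEP (sketch): when both sides of the
    pair point to the same node, each sub-entry is paired with itself and
    only with sub-entries stored after it, so every unordered pair
    <e_i, e_j> is generated exactly once."""
    entries = EP.children                  # sub-entries in storage order
    for i, ei in enumerate(entries):
        emit(ei, ei)                       # identical pair, expanded recursively
        for ej in entries[i + 1:]:
            emit(ei, ej)                   # each unordered pair visited once
```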

We illustrate EHS using the SM2CP (\(k=2\)) query on the object set \(O\) shown in Fig. 2a; please refer to Appendix D for details.

5.1.2 COMdnn-tree-based SMkCP algorithm

Although EHS utilizes the characteristics of \(\hbox {SM}k\hbox {CP}\) search to boost query efficiency, it follows the framework of EHM, whose performance degenerates as the overlap between the two object sets increases, as confirmed in Sect. 6.2. This is because, for two object sets with a greater overlap percentage, EHM has to visit many entry pairs with small mindist. However, \(\hbox {SM}k\hbox {CP}\) retrieval operates on one object set, i.e., the overlap percentage is 100 %, resulting in poor query performance. To this end, we introduce a variant of the COM-tree, namely the COMdnn-tree, which integrates nearest neighbor (NN) information into a COM-tree to improve the efficiency of \(\hbox {SM}k\hbox {CP}\) query processing. Note that the COMdnn-tree can support efficient \(\hbox {M}k\hbox {CP}\) and \(\hbox {SM}k\hbox {CP}\) queries simultaneously.

Figure 10 shows an example of a COMdnn-tree on the object set depicted in Fig. 2a, which stores e.dnn in each leaf or non-leaf entry \(e\) to represent the NN distance of the object, or the minimum NN distance over all the objects contained in \(ST_{e}\). For instance, \(o_{9}.dnn=r_{4}\), since the distance from \(o_{9}\) to its NN (i.e., \(o_{10}\)) equals \(r_{4}\), and \(e_{6}.dnn=r_{4}\), as \(r_{4}\) is the minimum dnn over all the objects \(o_{6}\), \(o_{7}\), \(o_{8}\), and \(o_{9}\) contained in \(ST_{e_6}\). Note that the number associated with every entry denotes e.dnn, which is obtained and stored during the construction of the COMdnn-tree. In particular, when an object \(o\) is inserted into the COMdnn-tree, an NN query is performed to get its dnn value, and a reverse NN query is also conducted to find the objects whose dnn values could be affected by \(o\).

Fig. 10
figure 10

Example of a COMdnn-tree
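The dnn bookkeeping during insertion can be sketched as follows; the helpers nn_query, rnn_query, and propagate_min_dnn are hypothetical stand-ins for the NN and reverse NN queries mentioned above:

```python
def insert_with_dnn(tree, o):
    """Sketch of dnn maintenance when inserting an object into a
    COMdnn-tree (assumed helper methods; not the paper's pseudo-code)."""
    _, d_nn = tree.nn_query(o)            # o's NN distance becomes o.dnn
    o.dnn = d_nn
    for p in tree.rnn_query(o):           # objects whose NN may now be o
        p.dnn = min(p.dnn, tree.distance(p, o))
    tree.insert(o)
    tree.propagate_min_dnn(o)             # refresh min-dnn on the insertion path
```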

Lemma 8

Given an object set P in a generic metric space, if \(\langle p_i, p_j\rangle \) is contained in the SMkCP query result, then \(p_i.dnn\) and \(p_j.dnn\) are not larger than \(p_{2k}.dnn\), where \(p_i.dnn\) denotes \(p_i\)'s NN distance in P, and \(p_{2k}.dnn\) represents the 2k-th smallest NN distance among all objects in P sorted by dnn.

Proof

To prove this lemma, we first show that \(max\textit{CPD}_{k}\) can be set to \(p_{2k}.dnn\). Since the objects in \(P\) are sorted according to their dnn values, there exist \(2k\) object pairs \(\langle p_{i},\,NN(p_{i})\rangle \,(1\le i\le 2k)\) with their distances bounded by \(p_{2k}.dnn\). As \(p_{i}\) and \(NN(p_{i})\) might be mutual NNs, i.e., \(\langle p_{i},\,NN(p_{i})\rangle \) and \(\langle NN(p_{i}),\,p_{i}\rangle \) may represent the same object pair, there exist at least \(k\) different object pairs with their distances bounded by \(p_{2k}.dnn\). Thus, \(max\textit{CPD}_{k}\) can be set to \(p_{2k}.dnn\). If \(\langle p_{i},\,p_{j}\rangle \) is contained in the \(\hbox {SM}k\hbox {CP}\) query result, then \(d(p_{i},\,p_{j})\le \textit{CPD}_{k} \le max\textit{CPD}_{k} = p_{2k}.dnn\). By the definition of NN, \(p_{i}.dnn \le d(p_{i},\,p_{j})\) and \(p_{j}.dnn\le d(p_{i},\,p_{j})\); hence, both \(p_{i}.dnn\) and \(p_{j}.dnn\) are not larger than \(p_{2k}.dnn\), which completes the proof. \(\square \)

Based on Lemma 8, we present the COMdnn-tree-based SMkCP Algorithm (MSA). In the first phase, by traversing the COMdnn-tree, MSA obtains the candidate object set CH, in which the dnn of every object is not larger than \(p_{2k}.dnn\). In the second phase, the algorithm verifies the object pairs among the candidates in order, to get the final result set.

figure g

Algorithm 7 depicts the pseudo-code of MSA. It takes the COMdnn-tree \(M_{P}\) as input, and outputs the result set \(S_{R}\) of an \(\hbox {SM}k\hbox {CP}\) query. First, it initializes \(max\textit{CPD}_{k}\) to infinity, together with two min-heaps \(H\) and CH (line 1). Then, MSA inserts the root entries \(E_{P}\) of \(M_{P}\) with \(E_{P}.dnn \le max\textit{CPD}_{k}\) into \(H\), and performs a while-loop until \(H\) is empty (lines 3–9). Each time, it pops the top entry \(E_{P}\) of \(H\) and verifies whether the early termination condition is satisfied. If \(E_{P}\) points to a non-leaf node, MSA inserts all the qualified sub-entries of \(E_{P}\) into \(H\). Otherwise, i.e., \(E_{P}\) points to a leaf node, MSA inserts all the qualified sub-entries of \(E_{P}\) into CH and updates \(max\textit{CPD}_{k}\) using the \(2k\)-th dnn in CH. Thereafter, for each object \(p_{i}\) in CH, if \(p_{i}.dnn>max\textit{CPD}_{k}\), the algorithm terminates and returns \(S_{R}\) (line 11); otherwise, to avoid duplicated distance computations, for each object \(p_{j}\) already visited (i.e., stored before \(p_{i}\) in CH), if \(d(p_{i},\,p_{j})\le max\textit{CPD}_{k}\), it updates \(S_{R}\) and \(max\textit{CPD}_{k}\) (lines 13–14). Finally, the result set \(S_{R}\) is returned (line 15).
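The two phases of MSA can be sketched in Python as follows (a sketch of the control flow under assumed entry fields .dnn, .is_leaf, and .children, not the exact pseudo-code of Algorithm 7):

```python
import heapq
from itertools import count

def msa(root_entries, k, distance):
    max_cpd_k = float('inf')
    tie = count()                          # tie-breaker for heap ordering
    H = [(e.dnn, next(tie), e) for e in root_entries]
    heapq.heapify(H)
    cand = []                              # candidate objects, kept sorted by dnn
    while H:                               # phase 1: collect candidates
        dnn, _, e = heapq.heappop(H)
        if dnn > max_cpd_k:                # early termination (Lemma 8)
            break
        if e.is_leaf:
            cand += [o for o in e.children if o.dnn <= max_cpd_k]
            cand.sort(key=lambda o: o.dnn)
            if len(cand) >= 2 * k:         # maxCPD_k <- 2k-th smallest dnn
                max_cpd_k = cand[2 * k - 1].dnn
        else:
            for c in e.children:
                if c.dnn <= max_cpd_k:
                    heapq.heappush(H, (c.dnn, next(tie), c))
    result = []                            # phase 2: verify candidate pairs
    for i, pi in enumerate(cand):
        if pi.dnn > max_cpd_k:             # remaining candidates cannot qualify
            break
        for pj in cand[:i]:                # only objects stored before pi
            d = distance(pi, pj)
            if d <= max_cpd_k:
                heapq.heappush(result, (-d, next(tie), (pi, pj)))
                if len(result) > k:
                    heapq.heappop(result)  # drop the farthest pair
                    max_cpd_k = -result[0][0]
    return [(pair, -nd) for nd, _, pair in sorted(result, reverse=True)]
```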

We illustrate MSA using the SM2CP (\(k=2\)) query on the object set \(O\) shown in Fig. 2a; please refer to Appendix E for details.

5.1.3 Discussion

In this section, we first present our method to compute the estimation value \(e\textit{CPD}_{k}\), which is needed by EHS; we then analyze the I/O complexities of the two algorithms, and finally discuss how to further improve the efficiency of MSA.

For EHS, a challenging issue is to estimate \(\textit{CPD}_{k}\) accurately, because the estimate affects the efficiency of the algorithm. As with EHM, to obtain \(e\textit{CPD}_{k}\), we can utilize the distance distribution over the metric dataset \(P\), defined as:

$$\begin{aligned} F(r) = \mathbf{Pr}\{d(p_{i}, p_{j}) \le r\} \end{aligned}$$

(10)

where \(p_{i}\) and \(p_{j}\) are two random objects in \(P\). Based on the distance distribution \(F\), \(\textit{CPD}_{k}\) can be estimated as the minimal \(r\) such that at least \(k\) object pairs have their distances bounded by \(r\), i.e.,

$$\begin{aligned} e\textit{CPD}_{k} = \min \{r \mid (\vert P\vert ^{2} - \vert P\vert )/2 \times F(r) \ge k \} \end{aligned}$$

(11)
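For instance, with the histogram approximation of \(F\) described in Sect. 6.1, Eq. (11) can be evaluated by a linear scan over the bin boundaries; a minimal sketch (names ours):

```python
def estimate_self_cpd_k(F, width, num_bins, set_size, k):
    """Eq. (11): smallest r, scanned at histogram bin boundaries, such that
    the expected number of intra-set pairs within distance r reaches k."""
    num_pairs = (set_size * set_size - set_size) / 2
    for b in range(num_bins + 1):
        r = b * width
        if num_pairs * F(r) >= k:
            return r
    return num_bins * width               # fallback: maximum sampled distance
```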

Next, we analyze the I/O costs of the algorithms designed for \(\hbox {SM}k\hbox {CP}\) retrieval. Let \(\vert M_{P}\vert \) be the total number of nodes in \(M_{P}\), \(\vert M_{P}\vert _{i}\) be the number of nodes at level \(i\) of \(M_{P}\), and \(L\) be the height of \(M_{P}\).

Lemma 9

The I/O costs of EHS and MSA algorithms are \(\hbox {O}(\sum \nolimits _{i=0}^{L-1}{|{{M}_{P}}|_{i}^{2}})\) and \(\hbox {O}(|M_P|)\), respectively.

Proof

Since EHS follows the framework of EHM, which needs to visit the intermediate entry pairs \(\langle E_P, E_Q \rangle \) with \(mindist(E_P,E_Q)\le \textit{CPD}_k\) according to Lemma 6, it has to access all intermediate entry pairs at the same level of \(M_{P}\) in the worst case. Hence, the I/O cost of EHS is \(O(\sum \nolimits _{i=0}^{L-1}{|{{M}_{P}}|_{i}^{2}})\). In contrast, according to Lemma 8, MSA only needs to traverse \(M_{P}\) once in the worst case, to obtain all candidates with their dnns within \(p_{2k}.dnn\). Thus, the I/O cost of MSA is \(O(\vert M_{P}\vert )\). \(\square \)

According to Lemma 9, the I/O cost of MSA is much smaller than that of EHS, which is also verified in Sect. 6.5. This is further demonstrated by Examples 4 and 5 (see Appendices D and E), in which MSA needs only 4 node accesses, whereas EHS needs 13.

In the first phase, MSA traverses \(M_{P}\) to get \((2k+n)\) candidate objects whose dnns are not larger than \(p_{2k}.dnn\). Note that \(n\) accounts for distance ties, i.e., there may exist more than one object whose dnn equals \(p_{2k}.dnn\). Then, in the second phase, MSA verifies all the object pairs among the candidate objects in order. Hence, the CPU cost (in terms of the number of distance computations) of MSA is \(\hbox {O}(k^{2})\). Thus, for larger \(k\), especially when \(k\) approaches the object set size \(\vert P\vert \), MSA degenerates to the naive solution for \(\hbox {SM}k\hbox {CP}\) retrieval, which compares every object pair to obtain the final result set.

To mitigate the quadratic verification cost, we can partition the verified objects into disjoint groups. Similar to the intermediate entry defined in the M-tree, each group \(G\) is represented by a routing object \(G.RO\) with a fixed radius \(G.r\). Every visited object \(o\) is inserted into a group if \(d(o,\,G.RO)\le G.r\). Note that, for an object \(o\), more than one group \(G_{i}\) may satisfy the condition \(d(o,\,G_{i}.RO)\le G_{i}.r\). Here, in order to obtain disjoint groups, i.e., \(G_{i} \cap G_{j}=\emptyset \,(i\ne j)\), we choose the first group that satisfies the condition. If \(o\) cannot be inserted into any group, i.e., no group satisfies the condition, we create a new group \(G\) with \(G.RO=o\) and a fixed radius \(G.r\). Consider the SM2CP query shown in Fig. 2a, and assume that \(r_{1}\) is chosen as the fixed radius. During the first phase, MSA obtains the candidate object set \(\{o_{3},\,o_{4},\,o_{1},\,o_{2}\}\). Figure 11a depicts the group \(G_{1}\) after objects \(o_{3}\) and \(o_{4}\) are visited. Next, when processing object \(o_{1}\), as \(d(o_{1},\,G_{1})>r_{1}\), we create a new group \(G_{2}\) with \(G_{2}.RO=o_{1}\) and \(G_{2}.r=r_{1}\), as illustrated in Fig. 11b.

Fig. 11
figure 11

Example of groups for MSA. a After processing \(o_3\) and \(o_4\). b After processing \(o_3\), \(o_4\), and \(o_1\)

By utilizing the grouping technique with a fixed group radius \(r\), MSA can be adapted into \(r\)-MSA, which avoids considerable distance computations during verification. In particular, for an object to be verified, \(r\)-MSA first compares it against the groups, instead of against every object contained in each group. According to Rule 1 presented in Sect. 3.3, if \(mindist(o,\,G)>max\textit{CPD}_{k}\), then \(\langle o,\,G\rangle \) can be pruned; in other words, we avoid evaluating every object contained in \(G\) for \(o\). Consider the example shown in Fig. 11 again. For the object \(o_{2}\) to be verified, \(\langle o_{2},\,G_{1}\rangle \) can be pruned away since \(mindist(o_{2},\,G_{1})>max\textit{CPD}_{k}\,(=r_1)\).
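A minimal Python sketch of this grouping and group-level pruning (the candidate order and the fixed radius r are inputs; the names are ours):

```python
def build_groups(candidates, r, distance):
    """Fixed-radius grouping of r-MSA (sketch): each visited object joins
    the first group whose routing object lies within radius r; otherwise it
    opens a new group, which keeps the groups disjoint."""
    groups = []                            # list of (routing_object, members)
    for o in candidates:
        for ro, members in groups:
            if distance(o, ro) <= r:       # first matching group wins
                members.append(o)
                break
        else:
            groups.append((o, [o]))        # o becomes a new routing object
    return groups


def group_pruned(o, ro, r, max_cpd_k, distance):
    """Rule 1 at group granularity: if mindist(o, G) > maxCPD_k, no object
    of G can pair with o, so the whole group is skipped."""
    return distance(o, ro) - r > max_cpd_k
```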

5.2 Approximate \(\hbox {M}k\hbox {CP}\) search

For \(\hbox {M}k\hbox {CP}\) retrieval, although we can utilize the pruning heuristics and the aggressive pruning and compensation technique to accelerate query processing, efficiency remains a problem, since the query cost is quadratic in the worst case, especially for a high degree of overlap between the two object sets. Hence, it makes sense to study the Approximate Metric kCP \((\hbox {AM}k\hbox {CP})\) query, which trades the quality of the result for search time.

Definition 6

(\({ AM}k{ CP}\) Search). Given two object sets \(P\) and \(Q\) in a generic metric space, and an integer \(k(1\le k\le \vert P\vert \times \vert Q\vert )\), an Approximate \(\hbox {M}k\hbox {CP}\,(\hbox {AM}k\hbox {CP})\) query finds \(k\) ordered different object pairs from \(P \times Q\), i.e., \(AMkCP(P,\,Q)=\{\langle p_{1},\,q_{1}\rangle ,\,\langle p_{2},q_{2}\rangle ,\ldots ,\langle p_{k}, q_{k}\rangle \vert p_{1}, p_{2},\ldots , p_{k} \in P,\, q_{1},\,q_{2},\ldots ,q_{k} \in Q\), \(\langle p_{i},\,q_{i}\rangle \ne \langle p_{j},\,q_{j}\rangle \), \(i \ne j,\,1 \le i,\,j \le k\), and \(\forall \langle p_{i}',\,q_{i}'\rangle \in MkCP(P,\,Q)\,\hbox {s.t.}\,d(p_{i},\,q_{i})\ge d(p_{i}',\, q_{i}')\}\).

Consider the example depicted in Fig. 1 again. An AM2CP \((k=2)\) query may return the result set \(\{\langle p_{2},\,q_{1}\rangle ,\,\langle p_{2},\,q_{3}\rangle \}\), which is different from the result set \(\{\langle p_{2},\,q_{1}\rangle ,\,\langle p_{2},\,q_{2}\rangle \}\) returned by the M2CP query.

5.2.1 Estimation-based hybrid AMkCP algorithm

To trade some precision for improved efficiency, we can utilize the framework of EHM (which performs the best for \(\hbox {M}k\hbox {CP}\) search, as verified in Sect. 6.2) by integrating a popular approximation technique, the \(\varepsilon \)-approximate technique [13, 20, 53]. Given a real \(\varepsilon \,(\ge 0)\) as the maximum relative distance error to be tolerated, for the \(i\)th answer object pair \(\langle p_{i},\,q_{i}\rangle \) contained in \(\hbox {AM}k\hbox {CP}(P,\,Q)\) and the \(i\)th answer object pair \(\langle p_{i}',\,q_{i}'\rangle \) included in \(\hbox {M}k\hbox {CP}(P,\,Q)\), the \(\varepsilon \)-approximate technique guarantees that \((d(p_{i},\,q_{i})-d(p_{i}',\,q_{i}'))/d(p_{i}',\,q_{i}')\le \varepsilon \) holds.

However, since \(\varepsilon \,(\ge 0)\) is unbounded, it is not easy for users to adjust the quality of the query result. Toward this, in this paper, we choose the \(\alpha \)-allowance technique [20] with a bounded parameter \(\alpha \,(0<\alpha \le 1)\), which can be transformed into the \(\varepsilon \)-approximate technique via \(\alpha =1/(1+\varepsilon )\). Below, we propose an approximate pruning rule based on the \(\alpha \)-allowance technique.

Rule 3

Given two leaf or non-leaf entries \(E_{P}\) and \(E_{Q}\), and a positive real \(\alpha (0<\alpha \le 1)\), if \(\hbox {emindist}(E_{P},\,E_{Q})\) or mindist \((E_{P},\,E_{Q})\) is larger than \(max\textit{CPD}_{k}\times \alpha \), then \(\langle E_{P},\,E_{Q}\rangle \) can be pruned away safely.

Proof

Given a relative distance error \(\varepsilon \) to be tolerated, if \(emindist(E_P, E_Q)\) or \(mindist(E_P,\,E_Q)\) is larger than \(max\textit{CPD}_k\times \alpha \), then, for any \(\langle p, q \rangle \,(p \in E_P,\,q \in E_Q)\), \(d(p, q)\times (1+\varepsilon )\ge mindist(E_P,\,E_Q)\) (or \(emindist(E_{P},\,E_{Q})\)) \(\times (1 + \varepsilon ) > max\textit{CPD}_k \times \alpha \times (1 + \varepsilon ) = max\textit{CPD}_k \ge \textit{CPD}_k^\prime \), since \(\alpha = 1/(1+\varepsilon )\), where \(\textit{CPD}_k^\prime \) denotes the exact \(k\)th closest pair distance. Therefore, \(\langle p, q \rangle \) cannot be an answer object pair within the tolerated error, due to \(d(p, q)\times (1+\varepsilon )>\textit{CPD}_k^\prime \), and \(\langle E_P,E_Q\rangle \) can be discarded safely. \(\square \)

As depicted in Fig. 6, assume that \(max\textit{CPD}_{k}=40\) and \(\alpha =0.5\); then \(\langle E_{P2}, E_{Q3}\rangle \), which cannot be pruned by Rules 1 and 2, can be discarded by Rule 3.
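Rule 3 reduces to a one-line predicate; a sketch with our own parameter names:

```python
def rule3_pruned(lower_bound_dist, max_cpd_k, alpha):
    """Rule 3 (alpha-allowance): prune the entry pair once its emindist or
    mindist exceeds maxCPD_k * alpha, where alpha = 1 / (1 + epsilon)."""
    return lower_bound_dist > max_cpd_k * alpha

# With maxCPD_k = 40 and alpha = 0.5 as in the Fig. 6 example, any entry
# pair whose lower-bound distance exceeds 20 is discarded.
assert rule3_pruned(25, 40, 0.5)
```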

The pseudo-code of the Estimation-based Hybrid AMkCP algorithm (EHA) is similar to that of EHM and is thus omitted. It takes as inputs two COM-trees \(M_{P}\) and \(M_{Q}\), an estimated value \(e\textit{CPD}_{k}\), and a real \(\alpha \,(0<\alpha \le 1)\), and outputs the result set \(S_{R}\) of an \(\hbox {AM}k\hbox {CP}\) query. The only difference is that EHA utilizes Rule 3 to prune leaf or non-leaf entry pairs, while EHM uses Rules 1 and 2.

5.2.2 GMdnn-tree-based AMkCP algorithm

As pointed out in [20, 53], although the \(\alpha \)-allowance (\(\varepsilon \)-approximate) technique utilized by EHA can achieve a high-quality result set, the query efficiency does not improve much, which is also verified in Sect. 6.6. Motivated by this, we present the GMdnn-tree-based \(\hbox {AM}k\hbox {CP}\) Algorithm (GMA), which employs the grouping and \(N\)-consider techniques to control the trade-off between the query cost and the accuracy of the query result.

The GMdnn-tree is a variant of the COMdnn-tree that partitions the objects in each leaf node into disjoint groups. In particular, if two objects \(p\) and \(q\) are similar, i.e., the distance \(d(p,\,q)\) between them is small, they can form a group. This is because, due to the triangle inequality, the difference between the distances from \(p\) and \(q\) to any other object \(o\), i.e., \(|d(p,\,o)-d(q,\,o)|\), is small if \(p\) and \(q\) are similar, as \(|d(o,\,p)-d(o,\,q)|\le d(p,\,q)\). Figure 12 shows two GMdnn-trees \(M_{P}\) and \(M_{Q}\) on the object sets \(P\) and \(Q\) (depicted in Fig. 7a), respectively. For instance, the objects \(p_{3},\,p_{4}\), and \(p_{5}\) contained in the leaf node pointed to by \(E_{P5}\) are partitioned into two disjoint groups, i.e., \(g_{P2}=\{p_{3},\,p_{5}\}\) and \(g_{P3}=\{p_{4}\}\). Note that \(g_{P3}\) contains only one object, since no other object in this leaf node can join it. As two objects in the same group are similar, one of them can be used to represent the whole group; for simplicity, the first object in each group is chosen as the representative object. Also note that the GMdnn-tree can support efficient \(\hbox {M}k\hbox {CP}\), \(\hbox {SM}k\hbox {CP}\), and \(\hbox {AM}k\hbox {CP}\) queries simultaneously.

Fig. 12
figure 12

Example of GMdnn-trees. a GMdnn-tree \(M_{P}\) on \(P\). b GMdnn-tree \(M_{Q}\) on \(Q\)

With the help of the GMdnn-tree, we can improve query efficiency significantly, as only the representative object of each group, instead of the whole group, is verified. Consider the example illustrated in Fig. 12 again, and assume that \(p_{1}\) and \(q_{2}\) are the representative objects of \(g_{P1}\) and \(g_{Q1}\), respectively. All the object pairs between \(g_{P1}=\{p_{1},\,p_{2}\}\) and \(g_{Q1}=\{q_{2},\,q_{3}\}\) can be estimated using \(\langle p_{1},\,q_{2}\rangle \), i.e., \(mindist(g_{P1},\,g_{Q1})=d(p_{1},\,q_{2})\). In other words, if \(\langle p_{1},\,q_{2}\rangle \) is pruned by Rule 1 or 2, the other object pairs \(\langle p_{1},\,q_{3}\rangle \), \(\langle p_{2},\,q_{2}\rangle \), and \(\langle p_{2},\,q_{3}\rangle \) are also pruned; otherwise, if \(\langle p_{1},\,q_{2}\rangle \) cannot be discarded by Rules 1 and 2, the other object pairs between the two groups are not pruned either.

A challenging issue in building the GMdnn-tree is how to group objects efficiently. A simple but effective method is to choose a faraway object \(o\), and then partition the objects into disjoint groups in order of their distances to \(o\). Note that a faraway object \(o\) is needed because the similarity of two objects \(p_{i}\) and \(p_{j}\) can be estimated well using the difference between \(d(o,\,p_{i})\) and \(d(o,\,p_{j})\) [7]. In this paper, we choose the faraway object among all the routing objects stored in the GMdnn-tree. Take the example shown in Fig. 7 again. When grouping the objects in the leaf node \(E_{P5}\), we can choose the routing object \(p_{8}\) of \(E_{P7}\), and sort the objects in \(E_{P5}\) in ascending order of their distances to \(p_{8}\), i.e., \(p_{3},\,p_{5}\), and \(p_{4}\). Hence, \(E_{P5}\) can be partitioned into two groups \(g_{P2}=\{p_{3},\,p_{5}\}\) and \(g_{P3}=\{p_{4}\}\).

Since the GMdnn-tree only achieves approximation at the leaf node level, the \(N\)-consider technique [20] can be utilized at the intermediate node level to further boost query efficiency. In particular, when visiting an entry pair \(\langle E_{P},\,E_{Q}\rangle \) that points to intermediate nodes, we only consider a fraction \(N\) of all the sub-entry pairs \(\langle e_{Pi},\,e_{Qj}\rangle \,(e_{Pi} \in E_{P},\,e_{Qj}\in E_{Q})\).

Since the framework of GMA is similar to that of EHM, its pseudo-code is omitted here. The only difference between GMA and EHM is the processing of an intermediate entry pair \(\langle E_{P},\,E_{Q}\rangle \). If \(E_{P}\) and \(E_{Q}\) are intermediate entries pointing to non-leaf nodes, GMA uniformly chooses \(\sqrt{N}\times E_{P}.num\) sub-entries of \(E_{P}\) and \(\sqrt{N}\times E_{Q}.num\) sub-entries of \(E_{Q}\) for processing, in order to apply the \(N\)-consider technique. If \(E_{P}\) and \(E_{Q}\) are intermediate entries pointing to leaf nodes, it verifies and prunes groups instead of every object contained in them.
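The N-consider selection can be sketched as follows (interpreting "uniformly chooses" as a uniform random sample, which is our assumption):

```python
import math
import random

def n_consider(entries, N, rng=random):
    """Keep a uniform sample of sqrt(N) * |entries| sub-entries per side,
    so about a fraction N of the sub-entry pairs <e_Pi, e_Qj> is examined."""
    if not entries:
        return []
    keep = max(1, math.ceil(math.sqrt(N) * len(entries)))
    return rng.sample(entries, keep)
```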

5.2.3 Discussion

To evaluate an approximate algorithm, one should consider not only the improvement in efficiency but also the quality of the approximation. The quality of approximation can be measured using the precision, i.e., the percentage of the \(k\) items of the approximate result that are also contained in the exact result set.

Definition 7

(Precision). Given two object sets \(P\) and \(Q\) in a generic metric space, and an integer \(k\,(1\le k\le \vert P\vert \times \vert Q\vert )\), assume that \(\langle p_{i},\,q_{i}\rangle \) and \(\langle p_{i}',\,q_{i}' \rangle \) are the \(i\)th items contained in \(AMkCP(P,\,Q)\) and \(MkCP(P,\,Q)\), respectively. Then

$$\begin{aligned} precision=\frac{1}{k}\sum \limits _{i=1}^{k}\left\{ \begin{array}{ll} 1 &{} \quad d(p_{i},q_{i})\le d(p_{k}',q_{k}') \\ 0 &{} \quad \text {otherwise} \\ \end{array}\right. \end{aligned}$$

Note that we use the distance to determine whether \(\langle p_{i},\,q_{i}\rangle \) is contained in the exact result set \(MkCP(P,\,Q)\) because, as analyzed in Sect. 3.1, \(MkCP(P,\,Q)\) may not be unique due to distance ties, and we randomly choose object pairs when a distance tie occurs. When precision = 1, the approximate result equals the exact one; in the worst case, precision tends to 0.
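Definition 7 translates directly into the following check (a sketch; the exact \(k\)th closest-pair distance \(d(p_{k}',\,q_{k}')\) is assumed to be known):

```python
def precision(approx_pairs, exact_kth_distance, distance):
    """Fraction of approximate answer pairs whose distance does not exceed
    the exact k-th closest-pair distance; distances are compared instead of
    identities because of possible ties (Definition 7)."""
    hits = sum(1 for (p, q) in approx_pairs
               if distance(p, q) <= exact_kth_distance)
    return hits / len(approx_pairs)
```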

First, we derive the precision of the approximate algorithm EHA, which only utilizes the \(\alpha \)-allowance technique. To obtain precision(EHA), we need to determine under what conditions an object pair \((\in MkCP(P,\,Q))\) is certainly contained in the \(AMkCP(P,\,Q)\) returned by EHA, as stated below.

Lemma 10

Given two object sets P and Q in a generic metric space, and a real value \(\alpha \,(0<\alpha \le 1)\), assume that \(\langle p_{i}',\,q_{i}'\rangle \) is the \(i\)th item contained in \(\hbox {M}k\hbox {CP}(P,\,Q)\). If \(d(p_{i}',\,q_{i}')\le \alpha \times d(p_{k}',\,q_{k}')\), then \(\langle p_{i}',\,q_{i}' \rangle \) is also contained in the \(\hbox {AM}k\hbox {CP}(P,\,Q)\) returned by EHA.

Proof

Assume, to the contrary, that \(\langle p_{i}',\,q_{i}'\rangle \) is not included in \(AMkCP(P,\,Q)\), i.e., \(\langle p_{i}',\,q_{i}'\rangle \) is pruned by Rule 3. For any leaf or non-leaf entry pair \(\langle E_{P},\,E_{Q}\rangle \) that is or contains \(\langle p_{i}',\,q_{i}' \rangle \), \(emindist(E_{P},\,E_{Q}) \le mindist(E_{P},\,E_{Q})\le d(p_{i}',\,q_{i}') \le \alpha \times d(p_{k}',\,q_{k}')\le \alpha \times max\textit{CPD}_{k}\), which contradicts the pruning condition of Rule 3. This completes the proof. \(\square \)

According to Lemma 10, precision(EHA) is bounded by:

$$\begin{aligned} precision(EHA)&\ge \frac{1}{k}\sum \limits _{i=1}^{k}\left\{ \begin{array}{ll} 1 &{} \quad d(p_{i}',q_{i}')\le \alpha \times d(p_{k}',q_{k}') \\ 0 &{} \quad \text {otherwise} \\ \end{array}\right. \\&\ge \frac{F(\alpha \times d(p_{k}',q_{k}'))}{F(d(p_{k}',q_{k}'))} \end{aligned}$$

where \(F(\cdot )\) denotes the distance distribution between the two object sets \(P\) and \(Q\). If \(F(\cdot )\) follows the uniform distribution, then precision(EHA) is bounded below by \(\alpha \).

Next, we derive the lower bound of precision for GMA. For the grouping technique used at the leaf node level, since one verified object pair represents four object pairs, the precision of an algorithm using the grouping technique is bounded below by 0.25. For each application of the \(N\)-consider technique at an intermediate node level, the probability of not discarding a real answer object pair is \(N\). Hence, the precision of an algorithm using the \(N\)-consider technique equals \(N^{L-2}\), where \(L\) denotes the height of the tree index. As the two techniques are applied independently in GMA, \(precision(GMA)\ge 0.25\times N^{L-2}\).
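As an illustrative instance (the values are assumed, not taken from the experiments): for a GMdnn-tree of height \(L=4\) and \(N=0.8\),

$$\begin{aligned} precision(GMA) \ge 0.25 \times N^{L-2} = 0.25 \times 0.8^{2} = 0.16 \end{aligned}$$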

6 Performance study

In this section, we experimentally evaluate the effectiveness of our developed pruning rules, the accuracy of the cost models, and the performance of the algorithms for \(\hbox {M}k\hbox {CP}\) retrieval and its variants, using both real and synthetic datasets. The detailed experimental settings are described in Sect. 6.1. Five sets of experiments are conducted. The first set verifies the efficiency of our algorithms compared with the existing state-of-the-art (in-memory) \(\hbox {M}k\hbox {CP}\) search algorithms and \(k\hbox {CP}\) query algorithms in Euclidean spaces, as presented in Sect. 6.2. The second set evaluates the effectiveness of the pruning rules, as reported in Sect. 6.3. The third set demonstrates the accuracy of the cost models developed for \(\hbox {M}k\hbox {CP}\) retrieval, as described in Sect. 6.4. Sections 6.5 and 6.6 present the last two sets of experiments, which evaluate the performance for the two \(\hbox {M}k\hbox {CP}\) query variants, i.e., \(\hbox {SM}k\hbox {CP}\) search and \(\hbox {AM}k\hbox {CP}\) retrieval, respectively.

6.1 Experimental setup

We deploy four real datasets: CA, SF, Color, and NASA. CA and SF represent the locations in California and San Francisco, respectively. Color contains the first four dimensions of color histograms extracted from an image database. NASA is a set of feature vectors made by NASA. Following the experimental settings in [46], we generate a Signature dataset, in which each object is a string of 64 English letters. Since \(\hbox {M}k\hbox {CP}\) retrieval involves two object sets, we combine the two GIS datasets CA and SF, and employ \(L_{1}\)-norm to simulate the shortest road network distance. For the datasets Color, NASA, and Signature, we divide each into two datasets with the same cardinality [18], where \(L_{2}\)-norm and edit distance are utilized to measure the distances. Synthetic datasets following Uniform and Gaussian distributions are also created, with \(L_{2}\)-norm employed. Table 2 lists the statistics of the datasets used in our experiments. The Uniform and Gaussian datasets with dimensionality 16, and NASA, are indexed using a page size of 10 KB, whereas the other datasets are indexed using a page size of 4 KB. The distance distribution \(F\) for every real or synthetic dataset is obtained by sampling, and is approximated by an equi-width histogram with 20,000 bins, separately storing the values of \(F\)(1), \(F\)(2), and so on.

Table 2 Statistics of the datasets used

We investigate the performance of the proposed algorithms under various parameters, which are summarized in Table 3, where bold denotes the default values. Note that, in each experiment, only one factor varies, whereas the others are fixed at their default values. The main performance metrics include the total query cost (i.e., the sum of the I/O time and the CPU time, where the I/O time is computed by charging 10 ms for each page fault, as in [11]), the selectivity (defined in Sect. 4.4), and the number of node accesses (NA). All algorithms were implemented in C++, and all experiments were conducted on an Intel Core 2 Duo 2.93 GHz PC with 3 GB RAM.

Table 3 Parameter ranges and default values

6.2 Results on \(\hbox {M}k\hbox {CP}\) queries

The first set of experiments evaluates the performance of our presented algorithms (namely RMA, IMA, and EHM) in answering \(\hbox {M}k\hbox {CP}\) search, compared with existing state-of-the-art \(\hbox {M}k\hbox {CP}\) and Euclidean \(k\hbox {CP}\) algorithms. We study the influence of several parameters, including (1) the value of \(k\), i.e., the number of closest pairs required, (2) the cardinalities of datasets, (3) the overlap between two datasets, (4) the ratio of \(e\textit{CPD}_{k}/\textit{CPD}_{k}\), and (5) the buffer size.

Table 4 Comparisons among EHM, PSI, and EPM on Color
Table 5 Comparisons among EHM, AMP and LTC

Effect of \(k\). First, we investigate the impact of \(k\) on the efficiency of the algorithms, using real and synthetic datasets. The results are depicted in Fig. 13, where abbreviations of the algorithms (R for RMA, I for IMA, and E for EHM) are shown on the top of each column. The first observation is that the query cost of the algorithms increases with the growth of \(k\): the more closest object pairs we need, the more entry pairs we have to evaluate. The second observation is that EHM performs the best, as it combines the depth-first and best-first traversal paradigms to reduce the I/O cost, and utilizes the aggressive pruning and compensation technique to reduce the computational cost. Note that the CPU cost of RMA increases dramatically when \(k\) reaches \(10^{5}\). The reason is that, for larger \(k\), RMA is more likely to access unnecessary branches.

Table 6 \(\hbox {M}k\hbox {CP}\) performance versus cardinality on Uniform
Table 7 \(\hbox {M}k\hbox {CP}\) performance versus cardinality on Gaussian
Fig. 13
figure 13

\(\hbox {M}k\hbox {CP}\) search performance versus \(k\). a CA, SF. b CA, SF. c Color, Color. d Color, Color. e Signature, Signature. f Signature, Signature

Since EHM performs the best in most cases, it is used to compare against the Euclidean \(k\hbox {CP}\) query algorithm PSI [18] and EHM based on the PM-tree (EPM) [44], with the results depicted in Table 4 on the Color dataset. We observe that PSI performs better than EHM and EPM, although it needs more distance computations, especially for larger \(k\) values. This is because PSI is designed particularly for the Euclidean space, where geometric properties can be employed to accelerate search, while EHM and EPM are applicable to any metric space and aim to reduce the number of distance computations, since distance computation in a generic metric space is usually costly. The second observation is that EPM outperforms EHM in terms of the CPU cost and the selectivity, but it incurs a larger I/O cost. The reason is that EPM utilizes pivots with pre-computed distances to improve query efficiency, resulting in larger index storage and, accordingly, larger I/O overhead.

In addition, we compare our EHM algorithm with the state-of-the-art \(\hbox {M}k\hbox {CP}\) query algorithms AMP [29] and LTC [32], using high-dimensional real and synthetic datasets. The results are depicted in Table 5, where "\(\backslash \)" denotes that the NA of the corresponding algorithm is missing, because AMP and LTC are in-memory methods. The first observation is that EHM performs the best. The reason is that our approach utilizes several pruning rules based on COM-trees, and takes advantage of aggressive pruning and compensation, to improve query efficiency. Note that the selectivity approaches 1 on the Uniform and Gaussian datasets in a 16-dimensional space; in other words, \(\hbox {M}k\hbox {CP}\) query algorithms degenerate to a brute-force algorithm, which compares all the object pairs from the two datasets. Hence, in the remaining experiments of this paper, we employ 2-dimensional synthetic datasets.

Effect of cardinality. Next, we show the scalability of our algorithms by comparing against the existing \(\hbox {M}k\hbox {CP}\) search algorithms AMP [29] and LTC [32], using synthetic datasets. Tables 6 and 7 show the results as a function of \(|P|\,(=|Q|)\), on the Uniform and Gaussian datasets, respectively. Note that, in Tables 6 and 7, "\(\backslash \)" represents that the NA of the corresponding algorithm is missing, and "\(-\)" indicates that the corresponding algorithm cannot run due to memory overflow. This is because both AMP and LTC are in-memory methods, whereas our algorithms are developed on the disk-based COM-tree and load only the data needed during \(\hbox {M}k\hbox {CP}\) query processing. As expected, our algorithms perform much better than AMP and LTC. In particular, LTC is several orders of magnitude worse than the other algorithms, because the efficiency of LTC degrades as the overlap percentage of the datasets decreases; here, the overlap percentage is set to 50 % as the default for synthetic datasets. In addition, although EHM is the best in terms of selectivity, it has a larger CPU cost than RMA and IMA. The reason is that additional CPU cost is needed for EHM to apply the aggressive pruning and compensation technique to further reduce the number of distance computations. Specifically, EHM needs to insert object pairs into the min-heap for further verification even if they are pruned by the aggressive pruning technique (line 41 of Algorithm 4), i.e., the insertion operation incurs additional CPU cost, whereas IMA and RMA immediately compute the distances and update the result set.

Effect of overlap. Then, to explore the influence of the overlap on the algorithms, we employ the Uniform and Gaussian datasets. Figure 14 depicts the results under various overlap percentages. Notice that, in Fig. 14b, d, the total query cost for overlap = 0 is illustrated in the small sub-figure. As expected, the selectivity and the total query cost of all the algorithms ascend with the growth of the overlap percentage, because the \(\hbox {M}k\hbox {CP}\) query space grows as the overlap increases. Consistent with the observations from previous experiments, EHM again performs the best.

Fig. 14
figure 14

\(\hbox {M}k\hbox {CP}\) search performance versus \(overlap\). a Uniform, Uniform. b Uniform, Uniform. c Gaussian, Gaussian. d Gaussian, Gaussian

Effect of ratio. Next, we inspect the impact of the ratio \(e\textit{CPD}_{k}/\textit{CPD}_{k}\) on the efficiency of the algorithms. Figure 15 illustrates the results on real and synthetic datasets. The first observation is that, as \(e\textit{CPD}_{k}\) approaches the real \(\textit{CPD}_{k}\) value, the selectivity of EHM drops consistently. When \(e\textit{CPD}_{k}\) grows far beyond the real \(\textit{CPD}_{k}\) value, the selectivity of EHM converges to that of IMA. Note that, for the CA and SF datasets, the selectivity of EHM jumps from the case \(e\textit{CPD}_k/\textit{CPD}_k=1\) to the case \(e\textit{CPD}_k/\textit{CPD}_k=3\), due to the distance distributions of the datasets. Specifically, there exist almost 9000 object pairs with distances in the range [\(\textit{CPD}_k\), \(3\textit{CPD}_k\)], far more than in other distance intervals (e.g., [\(3\textit{CPD}_k\), \(5\textit{CPD}_k\)]), resulting in a clear increase in the number of distance computations. For Signature, the selectivity of EHM is almost the same as that of IMA, except for ratio = 1, because the distance function for Signature has a small domain of discrete values. The second observation is that the CPU cost of EHM decreases as the ratio grows, and eventually stays stable. This is because additional CPU cost is needed for compensation if \(e\textit{CPD}_{k}\) is too small. Nonetheless, EHM always outperforms RMA and IMA for \(e\textit{CPD}_{k}\) in a wide range of estimated values. However, in Fig. 15b, f, IMA performs worse than RMA in terms of the total query cost. The reason is that the recursion used by RMA favors LRU buffers more than the iteration used by IMA [18]. Therefore, the I/O cost of IMA is worse than that of RMA, incurring a larger total query cost for IMA.

Fig. 15
figure 15

\(\hbox {M}k\hbox {CP}\) search performance versus ratio of \(e\textit{CPD}_k/\textit{CPD}_k\). a CA, SF. b CA, SF. c Color, Color. d Color, Color. e Signature, Signature. f Signature, Signature

Effect of buffer size. Finally, we explore the influence of the buffer size on the efficiency of the algorithms, using real and synthetic datasets. Figure 16 only shows the total query cost of the algorithms under various buffer sizes, since the selectivity of the algorithms stays unchanged across buffer sizes. Note that the I/O costs of IMA and EHM are larger than that of RMA when the buffer size equals 256 pages on the Signature dataset. The reason is that recursion in depth-first traversal favors LRU buffers [18]. Although EHM visits entry pairs with \(mindist=0\) in depth-first order, it iteratively accesses entry pairs with \(mindist>0\); therefore, the I/O cost of EHM can be larger than that of RMA. Nevertheless, in most cases, the I/O cost of EHM is the lowest.

Fig. 16
figure 16

\(\hbox {M}k\hbox {CP}\) search performance versus buffer size. a CA, SF. b Color, Color. c Signature, Signature

6.3 Effectiveness of rules

The second set of experiments aims to evaluate the effectiveness of our developed pruning rules. We measure the effectiveness of a rule by how often it is successfully applied in every algorithm. For Rules 1 and 2, a successful application is counted when they prune an intermediate entry or an object. Table 8 depicts the number of times that each rule is successfully applied as a function of \(k\). Clearly, all pruning rules are applied multiple times during \(\hbox {M}k\hbox {CP}\) search, confirming their usefulness. Note that the number of pruning rule applications increases with \(k\) in most cases. This is because, as \(k\) grows, the \(\textit{CPD}_{k}\) used for pruning ascends; hence, it becomes more difficult to prune high-level entries, resulting in many pruning rule applications for the entries at low levels.

Table 8 Pruning rule effectiveness versus \(k\)

6.4 Accuracy of cost models

The third set of experiments verifies the accuracy of the cost models for \(\hbox {M}k\hbox {CP}\) retrieval. Figure 17 plots the I/O cost (in terms of NA) and the CPU cost (in terms of selectivity) with respect to \(k\). In particular, each diagram contains: (1) the actual cost of EHM, (2) the estimated cost computed by our derived cost models, and (3) the relative error between the actual and estimated values (i.e., \(\vert actual-estimated\vert /actual\)). Note that, for clarity, we only include the cost of EHM here, because it performs the best and its cost is the closest to the value derived by the cost model. It is observed that the cost model for the I/O cost is very accurate, with a maximum relative error of 6.2 %. The cost model derived for the CPU cost is less accurate but still good, with a maximal relative error of 22.5 %.

Fig. 17
figure 17

Cost model versus \(k\). a CA, SF. b CA, SF. c Color, Color. d Color, Color. e Signature, Signature. f Signature, Signature

6.5 Results on \(\hbox {SM}k\hbox {CP}\) queries

The fourth set of experiments evaluates the performance of our proposed algorithms (namely EHS and MSA) in answering \(\hbox {SM}k\hbox {CP}\) queries. We study the influence of various parameters, including (1) the grouping radius \(r\) (i.e., the percentage with respect to the maximal distance in the metric space), and (2) the value of \(k\), i.e., the number of closest pairs required.

Effect of \(r\). Figure 18 plots the performance of \(\hbox {SM}k\hbox {CP}\) search as a function of \(r\), using real and synthetic datasets, where abbreviations of the algorithms (M for MSA and \(r\) for \(r\)-MSA) are shown on the top of each column. The first observation is that \(r\)-MSA performs much better than MSA, because \(r\)-MSA utilizes the grouping technique to reduce the computational cost significantly, as discussed in Sect. 5.1.3. The second observation is that \(r\)-MSA achieves the best performance when \(r\) approaches 6 %. The reason is that, for a larger radius, the groups are not well clustered; in the worst case, all the objects are partitioned into one group, resulting in poor query efficiency. On the other hand, a smaller radius results in too many groups, and thus more additional cost is needed to process these groups.

Fig. 18
figure 18

\(\hbox {SM}k\hbox {CP}\) search performance versus \(r\). a SF. b SF. c Color. d Color. e Signature. f Signature

Effect of \(k\). Next, we inspect the impact of \(k\) on the efficiency of the \(\hbox {SM}k\hbox {CP}\) search algorithms. Figure 19 illustrates the experimental results, based on real and synthetic datasets, where abbreviations of the algorithms (S for EHS and r for \(r\)-MSA) are shown on the top of each column. The first observation is that, in most cases, \(r\)-MSA performs better than EHS, because \(r\)-MSA utilizes the NN information to boost query performance. Note that the efficiency of \(r\)-MSA degrades dramatically when \(k\) reaches \(10^{5}\), especially on the SF and Color datasets. The reason is that, as discussed in Sect. 5.1.3, the time complexity of \(r\)-MSA is \(\hbox {O}(k^{2})\), and thus its efficiency degrades as \(k\) increases, especially for larger \(k\). Although \(r\)-MSA utilizes the grouping technique to mitigate the quadratic cost, the improvement diminishes when \(k\) reaches \(10^{5}\), as the upper bound \(max\textit{CPD}_k\) used to avoid unnecessary distance computations converges slowly. However, the efficiency degradation is not obvious on Signature, due to its polarized distance distribution.

Fig. 19
figure 19

\(\hbox {SM}k\hbox {CP}\) search performance versus \(k\). a SF. b SF. c Color. d Color. e Signature. f Signature

6.6 Results on \(\hbox {AM}k\hbox {CP}\) queries

The last set of experiments verifies the performance of our proposed algorithms (namely EHA and GMA) in answering \(\hbox {AM}k\hbox {CP}\) queries. We study the influence of various parameters, including (1) the approximate parameters \(\alpha \) and \(N\), and (2) the value of \(k\), i.e., the number of closest pairs required.

Effect of \(\alpha \) and \(N\). First, we investigate the impact of the approximation parameters \(\alpha \) and \(N\) on the efficiency of the algorithms and the accuracy of the final result, compared with the Euclidean \(\hbox {A}k\hbox {CP}\) algorithm KCPRH [20], using real and synthetic datasets. In particular, \(\alpha \) is utilized by EHA, while \(N\) is used by GMA, to control the trade-off between the quality of the query result and the query efficiency. The accuracy, the improvement of selectivity (IS for short), and the improvement of total query cost (ITQC for short) are depicted in Fig. 20, where abbreviations of the algorithms (K for KCPRH, A for EHA, and G for GMA) are shown on the top of each column, and Lower_bound denotes the lower bound of precision for the GMA algorithm as derived in Sect. 5.2.3. Note that IS is measured as the ratio of the selectivity of the \(\hbox {AM}k\hbox {CP}\) query algorithm to that of the \(\hbox {M}k\hbox {CP}\) search algorithm, and ITQC is defined similarly. The first observation is that the precision of the approximate algorithms drops as \(\alpha \) and \(N\) decrease, which is consistent with the precision bounds derived in Sect. 5.2.3. The second observation is that GMA achieves better accuracy than KCPRH with a similar query efficiency improvement. Moreover, the ITQC of KCPRH is more sensitive to the approximation parameters, because KCPRH utilizes a hybrid approximation technique, so its total query cost decreases with both parameters \(N\) and \(\alpha \), while GMA is only affected by \(N\). In addition, although EHA can achieve high accuracy of the final result, its query performance does not improve much. GMA, in contrast, provides a much larger query performance improvement, is sensitive to the approximation parameter, and meanwhile attains a tolerable precision.

Fig. 20
figure 20

\(\hbox {AM}k\hbox {CP}\) search performance versus \(\alpha \) and \(N\). a CA, SF. b CA, SF. c CA, SF. d Color, Color. e Color, Color. f Color, Color. g Signature, Signature. h Signature, Signature. i Signature, Signature

Effect of \(k\). Figure 21 plots the precision and the \(\hbox {AM}k\hbox {CP}\) search performance as a function of \(k\), using both real and synthetic datasets. It is worth noting that EHM is an \(\hbox {M}k\hbox {CP}\) search algorithm, used here as a baseline for our proposed \(\hbox {AM}k\hbox {CP}\) search algorithms, to show the query performance improvement. As expected, GMA is better than EHA, since it finds a well-controlled trade-off between the query cost and the accuracy of the result. Note that the peak of the accuracy (i.e., the precision defined in Definition 7) for EHA occurs in Fig. 21a at \(k=10\), due to randomness for smaller \(k\) values. In addition, the precision of GMA is zero at \(k=1\) in Fig. 21a, since its value can be either 1 or 0 according to Definition 7, and here it falls below its lower bound when \(k=1\).

Fig. 21
figure 21

\(\hbox {AM}k\hbox {CP}\) search performance versus \(k\). a CA, SF. b CA, SF. c CA, SF. d Color, Color. e Color, Color. f Color, Color. g Signature, Signature. h Signature, Signature. i Signature, Signature

6.7 Conclusions from the experiments

From the previous exhaustive performance comparisons on both real and synthetic datasets, the most important conclusions are the following:

  • For \(\hbox {M}k\hbox {CP}\) search, among the algorithms RMA, IMA, and EHM based on M-tree, EHM performs the best in most cases. In addition, our algorithms are flexible, i.e., they can be easily extended to other metric indexes (e.g., PM-tree [44]), in order to achieve better query performance in terms of selectivity and CPU time.

  • Compared with the existing state-of-the-art \(\hbox {M}k\hbox {CP}\) search algorithms AMP [29] and LTC [32], our algorithms are more efficient and more scalable. In particular, the performance of LTC is poor when the datasets have a low overlap percentage. However, for the \(k\hbox {CP}\) query in the Euclidean space, the algorithms using R-trees [18] outperform those using M-trees and PM-trees with respect to I/O cost and CPU time.

  • For \(\hbox {SM}k\hbox {CP}\) retrieval, MSA based on the COMdnn-tree is several orders of magnitude better than EHS based on the COM-tree when \(k\) is much smaller than the cardinality of the dataset.

  • For \(\hbox {AM}k\hbox {CP}\) search, the GMdnn-tree with the \(N\)-consider technique is more suitable for finding a good balance between query performance and query result accuracy, while the \(\alpha \)-allowance technique is a good alternative when the user demands a high-quality result set regardless of the processing time.

7 Conclusions

In this paper, we explore the problem of \(\hbox {M}k\hbox {CP}\) search, which aims at efficient \(k\hbox {CP}\) query processing in general metric spaces. \(\hbox {M}k\hbox {CP}\) retrieval is not only interesting from a research point of view, but also useful in many real-life applications (e.g., GIS, data mining, recommender systems). We propose three algorithms (i.e., RMA, IMA, and EHM) that do not require the detailed representations of the objects and are applicable as long as the similarity between two objects can be evaluated and satisfies the triangle inequality. Our methods utilize \(dynamic\) metric indexes (i.e., COM-trees), employ a series of pruning rules, follow the depth-first and/or best-first traversal paradigms, and make use of the aggressive pruning and compensation technique. In addition, we develop a cost model for \(\hbox {M}k\hbox {CP}\) search and study two interesting \(\hbox {M}k\hbox {CP}\) query variants. Extensive experiments using both real and synthetic datasets demonstrate the performance of the proposed algorithms, the effectiveness of the presented pruning rules, and the accuracy of the derived cost model.

In the future, we intend to further improve the performance of our presented algorithms by developing more effective pruning rules and more efficient approaches to \(\textit{CPD}_k\) estimation. Another promising direction for future work is to consider other interesting \(k\hbox {CP}\) query variants (e.g., exclusive \(k\hbox {CP}\) queries [48] and \(k\) farthest pair queries) in general metric spaces. Finally, it would be particularly interesting to investigate \(\hbox {M}k\hbox {CP}\) retrieval in distributed environments.