1 Introduction

Consider the problem of classifying a large number of high dimensional data \(Y=\{y^1,\ldots ,y^M\}\subset {\mathcal {B}}^S\) into high dimensional classes \(X=\{x^1,\ldots ,x^N\}\subset {\mathcal {A}}^S\), given a known joint probability distribution \({\mathbb {P}}(x,y)\), where \({\mathcal {A}}\) and \({\mathcal {B}}\) are discrete alphabets. Given a point \(y\in Y\), the goal is to find the class \(x\in X\) that maximizes \({\mathbb {P}}(y\mid x)\);

$$\begin{aligned} {\arg }\max _{x\in {X}}{\mathbb {P}}(y\mid x), \end{aligned}$$
(1)

where \({\mathbb {P}}(y\mid x)\) is factorizable to i.i.d. components, i.e.,Footnote 1

$$\begin{aligned} {\mathbb {P}}(y\mid x)=\prod _{s=1}^S p(y_s\mid x_s). \end{aligned}$$
(2)

In this paper, we refer to \(X=\{x^1,\ldots ,x^N\}\) as database points and \(Y=\{y^1,\ldots ,y^M\}\) as queries. This problem has applications in various fields, including the clustering of spectra generated by mass spectrometry instruments (Frank et al. 2011). Consider the problem where there are billions of data points (mass spectra) and given a query spectrum y the goal is to find the spectrum x that maximizes known probability distribution \({\mathbb {P}}(y\mid x)\) (Aebersold and Mann 2003; Kim and Pevzner 2014). This is similar to the nearest neighbor search problem (Dasarathy and Sheela 1977; Yianilos 1993), where instead of minimizing a distance, we maximize a probability distribution. Nearest neighbour search problem has applications in machine learning (Duda et al. 1973), database querying (Guttman 1984) and computer vision (Mori et al. 2001; Shakhnarovich et al. 2003) .

A naive approach to solve this problem is to compute \({\mathbb {P}}(y\mid x)\) for each \(x\in X\), and find the maximum. The runtime for this algorithm is O(NS) which is very slow when the number of classes, N, is large.

In order to address this problem where the number of classes is massive, multiclass classification methods have been established (Bentley 1975; Friedman et al. 1977; Prabhu and Varma 2014; Choromanska and Langford 2015; Bhatia et al. 2015; Rai et al. 2015; Jain et al. 2016; Yen et al. 2016; Liu and Tsang 2017; Nam et al. 2017; Tagami 2017; Niculescu-Mizil and Abbasnejad 2017; Zhou et al. 2017). For example, in Choromanska and Langford (2015) an approach to construct a tree with logarithmic (in terms of number of classes) depth is presented and logarithmic train and test time complexity is achieved. However, these methods are limited to low-dimensional data (when the number of dimensions is less than 10), and their complexity grows exponentially with the dimension. The methods in Bentley (1975), Friedman et al. (1977) and Prabhu and Varma (2014) speed up the prediction by rejecting a significant portion of classes for each query. However, all these methods suffer from the curse of dimensionality, i.e., when the dimension of the data increases, the complexity of these methods increases linearly with the number of classes.Footnote 2 Dimension reduction can also be utilized before applying nearest neighbor search algorithms to high dimensional data (Beyer et al. 1999; Castelli et al. 2000; Min 2005; Anagnostopoulos et al. 2015). Moreover, Structure Preserving Embedding (SPE), which is a low-dimensional embedding algorithm for Euclidean space is also used as a dimension reduction tool (Shaw and Jebara 2009). Set similarity in nearest neighbor search literature is studied extensively (Charikar 2002; Christiani and Pagh 2017). In (Christiani and Pagh 2017), the authors develop a data structure that solves the problem of approximate set similarity search under Braun–Blanquet similarity \(B(x,y)=\frac{|x\cap y|}{\max (|x|,|y|)}\) in sub-quadratic query time.

A similar problem has been investigated in the field of nearest neighbor search. In this problem, given a set of points in a database, the goal is to find the point in the database that is closest to a query point. A popular approach to this problem is locality sensitive hashing (Indyk and Motwani 1998; Gionis et al. 1999). This approach solves \(\epsilon \)-approximate nearest neighbor search problem by designing a family of hashes in a way that near points are hashed to the same bucket with probability much higher than random points. In \(\epsilon \)-approximate nearest neighbor search problem, given a query point y, the goal is to find \(x\in X\) for which \(d(x,y)\le (1+\epsilon )d(x',y)\) for all \(x'\in X\) and X is the set of all feasible points (Indyk and Motwani 1998; Bawa et al. 2005). One of the most popular methods for approximate nearest neighbor search is LSH (Indyk and Motwani 1998; Gionis et al. 1999; Shrivastava and Li 2014; Andoni et al. 2017; Christiani and Pagh 2017; Rubinstein 2018). For any metric space \({\mathcal {M}}=(M,d)\), a family of hash functions \(h:{{\mathcal {M}}}\rightarrow S\) is \({\mathcal {F}}(R,cR,{p}_1,{p}_2)\)-sensitive if for any two points \(x,y \in {{\mathcal {M}}}\):

$$\begin{aligned} \text{ If }~ d(x,y)\le & {} R, ~\text{ then }~ h^{{\mathcal {A}}}(x)=h^{{\mathcal {B}}}(y)~\text{ with } \text{ probability } \text{ at } \text{ least }~ {p}_{1} . \end{aligned}$$
(3)
$$\begin{aligned} \text{ If }~ d(x,y)\ge & {} cR, ~\text{ then }~ {h^{{\mathcal {A}}}(x)}=h^{{\mathcal {B}}}(y)~\text{ with } \text{ probability } \text{ at } \text{ most }~ {p}_{2} . \end{aligned}$$
(4)

A family is interesting when \(p_1>p_2\). In the case of hamming distance, i.e., when data are in the form of d-dimensional vectors from \({\{0,1\}}^d\), the family of hashes may be defined simply as \(H=\cup _{i=1}^d\{h:\{0,1\}^{d}\rightarrow \{0,1\}\mid h(x)=x_{i}\}\) where \(x_i\) is i-th coordinate of x. From (3) and (4), we conclude that \(p_1=1-\frac{R}{d}\) and \(p_2=1-\frac{cR}{d}\).Footnote 3

In this paper, in addition to going from minimizing a distance to maximizing a probabilistic measure, our approach differs from the classic LSH in the following way. The proposed hashes in this paper are defined to be a subset of integers instead of an integer and our goal is to ensure thatFootnote 4

$$\begin{aligned} Prob(H^{{\mathcal {A}}}(x)\cap H^{{\mathcal {B}}}(y)\ne \varnothing \mid (x,y)\sim {\mathbb {P}})\ge & {} \alpha , \end{aligned}$$
(5)
$$\begin{aligned} Prob(H^{{\mathcal {A}}}(x)\cap H^{{\mathcal {B}}}(y)\ne \varnothing \mid (x,y)\sim {\mathbb {Q}})\le & {} \beta . \end{aligned}$$
(6)

In words, these hashes hash the jointly-generated pairs of points to the same buckets with probabilities higher than \(\alpha \) while hash random pairs of points to the same buckets with probabilities lower than \(\beta \). Note that, in this case for data points x and y, collision happens when we have \(H^{{\mathcal {A}}}(x)\cap H^{{\mathcal {B}}}(y)\ne \varnothing \), while in the classic LSH, x and y have collision if \(H^{{\mathcal {A}}}(x)= H^{{\mathcal {B}}}(y)\). The idea of defining hash collision as the intersection of hash sets has been previously proposed in Christiani and Pagh (2017).

Currently, the locality sensitive hashing approach does not generalize to the cases where the triangle inequality, i.e., \(d(x,y)\le d(x,z)+d(z,y)\) does not hold (Indyk and Motwani 1998; Gionis et al. 1999; Miltersen 1999; Charikar 2002; Chakrabarti and Regev 2010; Andoni and Razenshteyn 2015; Andoni et al. 2015, 2018). These papers are based on constructing balls on some specific points in the metric space and the notion of balls is well defined only for metric spaces that satisfy triangle inequality. Recently, high dimensional approximate nearest neighbor search in a known probabilistic distribution setting have been investigated in (Bawa et al. 2005; Dubiner 2012). However, currently it is not possible to design efficient algorithms based on these methods, due to the large number of parameters involved.

The problem of finding high dimensional approximate nearest neighbors in a known probabilistic setting using a bucketing tree algorithm has been studied previously in Dubiner (2010) and Dubiner (2012). Dubiner uses a strategy to hash the data points from an arbitrary joint probability distribution into the leafs of the tree in a way that the paired data collide with a probability higher than the random pairs. Here, paired data refers to the pairs coming from a joint probability distribution, i.e., \((x,y)\sim {\mathbb {P}}\) and the random pairs are the pairs coming from an independent probability distribution, i.e., \(x\sim {{\mathbb {P}}}^{{\mathcal {A}}}\) and \(y\sim {{\mathbb {P}}}^{{\mathcal {B}}}\). However, the algorithm introduced in Dubiner requires solving computationally intractable optimizations, making it impossible to implement [e.g. see equation (126) from Dubiner (2012)]. In this paper, for the specific case where the distribution for any \(s\in \{1,2,\ldots ,S\}\) is \( p(x_s=0,y_s=0)=p(x_s=1,y_s=1)=\frac{p}{2},p(x_s=1,y_s=0)=p(x_s=1,y_s=0)=\frac{1-p}{2}\), our theoretical and practical results are compared to Dubiner (2012) for the range of \(0\le p\le 1\).

In this paper, we propose to solve (1) by defining a family of distribution sensitive hashes satisfying the following property. They hash the jointly-generated pairs of points to the same buckets with probabilities much higher than random pairs. We further design an algorithm to solve (1) in sub-linear time using these families of hashes. Next, a method to find optimal family of hashes is presented to achieve minimum search complexity using multiple decision trees. Note that, these decision trees have the same tree structure while they apply to different permutations \(\{1,2,\ldots ,S\}\rightarrow \{1,2,\ldots ,S\}\) of the data. This way, we design forest of decision trees where each decision tree captures a very small ratio \(\alpha \) of true pairs for some \(\alpha \in {\mathbb {R}}^+\) and by recruiting \(\#bands=O(\frac{1}{\alpha })\) independently permuted decision trees we can reach near perfect recovery of all the true pairs. In this paper, we refer to each decision tree as a band and \(\#bands\) is referred to as the number of bands.

The main idea is that we construct decision-tree hashes, in a way that the chance of true pairs hashing to the same leaf nodes in these decision trees is higher than random points (Fig. 1). The decision tree is built in a way that the ratio \(\frac{{\mathbb {P}}(x,y)}{{\mathbb {Q}}(x,y)}\) is higher than a minimum threshold, while the ratios \(\frac{{\mathbb {P}}(x,y)}{{\mathbb {P}}^{{\mathcal {A}}}(x)}\) and \(\frac{{\mathbb {P}}(x,y)}{{\mathbb {P}}^{{\mathcal {B}}}(y)}\) and the number of nodes in the graph are lower than a maximum thresholds, see Algorithm 4. We further determined the optimal tree among many trees that can be constructed in this way. Two theorems are presented here on the complexity of the decision tree built in Algorithm 4.

  1. 1.

    No decision tree exists with the overall search complexity below \(O(N^{\lambda ^*})\) where \(\lambda ^*\) is derived analytically from the probability distribution \({\mathbb {P}}(x,y)\).

  2. 2.

    The decision tree construction of Algorithm 4 described in (57) results in the overall search with complexity \(O(N^{\lambda ^*})\).

Our results show that our approach, Forest-wise distribution sensitive hashing (ForestDSH), provides a universal hash design for arbitrary discrete joint probability distributions, outperforming the existing state of the art approaches in specific range of distributions, in theory and practice. Moreover, we applied this method to the problem of clustering spectra generated from mass spectrometry instruments.

An alternative strategy for solving (2) is to reformulate the problem as minimum inner product search problem (MIPS) (Shrivastava and Li 2014) by transferring the data points into a new space. However, as we show in Sect. 6 the transferred data points are nearly orthogonal to each other making it very slow to find maximum inner product using the existing method (Shrivastava and Li 2014).

Note that, the algorithms presented in this paper are based on the assumption that the true pairs are generated from a known distribution \({\mathbb {P}}\). The distribution \({\mathbb {P}}\) can be learned from a training dataset of true pairs. In practice, training data can be collected by running a brute force search on smaller data sets or portion of the whole data. For example, in case of mass spectrometry search, we collect training data by running brute-force search on small portion of our data (Frank et al. 2011). Note that, we can never perfectly learn the probability distribution that the data is generated from. In Theorem 3, we prove that our algorithm is robust to noise in the distribution \({\mathbb {P}}\) and demonstrate this in an experiment.

Notation The cardinality of a set A is denoted as |A|. The sets \({\mathbb {N}}\) and \({\mathbb {R}}\) stand for the sets of natural and real numbers, respectively. We use \({\mathbb {P}}(\cdot )\) and \({\mathbb {Q}}(\cdot )\) to denote the probability function \(\text{ Prob }(\cdot )\). \({\mathbb {E}}(\nu )\) denotes the expected value of the random variable v. Moreover, we use the notation \(f(x)=O(g(x))\), if \(\limsup _{x\rightarrow \infty }\frac{|f(x)|}{g(x)}<\infty \) and the notation \(f(x)=\varOmega (g(x))\), if \(\limsup _{x\rightarrow \infty }\frac{|f(x)|}{g(x)}>0\). In this paper, \(\log x\) is computed to the base e.

2 Definitions

Definition 1

Define the S-dimensional joint probability distribution \({\mathbb {P}}\) as follows:

$$\begin{aligned} {\mathbb {P}}:{\mathcal {A}}^S\times {\mathcal {B}}^S\rightarrow [0,1],\quad {\mathbb {P}}(x,y)=\prod _{s=1}^S{p}(x_s,y_s), \end{aligned}$$
(7)

where \({\mathcal {A}}=\{a_1,a_2,\ldots ,a_k\}\), \({\mathcal {B}}=\{b_1,b_2,\ldots ,b_l\}\), \(x=(x_1,x_2,\ldots ,x_S)\in {\mathcal {A}}^S\) and \(y=(y_1,y_2,\ldots ,y_S)\in {\mathcal {B}}^S\) and \(k,l<\infty \). On the other hand, we assume that the probability distribution function p(ab) is independent of s and satisfies \(\sum _{a\in [k],b\in [l]}p_{a,b}=1\). Similarly, we define the marginal probability distributions \({\mathbb {P}}^{{\mathcal {A}}}:{\mathcal {A}}^S\rightarrow [0,1]\), \({\mathbb {P}}^{{\mathcal {B}}}:{\mathcal {B}}^S\rightarrow [0,1]\) and \({\mathbb {Q}}:{\mathcal {A}}^S\times {\mathcal {B}}^S\rightarrow [0,1]\) as \({\mathbb {P}}^{{\mathcal {A}}}(x)=\prod _{s=1}^S{p}^{{\mathcal {A}}}(x_s)\), \({\mathbb {P}}^{{\mathcal {B}}}(y)=\prod _{s=1}^S{p}^{{\mathcal {B}}}(y_s)\) and \({\mathbb {Q}}(x,y)=\prod _{s=1}^S{q}(x_s,y_s)\), where

$$\begin{aligned} {p}^{{\mathcal {A}}}(x_s)= & {} \sum _{j=1}^lp(x_s,b_j), \end{aligned}$$
(8)
$$\begin{aligned} {p}^{{\mathcal {B}}}(y_s)= & {} \sum _{i=1}^kp(a_i,y_s), \end{aligned}$$
(9)
$$\begin{aligned} {q}(x_s,y_s)= & {} {p}^{{\mathcal {A}}}(x_s)\times {p}^{{\mathcal {B}}}(y_s). \end{aligned}$$
(10)

We use \(p_{ij}\) instead of \(p(a_i,b_j)\) for simplicity. Moreover, \(p_{i}^{{\mathcal {A}}}\), \(p_j^{{\mathcal {B}}}\) and \(q_{ij}\) are defined as \(\sum _{j=1}^lp_{ij}\), \(\sum _{i=1}^kp_{ij}\) and \(q(a_i,b_j)\), respectively. Finally, we use compact notations \(P=[p_{ij}]\) and \(Q=[q_{ij}]\) as the \(k\times l\) matrices with \(p_{ij}\) and \(q_{ij}\) in their ith row and jth column, respectively.

We define family of distribution sensitive hashes as follows.

Definition 2

(Family of Distribution Sensitive Hashes) Assume that the four scalar parameters \(\alpha \), \(\beta \), \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\) along with the probability distributions \({\mathbb {P}}\), \({\mathbb {P}}^{{\mathcal {A}}}\), \({\mathbb {P}}^{{\mathcal {B}}}\), \({\mathbb {Q}}\) and finite set \(V_{Buckets}\) are given where \({\mathbb {P}}^{{\mathcal {A}}}\) and \({\mathbb {P}}^{{\mathcal {B}}}\) are marginal distributions of \({\mathbb {P}}\) and \({\mathbb {Q}}={\mathbb {P}}^{{\mathcal {A}}}{\mathbb {P}}^{{\mathcal {B}}}\). A family of hashes \( H_z^{{\mathcal {A}}}:{\mathcal {A}}^S\rightarrow 2^{V_{Buckets}}\) and \( H_z^{{\mathcal {B}}}:{\mathcal {B}}^S\rightarrow 2^{V_{Buckets}}\) is called \(({\mathbb {P}},\alpha ,\beta ,\gamma ^{{\mathcal {A}}},\gamma ^{{\mathcal {B}}})\)-distribution sensitive, if the following hold

$$\begin{aligned} \alpha= & {} \sum _{v\in V_{Buckets}}Prob(v\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {P}}), \end{aligned}$$
(11)
$$\begin{aligned} \beta= & {} \sum _{v\in V_{Buckets}}Prob(v\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {Q}}), \end{aligned}$$
(12)
$$\begin{aligned} \gamma ^{{\mathcal {A}}}= & {} \sum _{v\in V_{Buckets}}Prob(v\in H_z^{{\mathcal {A}}}(x)\mid x\sim {\mathbb {P}}^{{\mathcal {A}}}),\end{aligned}$$
(13)
$$\begin{aligned} \gamma ^{{\mathcal {B}}}= & {} \sum _{v\in V_{Buckets}}Prob(v\in H_z^{{\mathcal {B}}}(y)\mid y\sim {\mathbb {P}}^{{\mathcal {B}}}), \end{aligned}$$
(14)
$$\begin{aligned}&{\mid H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid } \le 1, \end{aligned}$$
(15)

where \(1\le z\le \#bands\) and \(\#bands\) represent the number of bands, while \( V_{Buckets}\) represent a set of indices for the buckets. We show how to choose \(\#bands\) in (51) in Sect. 5 and how to select \(V_{Buckets}\) in Algorithm 3. Intuitively, \(\alpha \) represents the chance of a true pair falling in the same bucket, while \(\beta \) represents the chance of random pairs falling in the same bucket. We will show that \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\) represent the complexity of computing which buckets the data points fall into. In the next section, we will describe how the families of distribution sensitive hashes can be used to design an efficient solution for (1). Note that, in Christiani and Pagh (2017) definitions similar to Definition 2 have been made in order to speed up set similarity search. How the data points are mapped through a decision tree is sketched in Fig. 1. The decision tree and the set of buckets are explained in Algorithm 4. Finally, in Fig. 2, we show how the rate of positive calls for true and random pairs, i.e., \(\alpha \) and \(\beta \) are derived. The average number of times that data points and queries fall in buckets, i.e., \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\) are also derived and shown in this figure.

Remark 1

In classic LSH, we have \(|H^{{\mathcal {A}}}(x)|=1\) and \(|H^{{\mathcal {B}}}(y)|=1\). Therefore, \(\gamma ^{{\mathcal {A}}}=\gamma ^{{\mathcal {B}}}=1\). Our approach is more flexible in the following two ways. First, we allow for \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\) to be larger than one. Moreover, even in the case of \(\gamma ^{{\mathcal {A}}}=\gamma ^{{\mathcal {B}}}=1\), our method can optimize over the larger set of possible hashes.

Fig. 1
figure 1

Distribution sensitive hashing is based on a decision tree G and a set of buckets \(V_{Buckets}\) which are a subset of the leaf nodes of G. Construction of the tree and \(V_{Buckets}\) are explained in Algorithm 4. Here, \(V_{Buckets}\) has six buckets in it. We map data points to buckets by propagating them down the tree, starting from the root using Algorithm 3. If we reach a bucket node, we insert the data point. On the other hand, only a subset of leaf nodes are labeled as bucket and if we reach a leaf node that is not a bucket node, we will simply ignore it. The mapping of and are shown in the figure for \(S=3\). Each of the data points and are mapped to two buckets. Consider data point . Mapping this point by propagating down the tree, we end up with the two buckets \(v_1\) and \(v_3\). Note that, the first character on each of the edges in the paths from root to \(v_1\) and \(v_3\) is 0. Similarly mapping by propagating down the tree, we end up with the two buckets \(v_1\) and \(v_2\) as the second alphabet on the edges should be 0 in all the first two depths and 1 at the third depth (Color figure online)

Fig. 2
figure 2

Positive calls refer to the pairs of data points that are mapped to the same bucket (see Algorithm 2). The rate of positive call for true and random pairs are shown as \(\alpha \) and \(\beta \). The average number of times x and y data points appear in buckets are also shown as \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\). For formal definition of \(\alpha \), \(\beta \), \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\), see Definition 2. Here, we derive the empirical probabilities (11)–(14) in Definition 2. For example, \(\beta \) is derived by counting the number of data points combinations in all the six buckets, i.e., \(2\times 2+1\times 2+2\times 1+1\times 1+1\times 1+1\times 1\) over all the possible number of data points which is \(6\times 7\). Here the size of X is six, and the size of Y is 7. Similarly, \(\gamma ^{{\mathcal {A}}}\) is derived by counting the number of data points from the set X designated to the buckets divided by |X|

Here we present a general algorithm on how ForestDSH algorithm works.

figure a

3 Forest distribution sensitive hashing algorithm

In this section, we assume an oracle has given us a family of distribution sensitive hashes \(H_z^{{\mathcal {A}}},H_z^{{\mathcal {B}}}, 1\le z\le \#bands\) that satisfies (11)–(15). Inspired by LSH method, we present Algorithm 2 for solving (1) using this family.

figure b

Remark 2

Note that, the number of times \({\mathbb {P}}(y\mid x)\) is computed in the brute force method to solve (1) is |X|. Note that, in the optimization problem (1), the point \(y\in Y\) is given. The goal of Algorithm 2 is to solve (1) with a much smaller number of comparisons than the brute force.

Remark 3

In the special case when the hashes are coming from a decision tree, we analyze the complexity of Algorithm 2 in Sect. 5. We show that the number of positive calls in Algorithm 2 is proportional to \(\beta \), while the complexity of computing \(|H_z^{{\mathcal {A}}}(x)|\) and \(|H_z^{{\mathcal {B}}}(y)|\) are proportional to \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\). Moreover, the chance of true pairs being called positive grows with \(\alpha \). Therefore, in the next sections, we attempt to design buckets such that \(\alpha \) is maximized, while \(\beta \), \(\gamma ^{{\mathcal {A}}}\) and \(\gamma ^{{\mathcal {B}}}\) are minimized.

Now, the question is how we can design these families in a way to minimize the complexity, and efficiently map data points to these families. We investigate these questions in Sects. 4 and 4.1 .

4 Designing ForestDSH using a decision tree structure

In the previous section, we assumed that an oracle has given us a family of distribution sensitive hashes. In this section, we design buckets that satisfy (11)–(15) using a forest of decision trees with the same structure. Here, we focus on the probability distributions that can be factorized as the product of i.i.d. components.

Each of our decision trees recovers ratio \(\alpha \) of true pairs and by recruiting \(\#bands=O(\frac{1}{\alpha })\) decision trees we can recover nearly all true pairs. This can be more efficient than using a single decision tree classifier as achieving near perfect true pair recovery by the single decision tree would require near brute-force complexity. By allowing \(\alpha <1\), we can select decision tree that avoid paths with low \({\mathbb {P}}(x,y)\) resulting in complexities much lower than the brute-force search. Note that \(\alpha \) is the true positive rate, e.g. the probability that a true pair (xy) fall in the same bucket, therefore we always have \(\alpha \le 1\), see Fig. 2.

Assume that a decision tree \(G=(V,E,f)\) is given where V is the set of nodes, E is the set of edges, \(V_l\subset V\) is the set of leaf nodes in the decision tree and \(f: V/V_l\times {\mathcal {A}}\times {\mathcal {B}}\rightarrow V\) is the decision function, see Fig. 1. For the two nodes \(v_1,v_2 \in V\), \(v_1\) is called an ancestor of \(v_2\) if \(v_1\) falls within the path from \(v_2\) to root. In this case, \(v_2\) is called a descendant of \(v_1\). Furthermore, assume that a subset of leaf nodes \(V_{Buckets}\subset V_l\) is given and at depth s in the decision tree, the decisions depend only on \(x_s\) and \(y_s\) where \(x=(x_1,\ldots ,x_S)\) and \(y=(y_1,\ldots ,y_S)\) are S-dimensional data point and query, respectively. We define functions \(Seq^{{\mathcal {A}}}:V\rightarrow \cup _{s=0}^S{\mathcal {A}}^s\) and \(Seq^{{\mathcal {B}}}:V\rightarrow \cup _{s=0}^S{\mathcal {B}}^s\) recursively as:

$$\begin{aligned} Seq^{{\mathcal {A}}}({root})= & {} \varnothing \nonumber \\ Seq^{{\mathcal {B}}}({root})= & {} \varnothing \end{aligned}$$
(16)
$$\begin{aligned} Seq^{{\mathcal {A}}}{(f(v,a,b))}= & {} [ Seq^{{\mathcal {A}}}(v),a] \end{aligned}$$
(17)
$$\begin{aligned} Seq^{{\mathcal {B}}}{(f(v,a,b))}= & {} [ Seq^{{\mathcal {B}}}(v),b] \end{aligned}$$
(18)

where [Sa] stands for the concatenation of string S with character a, i.e., for a string \(S=s_1,\ldots ,s_n\) of length n, [Sa] would be \(s_1,\ldots ,s_n,a\) which is a string of length \(n+1\). Moreover, given permutations \(p_z:\{1,\ldots ,S\}\rightarrow \{1,\ldots ,S\}, 1\le z\le \#bands\), \(x=(x_1,\ldots ,x_S)\) and \(y=(y_1,\ldots ,y_S)\) define \(perm_z(x)=(x_{p_z(1)},x_{p_z(2)},\ldots ,x_{p_z(S)})\) and \(perm_z(y)=(y_{p_z(1)},y_{p_z(2)},\ldots ,y_{p_z(S)})\). Finally, the family of buckets \(H_z^{{\mathcal {A}}}(x)\) and \(H_z^{{\mathcal {B}}}(y)\) are defined as

$$\begin{aligned} H_z^{{\mathcal {A}}}(x)= & {} \{v\in V_{Buckets}\mid Seq^{{\mathcal {A}}}(v) ~\text{ is } \text{ a } \text{ prefix } \text{ of } \text{ perm}_z(x)\}, \end{aligned}$$
(19)
$$\begin{aligned} H_z^{{\mathcal {B}}}(y)= & {} \{v\in V_{Buckets}\mid Seq^{{\mathcal {B}}}(v) ~\text{ is } \text{ a } \text{ prefix } \text{ of } \text{ perm}_z(y)\}. \end{aligned}$$
(20)

Note that the permutation \(perm_z\) is a deterministic function of z, i.e., at each band the same permutation is used to randomly permute the data points x and y. These permutations are chosen before mapping the data points. In Fig. 3, we show how both x and y data points are first permuted randomly (using the same permutation) and then are mapped to the buckets in decision trees. Note that, here we call a pair positive, if they fall into the same bucket in at least one of the bands. Now, we show that these hashes are distribution sensitive.

Definition 3

The functions \(\varPhi :V\rightarrow {\mathbb {R}}\), \({\varPsi }^{{\mathcal {A}}}:V\rightarrow {\mathbb {R}}\) and \({\varPsi }^{{\mathcal {B}}}:V\rightarrow {\mathbb {R}}\) are defined as follows. At root, \(\varPhi ({root})=1\), \({\varPsi }^{{\mathcal {A}}}({root})=1\) and \({\varPsi }^{{\mathcal {B}}}({root})=1\), and for \(a_i\in {\mathcal {A}}\), \(b_j\in {\mathcal {B}}\) and \(v\in V\), \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), and \({\varPsi }^{{\mathcal {B}}}(v)\) are defined recursively as

$$\begin{aligned} \varPhi (f(v,a_i,b_j))= & {} \varPhi (v)p_{ij},\forall v\in V, \end{aligned}$$
(21)
$$\begin{aligned} {\varPsi }^{{\mathcal {A}}}(f(v,a_i,b_j))= & {} {\varPsi }^{{\mathcal {A}}}(v)p_{i}^{{\mathcal {A}}},\forall v\in V, \end{aligned}$$
(22)
$$\begin{aligned} {\varPsi }^{{\mathcal {B}}}(f(v,a_i,b_j))= & {} {\varPsi }^{{\mathcal {B}}}(v)p_j^{{\mathcal {B}}},\forall v\in V. \end{aligned}$$
(23)

Moreover, \(\varPsi :V\rightarrow {\mathbb {R}}\) is defined as

$$\begin{aligned} \varPsi (v)= & {} {\varPsi }^{{\mathcal {A}}}(v){\varPsi }^{{\mathcal {B}}}(v),\forall v\in V. \end{aligned}$$
(24)
Fig. 3
figure 3

With only a single band, many of the true pairs are missed (not called positives). Refer to (50) and (51) in order to see how multiple bands improve true positive rate. While the decision tree are not affected in any band \(1\le z\le \#bands\), classes and queries are permuted by \(perm_z\) (using the same permutation \(perm_z\)) and then mapped to the buckets in decision trees. In this figure, all classes and queries are permuted through three permutations \(perm_1(abc)=bac\), \(perm_2(abc)=cba\) and \(perm_3(abc)=cab\). We showed how three of classes, i.e., and three of queries are permuted. How is mapped through these three trees is shown similar to Fig. 1. In the end, all the classes and queries designated to the buckets are shown. Note that, a pair (xy) is called positive if x and y land in the same bucket in at least one band (Color figure online)

Lemma 1

The following properties hold:

$$\begin{aligned} \varPhi (v)= & {} Prob(v\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {P}}), \end{aligned}$$
(25)
$$\begin{aligned} \varPsi (v)= & {} Prob(v\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {Q}}), \end{aligned}$$
(26)
$$\begin{aligned} {\varPsi }^{{\mathcal {A}}}(v)= & {} Prob(v\in H_z^{{\mathcal {A}}}(x)\mid x\sim {\mathbb {P}}^{{\mathcal {A}}}), \end{aligned}$$
(27)
$$\begin{aligned} {\varPsi }^{{\mathcal {B}}}(v)= & {} Prob(v\in H_z^{{\mathcal {B}}}(y)\mid y\sim {\mathbb {P}}^{{\mathcal {B}}}). \end{aligned}$$
(28)

Remark 4

Note that the left side of (25)–(28) is independent of z while it seems like the right side depend on band z. The proof of Lemma 1 in “Appendix 1” shows that in fact it is independent of band z.

Lemma 2

For any decision tree G, satisfying the condition that for any pair of buckets \(v_1,v_2\in V_{Buckets}\), \(v_1\) is not an ancestor or descendant of \(v_2\), \(H_z^{{\mathcal {A}}}(x)\) and \(H_z^{{\mathcal {B}}}(y)\) defined in (19) and (20) are \(({\mathbb {P}},\alpha (G),\beta (G),\gamma ^{{\mathcal {A}}}(G),\gamma ^{{\mathcal {B}}}(G))\)-sensitive where

$$\begin{aligned} \alpha (G)= & {} \sum _{v\in V_{Buckets}(G)}\varPhi (v), \end{aligned}$$
(29)
$$\begin{aligned} \beta (G)= & {} \sum _{v\in V_{Buckets}(G)}\varPsi (v), \end{aligned}$$
(30)
$$\begin{aligned} \gamma ^{{\mathcal {A}}}(G)= & {} \sum _{v\in V_{Buckets}(G)}{\varPsi }^{{\mathcal {A}}}(v), \end{aligned}$$
(31)
$$\begin{aligned} \gamma ^{{\mathcal {B}}}(G)= & {} \sum _{v\in V_{Buckets}(G)}{\varPsi }^{{\mathcal {B}}}(v). \end{aligned}$$
(32)

Proofs of Lemmas 1 and 2 are relegated to “Appendix 1”. So far, we showed how to design ForestDSH using a decision tree structure. However, it is not yet clear how to map data points to these buckets. In the next section, we investigate this and provide an algorithm to design optimal decision trees.

4.1 Mapping data points

In the previous sections, we presented Algorithm 2 for solving (1) where we need to compute \(H_z^{{\mathcal {A}}}(x)\) and \(H_z^{{\mathcal {B}}}(y)\) by mapping data points to the buckets. We did not clarify how this mapping can be done efficiently. We present Algorithm 3 for mapping data points to the buckets using hash-table search, see Fig. 2.

figure c

Remark 5

Queries are mapped to buckets using hash-table search through Algorithm 3.

Note that we slightly modify the hash-table to search for values that are prefix of a query, rather than being exactly identical. In Sect. 5, we show that the complexity of Algorithms 2 and 3 can be formulated as:

$$\begin{aligned}&c_{tree}|V(G)|+\left( \frac{c_{hash}N}{\alpha (G)}+\frac{c_{hash}M}{\alpha (G)}\right. \nonumber \\&\quad \left. +\frac{c_{insertion}N\gamma ^{{\mathcal {A}}}(G)}{\alpha (G)} +\frac{c_{insertion}M\gamma ^{{\mathcal {B}}}(G)}{\alpha (G)} +\frac{c_{pos}MN\beta (G)}{\alpha (G)}\right) {\log \frac{1}{1-TP}}, \end{aligned}$$
(33)

where \(c_{tree}\), \(c_{hash}\), \(c_{insertion}\) and \(c_{pos}\) are constants not depending on N. For the intuition behind (33), note that the first term \(c_{tree}|V(G)|\) stands for the time required for calculating and storing the tree. The second and third terms, i.e., \(\left( \frac{c_{hash}N}{\alpha (G)}+\frac{c_{hash}M}{\alpha (G)}\right) {\log \frac{1}{1-TP}}\) denote the time needed for inserting data points to the hash-table. The fourth and fifth terms, i.e., \(\left( \frac{c_{insertion}N\gamma ^{{\mathcal {A}}}(G)}{\alpha (G)}+\frac{c_{insertion}M\gamma ^{{\mathcal {B}}}(G)}{\alpha (G)}\right) {\log \frac{1}{1-TP}}\) stand for the time required for mapping the data points from the hash-table to buckets. Finally, the last term, i.e., \(\left( \frac{c_{pos}MN\beta (G)}{\alpha (G)}\right) {\log \frac{1}{1-TP}}\) is the time of brute-force checking within each bucket.

In order to bound (33) with \(O(N^{\lambda })\) for some \(\lambda \in {\mathbb {R}}^+\), it is necessary and sufficient to find a tree G that satisfies the following constraints:

$$\begin{aligned} |V(G)|= & {} O(N^{\lambda }), \end{aligned}$$
(34)
$$\begin{aligned} \frac{\alpha (G)}{\beta (G)}= & {} \varOmega (N^{1+\delta -\lambda }), \end{aligned}$$
(35)
$$\begin{aligned} \frac{\alpha (G)}{\gamma ^{{\mathcal {A}}}(G)}= & {} \varOmega (N^{1-\lambda }), \end{aligned}$$
(36)
$$\begin{aligned} \frac{\alpha (G)}{\gamma ^{{\mathcal {B}}}(G)}= & {} \varOmega (N^{\delta -\lambda }), \end{aligned}$$
(37)
$$\begin{aligned} {\alpha (G)}= & {} \varOmega (N^{\max (1,\delta )-\lambda }), \end{aligned}$$
(38)

where \({\delta }=\frac{\log M}{\log N}\).

Theorem 1

Complexity of Algorithms 2 and 3 is equal to (33). Moreover, there exists a decision tree G that is optimal and satisfies (34)–(38) for \(\lambda \) defined in Definition 4.

Theorem 1 is proved in three steps. In Sect. 5, we prove that the complexity is equal to (33). Theorem 2 shows that the tree G constructed by Algorithm 4 and presented in Sect. 6 satisfies (34)–(38) for \(\lambda \) defined in Definition 4. Theorem 3 shows that this is optimal.

5 Complexity analysis

In this section, we bound the complexity of Algorithms 2 and 3 by (33). Note that, the complexity of Algorithms 2 and 3 is the summation of the following terms.

  1. 1.

    Tree construction complexity \(c_{tree}|V(G)|\) is the tree construction complexity where \(c_{tree}\) is a constant representing per node complexity of constructing a node and |V(G)| is the number of nodes in the tree.

  2. 2.

    Data mapping complexity The complexity of this hash-table search grows with

    $$\begin{aligned}&c_{hash}(\#bands)|{X}|+c_{insertion}\sum _{z=1}^{\#bands}\sum _{x\in X}|H_z^{{\mathcal {A}}}(x)| \end{aligned}$$
    (39)
    $$\begin{aligned}&+c_{hash}(\#bands)|{Y}|+c_{insertion}\sum _{z=1}^{\#bands}\sum _{y\in Y}|H_z^{{\mathcal {B}}}(y)|, \end{aligned}$$
    (40)

    where \(c_{hash}\) and \(c_{insertion}\) represent complexity of insertion in the hash-table and insertion in the buckets, respectively.

  3. 3.

    Complexity of checking positive calls \((\#bands)c_{pos}\sum _{x\in X,y\in Y}|H_z^{{\mathcal {A}}}(x) \cap H_z^{{\mathcal {B}}}(y) |\big )\) is the complexity of checking positive calls where \(c_{pos}\) is a constant representing the complexity of computing \({\mathbb {P}}(y\mid x)\) for a positive. Note that, \(c_{tree}\), \(c_{hash}\), \(c_{insertion}\) and \(c_{pos}\) are constants not depending on N. From (11)–(15), we have

    $$\begin{aligned}&{\mathbb {E}}\left( \sum _{z=1}^{\#bands}\sum _{x\in X}|H_z^{{\mathcal {A}}}(x) |\right) \nonumber \\&\quad ={{}{\mathbb {E}}\left( \sum _{z=1}^{\#bands}\sum _{x\in X,v\in V_{Buckets}(G)}1_{x\in v}\right) } \end{aligned}$$
    (41)
    $$\begin{aligned}&\quad =(\#bands)N\sum _{v\in V_{Buckets}(G)}{\varPsi }^{{\mathcal {A}}}(v)=(\#bands)N\gamma ^{{\mathcal {A}}}(G) \end{aligned}$$
    (42)
    $$\begin{aligned}&{\mathbb {E}}\left( \sum _{z=1}^{\#bands}\sum _{y\in Y}|H_z^{{\mathcal {B}}}(y) |\right) \nonumber \\&\quad ={\mathbb {E}}\left( \sum _{z=1}^{\#bands}\sum _{y\in Y,v\in V_{Buckets}(G)}1_{y\in v}\right) \end{aligned}$$
    (43)
    $$\begin{aligned}&\quad =(\#bands)M\sum _{v\in V_{Buckets}(G)}{\varPsi }^{{\mathcal {B}}}(v)=(\#bands)M\gamma ^{{\mathcal {B}}}(G). \end{aligned}$$
    (44)

    Note that, the total number of collision for random pairs is the sum of number of buckets that they intersect at. Therefore, we conclude that

    $$\begin{aligned}&{\mathbb {E}}\left( \sum _{z=1}^{\#bands}\sum _{x\in X,y\in Y}|H_z^{{\mathcal {A}}}(x) \cap H_z^{{\mathcal {B}}}(y) |\right) \nonumber \\&\quad =\sum _{z=1}^{\#bands}\sum _{x\in X,y\in Y}Prob\big (|H_z^{{\mathcal {A}}}(x) \cap H_z^{{\mathcal {B}}}(y) |=1\big ) \end{aligned}$$
    (45)
    $$\begin{aligned}&\quad =\sum _{z=1}^{\#bands}{\sum _{x\in X,y\in Y}\sum _{v\in V_{Buckets}(G)}}Prob\big (v\in H_z^{{\mathcal {A}}}(x) ,v\in H_z^{{\mathcal {B}}}(y) \big ) \end{aligned}$$
    (46)
    $$\begin{aligned}&\quad =(\#bands)MN\sum _{v\in V_{Buckets}(G)}\varPsi (v)=(\#bands)MN\beta (G), \end{aligned}$$
    (47)

    where (45) is concluded as \({\mid H^{{\mathcal {A}}}(x)\cap H^{{\mathcal {B}}}(y)\mid }\le 1\).

Now, the question is how we can select \(\#bands\) such that the true positive rate, defined as the ratio of true pairs that are called positive is high. In each band, the chance of a pair \((x,y)\sim {\mathbb {P}}\) being called positive is computed as \(\alpha (G)=\sum _{v\in V_{Buckets}(G)}\varPhi (v)\). Therefore, the overall true positive rate can be computed as:

$$\begin{aligned} TP= & {} Prob((x,y) ~\text{ called } \text{ positive } \text{ in } \text{ Algorithm } \text{2 }~\mid (x,y)\sim {\mathbb {P}}) \end{aligned}$$
(48)
$$\begin{aligned}= & {} 1-\prod _{z=1}^{\#bands}\left( 1-\sum _{v\in V_{Buckets}} Prob(v\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {P}})\right) \qquad \end{aligned}$$
(49)
$$\begin{aligned}= & {} 1-{(1-\alpha (G))}^{\#bands}. \end{aligned}$$
(50)

Using (50), and the inequality \((1-x)^{\frac{c}{x}}< e^{-c}\), the minimum possible value of \(\#bands\) to ensure true positive rate TP can be computed as

$$\begin{aligned} \#bands= & {} \lceil \frac{\log \frac{1}{1-TP}}{\alpha (G)}\rceil , \end{aligned}$$
(51)

where \(\lceil r\rceil \) stands for the smallest integer greater than or equal to r. Therefore, the total complexity is equal to (33).

6 Constructing optimal decision trees for ForestDSH

In this section, we present an algorithm to design decision trees with complexity \(O(N^{\lambda ^*})\), where \(\lambda ^*\) is defined below, and we show that it is the optimal decision tree.

Definition 4

Given probability distributions \(P=[p_{ij}]\) and \(Q=[q_{ij}]\), \(1 \le i \le k, 1 \le j \le l\), and number of queries and classes M and N define \(\delta =\frac{\log M}{\log N}\) and

$$\begin{aligned} {\mathcal {I}}= & {} \{(\mu , \nu ,\eta ) \in {\mathbb {R}}^3 | \min (\mu ,\nu )\ge \eta \ge 0,\nonumber \\&\sum _{1 \le i \le k, 1 \le j \le l}p_{ij} ^ {1+\mu +\nu -\eta } {(p_{i}^{{\mathcal {A}}})} ^ {-\mu } {(p_j^{{\mathcal {B}}})} ^ {-\nu } =1 \}, \end{aligned}$$
(52)
$$\begin{aligned} (\mu ^*,\nu ^*,\eta ^*)= & {} \arg \max _{{\mathcal {I}}}\frac{\max (1,\delta )+\mu +\nu \delta }{1+\mu +\nu -\eta }, \end{aligned}$$
(53)
$$\begin{aligned} r_{ij}^*= & {} {p_{ij} ^ {1+\mu ^*+\nu ^*-\eta ^*} {(p_{i}^{{\mathcal {A}}})} ^ {-\mu ^*} {(p_j^{{\mathcal {B}}})} ^ {-\nu ^*}} , \end{aligned}$$
(54)
$$\begin{aligned} n^*= & {} {\frac{(\max (1,\delta )-\lambda ^*) \log N}{\sum r_{ij}^*\log \frac{p_{ij}}{r_{ij}^*}}}, \end{aligned}$$
(55)
$$\begin{aligned} \lambda ^*= & {} \frac{\max (1,\delta )+\mu ^*+\nu ^*\delta }{1+\mu ^*+\nu ^*-\eta ^*}. \end{aligned}$$
(56)

Remark 6

For any probability distribution \({\mathbb {P}}\), the parameters \(\mu ^*\), \(\nu ^*\), \(\eta ^*\) and \(\lambda ^*\) can be derived numerically from Algorithm 5 in “Appendix 2”. The intuition  behind  the  definition  of   \({\mathcal {I}}\) and \((\mu ^*,\nu ^*,\eta ^*)\) is that in Lemma 4 in Sect. 6.1 we show that for any decision tree G and the variables \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\) and \(\varPsi (v)\) defined in (21)–(24), we have \(\sum _{v \in V_{Buckets}(G)}{\big (\varPhi (v)\big )}^{1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu +\eta }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu +\eta }{\big (\varPsi (v)\big )}^ {-\eta } \le 1\)

if \((\mu ,\nu ,\eta )\in {\mathcal {I}}\). Moreover, in proof of Theorem 2 we show that \((\mu ^ *,\nu ^ *,\eta ^*)\) are Lagrangian multipliers in an optimization problem to minimize the search complexity in Algorithm 2 while retaining a nearly perfect recovery. Consider the optimal decision tree and all the nodes v satisfying

  1. 1.

    Depth of the node v is \(n^*\).

  2. 2.

    For any \(1\le i\le k,1\le j\le l\), the ratio of times we have \(a_i\) at s-th position of x and \(b_j\) at s-th position of y in all \(1\le s\le n^*\) is \(r_{ij}\).

The node v or one of its ancestors is designated as a bucket by Algorithm 4. This is proved in Sect. 6.2.

In Algorithm 4, we provide an approach for designing decision trees with complexity \(O(N^{\lambda ^*})\). The algorithm starts with the root, and at each step, it either accepts a node as a bucket, prunes a node, or branches a node into kl children based on the following constraintsFootnote 5:

$$\begin{aligned} \left\{ \begin{array}{cl}\frac{\varPhi (v)}{\varPsi (v)} \ge C_1 {N ^ {1+\delta -\lambda ^*}}&{}: \text{ Accept } \text{ bucket },\\ \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}\le C_2N^{1-\lambda ^*}&{}: \text{ Prune },\\ \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {B}}}(v)}\le C_3N^{\delta -\lambda ^*}&{}: \text{ Prune, }\\ otherwise&{}: \text{ Branch } \text{ into } \text{ the } kl \text{ children. } \end{array}\right. \end{aligned}$$
(57)

Note that, for the cases where \(p_{ij}=0\), we simply remove that branch, therefore, we assume all the \(p_{ij}\) are positive real numbers. In Sect. 5, we prove that in order to bound the complexity in (33) with \(O(N^{\lambda })\) for some \(\lambda \in {\mathbb {R}}^+\), it is necessary and sufficient to find a decision tree G that satisfies the constraints (34)–(38).

Remark 7

In Theorem 3, we prove that the decision tree construction of Algorithm 4 results in a tree with complexity \(O(N^{\lambda ^*})\) by setting \(C_1=C_2=C_3=p_0q_0\) where \(p_{0}\) and \(q_{0}\) are defined as \(\prod _{i,j}p_{ij}\) and \(\min (\prod _{i,j}q_{ij},\prod _{i}{(p_i^{{\mathcal {A}}})}^{l},\prod _j{(p_j^{{\mathcal {B}}})}^{k})\). Pessimistic lower bounds \(C_1=C_2=C_3=p_0q_0\) are derived in proof of Theorem 3 and in practice \(C_1\), \(C_2\) and \(C_3\) are chosen in a way that the best complexity is achieved.

Example 1

Here, we focus on the case where \({\mathcal {A}}=\{0,1\},{\mathcal {B}}=\{0,1\}\), \(P=\begin{bmatrix} 0.4 &{} 0.3 \\ 0.1 &{} 0.2 \end{bmatrix}\), \(Q=\begin{bmatrix} 0.35 &{} 0.35 \\ 0.15 &{} 0.15 \end{bmatrix}\), \(\delta =1\), and \(M=N=4\). From Algorithm 5, we have \(\mu ^*=12.0791\), \(\nu ^*=13.4206\), \(\eta ^*=11.0959\), \(\lambda ^*=1.7203\). The decision tree is constructed from Algorithm 4 and is depicted in Fig. 4. The nodes in the tree that are selected as bucket, i.e., satisfying \(\frac{\varPhi (v)}{\varPsi (v)}\ge C_1{N^{1+\delta -\lambda ^*}}\), are shown with a green check mark, and the pruned nodes, satisfying either \(\frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}\le C_2N^{1-\lambda ^*}\) or \(\frac{\varPhi (v)}{{\varPsi }^{{\mathcal {B}}}(v)}\le C_3N^{\delta -\lambda ^*}\) are shown with a red cross. For the non-leaf (intermediate) nodes, none of the above constraints holds.

Fig. 4
figure 4

The decision tree and functions \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\) and \(\varPsi (v)\) are illustrated for \({\mathcal {A}}=\{0,1\},{\mathcal {B}}=\{0,1\}\), \(C_1=C_2=C_3=0.8\) and \(N=5\). For all of the nodes \(v_1\), \(v_2\) and \(w_1\), how the decisions are made are explained in the figure. For instance, consider the node \(w_1\). For this node, \(\varPhi (w_1)\), \({\varPsi }^{{\mathcal {A}}}(w_1)\), \({\varPsi }^{{\mathcal {B}}}(w_1)\) are derived from (21)–(23). On the other hand, from (57) this node is accepted as a bucket as \(\frac{\varPhi (w_1)}{\varPsi (w_1)}=1.31\ge 0.8N^{2-\lambda ^*}=1.26\). Note that, \(\lambda ^*=1.72\) from (56) and Algorithm 5

figure d

Theorem 2

No decision tree exists with overall complexity below \(O(N^{\lambda ^*})\).

Theorem 3

The decision tree construction of Algorithm 4 described in Remark 7 results in a tree with complexity \(O(N^{\lambda ^*})\).

In other words, Theorem 3 proves that the tree G constructed by Algorithm 4 satisfies (34)–(38), and Theorem 2 shows that this is the optimal decision tree. For proofs of Theorems 2 and 3 , see Sects. 6.1 and 6.2 . Note that, not only Theorem 3 guarantees that the number of the nodes in our decision tree is bounded by \(O(N^{\lambda ^*})\) but also it guarantees that the runtime for mapping the data points to the decision tree and the number of comparisons that we need to do for the nodes with the collision is bounded by \(O(N^{\lambda ^*})\), see complexity equation (33) in Sect. 4.1.

Theorem 4

(Noise Robustness) Assume that the decision tree described in Algorithm 4 is constructed based on distribution \(P=[p_{ij}]\) while the data is actually generated from an unknown distribution \(P'=[p'_{ij}]\). Assume the distribution \(P'=[p'_{ij}]\) is satisfying \(\frac{p_{ij}}{1+\epsilon }\le p'_{ij}\le {p_{ij}}{(1+\epsilon )}\) for all \(1\le i\le k, 1\le j\le l\) and some \(\epsilon >0\). Then, it is possible to design an algorithm with complexity \(O(N^{\lambda ^{*}(p)+3c_d\log (1+\epsilon )})\) where \(c_d= \frac{{(\lambda ^*-\min (1,\delta ))}}{\log (\max _{i,j}\min (\frac{p_{ij}}{p_i^{{\mathcal {A}}}},\frac{p_{ij}}{p_j^{{\mathcal {B}}}}))}\) while maintaining arbitrary high true positive rate TP.

In other words, Theorem 4 proves that the tree G constructed by Algorithm 4 is noise robust, i.e., for the case where our understanding from the distribution is not exact, the arbitrary high true positive rate TP can still be maintained while the complexity increases linearly with the noise. For proof of Theorem 4, see Sect. 6.3. For a high level intuition behind the relationship between Algorithm 4 and Theorems 2 and 3 , note that: Every constraint in the decision tree defined by constraints (57) directly relates to the constraints (34)–(38). Therefore, we expect this decision tree to be optimal.

Here, we present an intuition behind proof of Theorem 2. For rigorous proof of Theorem 2, see “Appendix 3”.

6.1 Intuition behind proof of Theorem 2

Note that, from definition of \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\) and \(\varPsi (v)\) in (21)–(24), we have \(\varPhi (w_{ij})=p_{ij}\varPhi (v)\) where v is the parent of \(w_{ij}\). Similar equations hold for \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\) and \(\varPsi (v)\). On the other hand, from (52), note that for any \((\mu ,\nu ,\eta )\in {\mathcal {I}}\) we have \(\sum _{1 \le i \le k, 1 \le j \le l}p_{ij} ^ {1+\mu +\nu -\eta } {(p_{i}^{{\mathcal {A}}})} ^ {-\mu } {(p_j^{{\mathcal {B}}})} ^ {-\nu } =1 \). Therefore, we expect to have similar relation between \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\), \(\varPsi (v)\), i.e.,

$$\begin{aligned} \sum _{v \in V_{Buckets}(G)}{\big (\varPhi (v)\big )}^{1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu +\eta }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu +\eta }{\big (\varPsi (v)\big )}^ {-\eta } \le 1, \end{aligned}$$
(58)

for any \((\mu ,\nu ,\eta )\in {\mathcal {I}}\). On the other hand, from (29)–(32), we have \(\alpha (G)=\sum _{v\in V_{Buckets}(G)}\varPhi (v)\). Similar definitions hold for \(\beta (G)\), \(\gamma ^{{\mathcal {A}}}(G)\) and \(\gamma ^{{\mathcal {B}}}(G)\). Therefore, from convexity of the function \(f(\theta ,\theta _1,\theta _2,\theta _3)={\theta } ^ {1+\rho _1+\rho _2+\rho _3} {\theta _1} ^{-\rho _1} {\theta _2}^ {-\rho _2} {\theta _3} ^{-\rho _3}\) we conclude

$$\begin{aligned} 1\ge {\big (\alpha (G)\big )}^ {1+\mu ^*+\nu ^*-\eta ^*} {\big (\gamma ^{{\mathcal {A}}}(G)\big )}^{-\mu ^ *+\eta ^*} {\big (\gamma ^{{\mathcal {B}}}(G)\big )}^{-\nu ^ *+\eta ^*} {\big (\beta (G)\big )}^{-\eta ^ *}, \end{aligned}$$
(59)

using the lower bounds on \(\frac{\alpha (G)}{\beta (G)}\), \(\frac{\alpha (G)}{\gamma ^{{\mathcal {A}}}(G)}\), \(\frac{\alpha (G)}{\gamma ^{{\mathcal {B}}}(G)}\) and \({\alpha (G)} \) in (34)–(38) and (59), we conclude \(\lambda \ge \lambda ^*\).

Here, we present an intuition behind proof of Theorem 3. For rigorous proof of Theorem 3, see “Appendix 4”.

6.2 Intuition behind proof of Theorem 3

Here, our goal is to prove that the tree construction steps in Algorithm 4, i.e., (57) result in a tree that satisfies the following three statements:

  1. 1.

    There is at least one node in the tree which is accepted as a bucket.

  2. 2.

    [(35)–(38)] holds.

  3. 3.

    Number of nodes in the tree is bounded by \(O(N^{\lambda ^*})\).

Let us intuitively prove all these three statements one by one.

  1. 1.

    Consider a node with the depth equal to \(n^*\) given in (55). Moreover, assume that the number of times we have \(a_i\) at sth position of x and \(b_j\) at sth position of y for all \(1\le s\le n^*\) are \(n^*r^*_{ij}\). Then, we argue that this node is not pruned and accepted as a bucket. In order to understand why intuitively it is true, we verify that this node is accepted as a bucket, i.e., we have to verify that \(\frac{\varPhi (v)}{\varPsi (v)} \ge {N ^ {1+\delta -\lambda ^*}}{p_0q_0}\). This is true as

    $$\begin{aligned} \frac{\varPhi (v)}{\varPsi (v)}= & {} \varOmega (e^{\sum _{i,j} n^*r^*_{ij}\log \frac{p_{ij}}{q_{ij}}}) \end{aligned}$$
    (60)
    $$\begin{aligned}\ge & {} \varOmega ({N ^ {1+\delta -\lambda ^*}}), \end{aligned}$$
    (61)

    (60) follows from (21), (22) and the definition of node v, i.e., the number of times we have \(a_i\) at s-th position of x and \(b_j\) at s-th position of y are \(n^*r^*_{ij}\). (61) is concluded from the definition of \(r^*_{ij}\) in (54). For further details, see “Appendix 4”.

  2. 2.

    Let us prove (35) as the rest of [(36)–(38)] follow similarly. From Algorithm 4, for all the buckets we have

    $$\begin{aligned} \frac{\varPhi (v)}{\varPsi (v)}\ge & {} {N ^ {1+\delta -\lambda ^*}}p_0q_0. \end{aligned}$$
    (62)

    Note that, at least there is one node that is accepted a a bucket. On the other hand, \(\alpha (G)=\sum _{v\in V_{Buckets}(G)}\varPhi (v)\) and \(\beta (G)=\sum _{v\in V_{Buckets}(G)}\varPsi (v)\). Therefore, we conclude that

    $$\begin{aligned} \frac{\alpha (G)}{\beta (G)}= & {} \frac{\sum _{v\in V_{Buckets}(G)}\varPhi (v)}{\sum _{v\in V_{Buckets}(G)}\varPsi (v)} \ge N^{1+\delta - \lambda ^*}p_{0}q_{0}. \end{aligned}$$
    (63)

    Note that \(\frac{\sum _i{a_i}}{\sum _ib_i}\ge c\) if \(\frac{a_i}{b_i}\ge c\) and \(b_i>0\) for any i.

  3. 3.

    Finally, we prove that number of nodes in the tree is bounded by \(O(N^{\lambda ^*})\). Note that \(\sum _{i,j}p_{ij}=1\), therefore, we expect to have \(\sum _{v\in V_l(G)}\varPhi (v)=1\). On the other hand, \(\varPhi (v)\) is expected to be greater than \(N^{-\lambda ^*}\) for the intermediate nodes from (57). Therefore, it is concluded that \(|V_l(G)|\) is at most \(N^{\lambda ^*}\). This results in \(|V(G)|=O(N^{\lambda ^*})\) as for any decision tree we have \(|V(G)|\le 2|V_l(G)|\). For details, see “Appendix 4”.

.

Below, we present an intuition behind proof of Theorem 4. For rigorous proof of Theorem 4, see “Appendix 5”.

6.3 Intuition behind proof of Theorem 4

Define \(\varPhi '(v), \alpha '(G), \beta '(G), \#bands'\) for \(p'_{ij}\) the same way as \(\varPhi (v), \alpha (G), \beta (G), \#bands\) for \(p_{ij}\). As \(\frac{p_{ij}}{1+\epsilon }\le p'_{ij}\le {p_{ij}}{(1+\epsilon )}\), we expect \(\varPhi (v)\) to be bounded as

$$\begin{aligned} \varPhi (v)\frac{1}{{(1+\epsilon )}^d}\le \varPhi '(v)\le \varPhi (v){(1+\epsilon )}^d \end{aligned}$$
(64)

where \(d=depth(G)\). Therefore, from (29) we have

$$\begin{aligned} \alpha (G)= & {} \sum _{v\in V_{Buckets}(G)}\varPhi (v) \end{aligned}$$
(65)
$$\begin{aligned}\le & {} \sum _{v\in V_{Buckets}(G)}\varPhi '(v) {(1+\epsilon )}^d \end{aligned}$$
(66)
$$\begin{aligned}\le & {} \alpha '(G) (1+\epsilon )^d. \end{aligned}$$
(67)

We conclude similar inequalities for \(\gamma ^{{\mathcal {A}}}(G)\) and \(\gamma ^{{\mathcal {B}}}(G)\), while for \(\beta (G)\) we have

$$\begin{aligned} \beta '(G) (1+\epsilon )^{2d}\le \beta (G)\le \beta '(G) \frac{1}{(1+\epsilon )^{2d}}. \end{aligned}$$
(68)

From (51), \(\#bands\) is inversely related to \(\alpha (G)\). Therefore, \(\#bands'\) can be bounded from above by \({(1+\epsilon )}^d\#bands\). As a result, from (33) the total complexity is bounded by \({(1+\epsilon )}^{3d}N^{\lambda ^*}\) from above. Assume that d is bounded by \(c_d\log N\) for some constant \(c_d\). Therefore, we conclude Theorem 4. In order to see why \(d\le c_d\log N\), note that for all the leaf nodes, we have \(\frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}\le {N ^ {1+\delta -\lambda ^*}}\max _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}}p_0q_0\). On the other hand, \(\frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}\) can be bounded by \({\big ( \min _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}}\big )}^d\) from below. Therefore, we conclude that \(d=c_d\log N\) for some constant \(c_d\).

7 Experiments and observations

Experiment 1

In this experiment, we compare the complexity of ForestDSH with the algorithm proposed by Dubiner in Dubiner (2012). Here, we set \({\mathcal {A}}={\mathcal {B}}=\{0,1\}, S=1000, M=N=1{,}000{,}000\). For ForestDSH, we use Algorithm 5 to derive \(\mu ^*,\nu ^*,\eta ^*,\lambda ^*\) and \(\delta \) while training \(C_1\), \(C_2\) and \(C_3\) to get the best complexity in Algorithm 4. For Dubiner’s algorithm, equation (126) in Dubiner (2012) is computationally intractable which makes it impossible to compute the complexity for the algorithm presented there for general probability distributions. However, in the special case where \(P(p)=\begin{bmatrix} \frac{p}{2} &{} \frac{1-p}{2} \\ \frac{1-p}{2} &{} \frac{p}{2} \end{bmatrix}\), \(0.5\le p\le 1\) (hamming distance), a solution has been provided for computing the complexity in Dubiner (2012). We implemented that solution (see “Appendix 6” for the detail of implementation), and compared it to ForestDSH. Note that currently no source code for the implementation of Dubiner algorithm is available. Figure 5, shows that Dubiner algorithm’s performance is worse than that of ForestDSH.

Fig. 5
figure 5

Comparing the practical performances of Dubiner algorithm in Dubiner (2012) with ForestDSH for \(S=1000\). ForestDSH outperforms Dubiner’s method for all values of p. The pseudo code for the algorithm presented in Dubiner (2012) for the case of hamming distance is generated in this paper in Algorithm 6 in “Appendix 6”

Observation 1

In this observation, we reformulate the problem of solving (2) to the minimum inner product search problem (MIPS) (Shrivastava and Li 2014) by transferring the data points from \({\mathcal {A}}^S\) and \({\mathcal {B}}^S\) to \(R^{klS}\) in a way that \(\log \big (\frac{{\mathbb {P}}(x,y)}{{\mathbb {Q}}(x,y)}\big )\) is equal to the dot product in this new space. We transformed \(x\in {\mathcal {A}}^S\) and \(y\in {\mathcal {B}}^S\) to \(T(x)\in {\mathbb {R}}^{klS}\) and \(T(y)\in {\mathbb {R}}^{klS}\) as follows:

$$\begin{aligned} T(x)= & {} \big (f_{s,i,j}\big ),1\le s\le S,1\le i\le k,1\le j\le l, \end{aligned}$$
(69)
$$\begin{aligned} f_{s,i,j}= & {} \left\{ \begin{array}{cl} \frac{\log \big (\frac{p_{ij}}{q_{ij}}\big )}{\omega _{ij}}&{} \text{ if }~x_s=a_i\\ 0&{} \text{ o.w. }\end{array}\right. , \end{aligned}$$
(70)
$$\begin{aligned} T(y)= & {} \big (g_{s,i,j}\big ),1\le s\le S,1\le i\le k,1\le j\le l,\end{aligned}$$
(71)
$$\begin{aligned} g_{s,i,j}= & {} \left\{ \begin{array}{cl}{\omega _{ij}}&{} \text{ if }~y_s=b_j\\ 0&{} \text{ o.w. }\end{array}\right. . \end{aligned}$$
(72)

Then, we have \(\log \left( \frac{{\mathbb {P}}(x,y)}{{\mathbb {Q}}(x,y)}\right) =<T(x),T(y)>\) where \(<.,.>\) stands for the inner product in \({\mathbb {R}}^{klS}\). In other words, given any \(x=(x_1,\ldots ,x_S)\), each \(x_s\) is transformed into a \(kl\times 1\) vector with l non-zero elements. Similarly, given any \(y=(y_1,\ldots ,y_S)\), each \(y_s\) is transformed to a \(kl\times 1\) vector with k non-zero elements. Therefore, finding pairs of data points with large \(\frac{{\mathbb {P}}(x,y)}{{\mathbb {Q}}(x,y)}\) is equivalent to finding transformed data points with large dot product. Using this transformation, in “Appendix 7” we show that the angle between both the true pairs and false pairs will be nearly \(\frac{\pi }{2}\) for almost all the probability distributions [\(\frac{S_0}{M^2}\approx 0\), using the notation from Shrivastava and Li (2014)]. It is well known that MIPS performs poorly in detection of pairs that are nearly orthogonal (Shrivastava and Li 2014). Therefore, solving (2) by transforming it to a MIPS problem and using existing approaches fails. Note that, Shrivastava and Li (2014) is based on data independent hashes for Euclidean distance that are not the state of the art. Recently, data dependent hashes have been introduced for Euclidean distance that improve on their data independent counterparts (Andoni et al. 2017; Rubinstein 2018), while currently there is no data dependent strategy for maximum inner product search. Therefore, MIPS is currently unable to solve (1) efficiently.

Experiment 2

In this experiment, we compared the complexity for the three algorithms LSH-hamming, MinHash and ForestDSH for a range of probability distributions. We benchmark the three methods using matrices \({P}(t)={P}_1(1-t)+{P}_2t\) where \(0\le t\le 1\), \({P}_1=\begin{bmatrix} 0.345 &{} 0 \\ 0.31 &{} 0.345 \end{bmatrix}\), \({P}_2=\begin{bmatrix} 0.019625 &{} 0 \\ 0.036875 &{} 0.9435 \end{bmatrix}\), and \(\delta =1\), i.e., \(M=N\). The selection of \({P}_1\) was such that the complexity of MinHash minus the complexity of LSH-hamming was maximized. \({P}_2\) was selected such that the complexity of LSH-hamming minus the complexity of MinHash was maximized. Figure 6a shows the theoretical complexities of MinHash, LSH-hamming and ForestDSH for each matrix. See “Appendix 8”, for the details on the derivation of complexities for MinHash, LSH-hamming and ForestDSH. For instance, for \({P}_1=\begin{bmatrix} 0.345 &{} 0 \\ 0.31 &{} 0.345 \end{bmatrix}\), the theoretical per query complexities of MinHash, LSH-hamming and ForestDSH are equal to 0.5207, 0.4672 and 0.4384, respectively. We further consider N data points of dimension S, \(\{x^1,\ldots ,x^N\}\) and \(\{y^1,\ldots ,y^N\}\) where each \((x^i,y^i)\) is generated from P(t), and \(x^i\) is independent from \(y^j\) for \(i\ne j\) \((N=2000,S=2000)\). Then, we used ForestDSH, LSH-hamming and MinHash to find the matched pairs. In each case, we tuned \(\#rows\)Footnote 6 and \(\#bands\) to achieve \(99\%\) true positive (recall) rate. Total simulation time for each of the three methods is plotted for each probability distribution in Fig. 6b. The simulation times in Fig. 6b are consistent with the theoretical guarantees in Fig. 6a. Figure 6b, c show that for sparse matrices, \((t\approx 1)\), MinHash and ForestDSH outperform LSH-hamming. In denser cases, \((t\approx 0)\), LSH-hamming and ForestDSH outperform MinHash. For \((t\le 0.4)\), ForestDSH outperforms both MinHash and LSH-hamming. Note that, for the case when the data is coming from sparse distribution, MinHash beats ForestDSH in practice (as opposed to Theory). This is because for sparse data the total complexity tends to its minimum, i.e., O(N). ForestDSH is inefficient compared to MinHash for small number of N and when the data is sparse as the constant terms and logN terms which play a significant role in this case are not optimized in ForestDSH. In Fig. 7, we further plotted V(G(N)), \(\frac{\alpha (G(N))}{\beta (G(N))}\), \(\frac{\alpha (G(N))}{\gamma ^{{\mathcal {A}}}(G(N))}\) and \(\frac{\alpha (G(N))}{\gamma ^{{\mathcal {B}}}(G(N))}\) as a function of N for trees G(N) constructed by Algorithm 4 for \({P}(t=0.25)\), where \(M=N\). As predicted by Theorem 3, we observed that these quantities grow/decay proportional to \(N^{\lambda ^*}\), \(N^{1+\delta -\lambda ^*}\), \(N^{1-\lambda ^*}\) and \(N^{\delta -\lambda ^*}\), respectively. In Fig. 8, true positive rate and total complexity are plotted in terms of \(\#bands\) for the case of \(N=20{,}000\), \(S=10{,}000\) and the probability distribution \({P}_2\). From (50), i.e., \(TP=1-{(1-\alpha (G))}^{\#bands}\), we expect true positive rate to increase while \(\#bands\) increases (see Fig. 8a). On the other hand, from (51) and (33), total complexity \((N^{\lambda ^*})\) is expected to linearly increase while \(\#bands\) increases (see Fig. 8b).

Fig. 6
figure 6

Total (opposed to per query) complexities of LSH-hamming, MinHash, and ForestDSH are plotted for all the probability distribution matrices \({P}(t)={P}_1(1-t)+{P}_2t\) where \(0\le t\le 1\). a Theoretical guarantees, b simulation time for \(N=2000\) and \(S=2000\), c simulation time for \(N=20{,}000\) and \(S=2000\). Note that ForestDSH performs faster than MinHash if the data is not sparse. While for \(N=2000\) MinHash is superior to ForestDSH on sparse data, when N = 20,000, ForestDSH is performing the same as MinHash. This is mainly due to the fact that constant and \(\log N\) terms vanish compared to N as N grows

Fig. 7
figure 7

In this figure, V(G(N)), \(\frac{\alpha (G(N))}{\beta (G(N))}\), \(\frac{\alpha (G(N))}{\gamma ^{{\mathcal {A}}}(G(N))}\) and \(\frac{\alpha (G(N))}{\gamma ^{{\mathcal {B}}}(G(N))}\) are depicted as a function of N and constants \(c_1\), \(c_2\), \(c_3\) and \(c_4\) not depending on N. The parameters are chosen from Experiment 2. This figure shows how these functions grow/decay proportional to \(N^{\lambda ^*}\), \(N^{1+\delta -\lambda ^*}\), \(N^{1-\lambda ^*}\) and \(N^{\delta -\lambda ^*}\), respectively. (See proof of Theorem 3 in Sect. 6.2 for the justifications. This confirms [(34)–(38)] for trees G(N) constructed by Algorithm 4)

Fig. 8
figure 8

a True positive rate and b total complexity \((N^{\lambda ^*})\) are plotted for the probability distribution matrix \({P}_2(t)\), \(N=2000\) and \(S=2000\) in terms of \(\#bands\)

Experiment 3

In this experiment, we examine the robustness of our algorithm from Theorem 4 for matrices considered in Experiment 2 while \(M=N=300{,}000\). For each P(t), we generated random matrices \({P}'(t)\) within \(1+\epsilon \) of P(t), see Theorem 4. We derive the practical complexity and theoretical complexity, i.e., \(\lambda ^{*}\) from Definition 4. Moreover, for each of these matrices, we derive the worst case complexity in the case of \(\epsilon =0.03\). This is sketched in Fig. 9.

Fig. 9
figure 9

In this figure, practical, theoretical and noisy complexity are compared for \({P}(t)={P}_1(1-t)+{P}_2t\) given in Experiment 4. Here, \(M=N=300{,}000\) and the noisy complexity is derived from Theorem 4 with \(\epsilon =0.03\)

Experiment 4

In this experiment, we used the mass spectral data from human microbiome (McDonald et al. 2018), extracted using MSCluster (Frank et al. 2011). In this problem, there are N spectra \(X={x^1,\ldots ,x^N}\), and a query spectrum y. Our goal is to find the spectra \(x^i\) that maximize a probabilistic model \({\mathbb {P}}(y|x^i)\). \({\mathbb {P}}(y|x)\) is learned assuming that it can be factorized to i.i.d. components. To learn \(p(y\mid x)\), each mass spectra is sorted based on the peak intensities, and in order to reduce the number of parameters we need to learn, instead of the peak ranks we use log of the peak ranks.Footnote 7 Using these data, we learn the joint probability distribution \(p(\log Rank(y_s)=i\mid \log Rank(x_s)=j)\). Among 90,753 data points, we used 70,753 for training, and 20,000 for the test. The joint probability distribution of matching pairs are shown in Fig. 10, before and after logRank transformation.

After learning this probabilistic model, we applied ForestDSH method to the test mass spectra (after \(\log Rank\) transformation), and we were able to speed up the search of 20000 mass spectra nine times, in comparison to brute force search while maintaining true positive rate of \(90\%\). For ForestDSH method, the total runtime is 705, 197ms, while for brute force, the total runtime is 6,022,373ms (eight times slower than ForestDSH), and for LSH, the runtime for achieving \(90\%\) TP rate is 5,495,518ms (seven times slower than ForestDSH). The amount of memory used peaks at 220MB.

Fig. 10
figure 10

Mass spectrometry joint probability distribution in the case of a \(\log _4 Rank\), b \(\log _2 Rank\), and c no \(\log Rank\) filter. For further details on mass spectrometry joint probability distributions, see “Appendix 9”

Experiment 5

In this experiment, we considered a set of 50, 000 images (25, 000 pairs of images) from CIFAR-10 database (Krizhevsky 2009). Each pair of images consists of a grey-scale \(32\times 32\) pixels image and a noisy version of the image constructed by adding independent Gaussian noise \({\mathcal {N}}(\mu =0.5,\sigma =0.039)\) and discretizing pixels to binary. ForestDSH is able to detect true pairs with the success rate of \(99\%\) in 1970 s while brute force approach detects true pairs in 19,123 s (nine times slower than ForestDSH) and LSH detects true pairs within 15,080 s (seven times slower than ForestDSH).

Fig. 11
figure 11

In this figure, five grey-scale \(32\times 32\) pixels images along with their noisy versions are shown. Here, ForestDSH is able to detect true pairs with the success rate of \(99\%\) while being nine times faster than brute force and seven times faster than LSH

7.1 Codes

For the codes, see https://github.com/mohimanilab/ForestDSH.

8 Conclusion

ForestDSH algorithm proposed in this paper is comprehensive and efficient in the sense that for a wide range of probability distributions it enables us to capture the difference between pairs coming from a joint probability distributions and independent pairs. This algorithm is built upon a family of distribution sensitive hashes that are designed using a decision tree structure which is constructed recursively. Moreover, we prove that the decision tree introduced here has a complexity of \(O(N^{\lambda ^*})\) and there is no decision tree with lower overall complexity. We prove that this algorithm outperforms existing state of the art approaches in specific range of distributions, in theory and practice and enabled speeding up the spectral library search in mass spectrometry by a factor of nine. Finally, we should note that ForestDSH has some limitations which are discussed as follows. ForestDSH assumes that the probability distribution function is factorizable to i.i.d. components. Generalizing ForestDSH to the case of general probability distribution function is an open problem and our results here open a path towards solving the general probability mass function problem. Moreover, the special case of factorizable models are simple but crucial models that have been widely used in computational biology, e.g., see Kim and Pevzner (2014), and other areas of data sciences. In the future, we will address the more general problem of Markov chain models. Note that ForestDSH performs faster than MinHash if the data is not sparse. While for a small number of datapoints MinHash is superior to ForestDSH on sparse data, for a larger number of datapoints ForestDSH is performing the same as MinHash. This is mainly due to the fact that constant and logarithmic terms vanish in ratio as the number of datapoints grows.