1 Introduction

With the rapid development of information acquisition technology, we have witnessed explosive growth in the scale of shared data collections, whose items are characterized by sets of relevant features and represented as high-dimensional points. Because such collections contain a large amount of near-duplicate information that heavily consumes indexing resources, a fundamental task is to examine the data for similar items [1] in real-world applications such as duplicated web page removal [2, 3], advertising diversification [4], wireless sensor networks [5], graph sampling [6] and web spam detection [7]. Computing a distance metric between all pairs of data points as the similarity measure imposes demanding storage and computation requirements. Thus, compact storage and efficient distance computation are indispensable for data representations.

Binary codes [8, 9] are attractive data representations for search and retrieval purposes due to their efficiency in computation and storage. The hashing technique [10–16] is a common method for assigning binary codes to data points. The binary codes serve as hash keys, learned from hash functions so as to preserve some notion of similarity in the original feature space. Suitable binary codes should maintain the general hashing property of high collision probabilities for similar items. Furthermore, the storage needed for the binary codes is greatly reduced. Existing hashing methods can be divided into two main categories [13, 17, 18]: data-independent methods and data-dependent methods.

Representative data-independent methods include locality-sensitive hashing (LSH) [10, 11, 19] and its extensions [20–22], and shift-invariant kernel hashing (SIKH) [23]. These data-independent methods need longer codes to achieve satisfactory performance [17], which is inefficient due to the higher storage and computational cost. In contrast, data-dependent methods learn hash functions from the training data to overcome the shortcomings of the data-independent methods. Typical methods include Semantic hashing [24], Spectral hashing (SH) [25], Binary reconstruction embedding (BRE) [21], Semi-supervised hashing (SSH) [26], Self-taught hashing [27], Composite hashing [28], Minimal loss hashing (MLH) [29] and Iterative quantization (ITQ) [17].

In recent decades, approximate distance computation [30, 31] for similarity search has been proposed by several researchers to avoid the running-time bottleneck, since an approximate nearest neighbor is almost as good as the exact one in many cases. Locality-sensitive hashing (LSH) [22] was introduced as an approximate high-dimensional similarity search scheme with provably sub-linear time complexity in the data size. The key idea is to compute randomized hash functions that guarantee a high probability of collision for similar samples. One can then determine near neighbors by hashing the query sample and retrieving the elements stored in the buckets containing that sample.

Minwise hashing is a standard algorithm for estimating set similarities. Several variants, i.e., b-bit minwise hashing [32–37], connected bit minwise hashing [38], f-fractional bit minwise hashing [39] and one permutation hashing [40], have since been proposed from diverse perspectives to improve the original minwise hashing algorithm.

While the original minwise hashing method stores each hashed value using 40 bits [2] or 64 bits [41], the b-bit minwise hashing algorithm [33–36] stores only the lowest b bits of each hashed value and gains substantial advantages in storage space and computational efficiency for large-scale machine learning problems. To improve the effectiveness of document similarity detection, the connected bit minwise hashing algorithm [38] was proposed for computing set similarities among high-dimensional vectors. As an extension of b-bit minwise hashing, the f-fractional bit minwise hashing algorithm [39] was presented to offer a wider range of trade-offs between accuracy and storage space. Its feasibility has been investigated together with the optimal fractional bit that yields the minimum-variance estimator. Instead of the \(k (k\ge 100)\) permutations required by existing minwise hashing algorithms, the one permutation hashing algorithm applies merely one permutation, which not only better preserves the structure of the original data set, but also avoids the expensive preprocessing cost and the associated loss of classification accuracy.

Furthermore, experimental results in [42, 43] have shown the effectiveness of these minwise hashing algorithms on large-scale data sets, and the algorithms have been studied extensively and developed rapidly over the past decades. Given minwise hashing and its many variants, a systematic survey is needed and beneficial for understanding and utilizing this style of data preprocessing technique more easily. The purpose of this paper is to review minwise hashing algorithms in detail and provide an insightful account of current developments. To demonstrate the application prospects of minwise hashing algorithms, we combine the various algorithms with linear Support Vector Machines (SVM) [44] for large-scale classification. Both theoretical analysis and experimental results demonstrate that these algorithms can achieve substantial advantages in accuracy, efficiency and energy consumption. Furthermore, their limitations, major opportunities and challenges, as well as potentially important research directions, are pointed out.

Section 2 of the paper reviews Locality-sensitive Hashing (LSH) and five representative minwise hashing algorithms, i.e., minwise hashing [2, 3], b-bit minwise hashing [33–36], connected bit minwise hashing [38], f-fractional bit minwise hashing [39] and one permutation hashing [40]. Section 3 describes applications of these minwise hashing algorithms to large-scale classification problems. Section 4 introduces several extensions and variants of minwise hashing algorithms. Finally, concluding remarks and future research directions are provided in Sect. 5.

2 Minwise Hashing Algorithms

2.1 Locality-Sensitive Hashing

We begin by briefly reviewing Locality-sensitive Hashing (LSH). A binary hashing function defines a mapping h(x) from \(R^d\) to the discrete set \(\{0,1\}\). In practice, each sample is fed into a number of independent random hashing functions.

LSH is a special type of hashing algorithm with the locality-sensitive property, which basically states that if two samples are similar in the original feature space, their corresponding hash codes should also be similar. A key notion quantifying this property is the collision probability, defined as \(Pr(h(x_1)=h(x_2))\) (the expected agreement of the hash bit over all possible hashing functions). The locality-sensitive property also implies that the collision probability should be monotonically increasing with respect to the pairwise data similarity.

The basic idea behind locality-sensitive hashing (LSH) is to project the data into a low-dimensional binary space. Each data point is mapped to an l-bit binary vector termed the hash key. With proper hash projections, the approximate nearest neighbors can be found in sublinear time with respect to the number of training samples. Define an LSH family \({\mathcal {F}}\) for a metric space \({\mathcal {M}}=(M,d)\), a threshold \(R>0\) and an approximation factor \(c>1\). \({\mathcal {F}}\) is a family of functions \(h:{\mathcal {M}}\rightarrow {\mathcal {S}}\) satisfying the following conditions for any two samples \(x_1,x_2 \in {\mathcal {M}}\) and a function h chosen uniformly at random from \({\mathcal {F}}\):

  • if \(d(x_1,x_2)\le R\), then \(h(x_1) = h(x_2)\) (i.e., \(x_1\) and \(x_2\) collide) with probability at least \(P_1\),

  • if \(d(x_1,x_2)\ge cR\), then \(h(x_1) = h(x_2)\) with probability at most \(P_2\).

A family is interesting when \(P_1>P_2\). Such a family \({\mathcal {F}}\) is called \((R,cR,P_1,P_2)\)-sensitive. For any sample, the hash key is constructed by applying l binary-valued hash functions, i.e., \(h_1,\ldots , h_l\), from the LSH family to the sample.
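As a concrete illustration of how an l-bit hash key is assembled from an LSH family, the following minimal sketch uses sign random projections, one well-known LSH family for angular similarity; the family choice, dimensions and parameter names are illustrative assumptions, not part of the formal definition above.

```python
import numpy as np

def make_srp_hash(dim, l, seed=0):
    """Build an l-bit hash key from sign random projections (illustrative LSH family)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((l, dim))          # one random hyperplane per bit
    def hash_key(x):
        # h_i(x) = 1{<w_i, x> >= 0}; concatenate the l bits into the hash key
        return tuple((planes @ x >= 0).astype(int))
    return hash_key

hash_key = make_srp_hash(dim=100, l=16)
x1 = np.random.default_rng(1).standard_normal(100)
x2 = x1 + 0.05 * np.random.default_rng(2).standard_normal(100)  # a near neighbor of x1
# Similar samples agree on most (often all) bits; dissimilar samples agree on roughly half.
print(sum(a == b for a, b in zip(hash_key(x1), hash_key(x2))), "of 16 bits agree")
```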

2.2 Minwise Hashing

Minwise hashing [2, 3] is a basic algorithmic tool for real-world problems related to set similarity and containment. Computing the size of a set intersection is a crucial problem in information retrieval and machine learning. Minwise hashing applies the idea of the Monte Carlo method, transforming the computation of the size of set intersections into estimating the probability that a particular event occurs. With a large number of permutations k, one can estimate this probability and thereby the document resemblance. It is worth noting that minwise hashing mainly works with binary data represented via w-shingles, which can be viewed either as 0/1 vectors in high dimension or as sets.

Consider two sets \({S_1}\) and \({S_2}\) where \({S_1},{S_2} \subseteq \Omega = \{ 0,1,\ldots , D- 1\}\). A generally used measurement of similarity is the resemblance

$$\begin{aligned} R=\displaystyle \frac{|S_1\cap S_2|}{|S_1\cup S_2|}=\frac{a}{f_1+f_2-a}, \end{aligned}$$
(1)

where \(f_1=|S_1|, f_2=|S_2|, a={|S_1\bigcap S_2|}\).

Apply a random permutation \({\pi }\) to the two sets \(S_1\) and \(S_2\) and store the smallest elements under \({\pi }\), namely \(\min ({\pi }({S_1}))\) and \(\min ({\pi } ({S_2}))\). The collision probability is simply

$$\begin{aligned} {P_r}(\min ({\pi } ({S_1}))=\displaystyle \min ({\pi } ({S_2}))) = \frac{{| {{S_1} \cap {S_2}} |}}{{| {{S_1} \cup {S_2}} |}}. \end{aligned}$$
(2)

Then, the permutation is repeated k times (with k independent permutations) to obtain an unbiased, binomial-proportion estimator of R, together with its variance:

$$\begin{aligned}&\hat{R}_M = \frac{1}{k}\sum \limits _{j = 1}^k 1{\left\{ {\min \left( {{\pi _j}\left( {{S_1}} \right) } \right) = \min \left( {{\pi _j}\left( {{S_2}} \right) } \right) } \right\} }, \nonumber \\&Var(\hat{R}_M) = \frac{1}{k}R(1 - R). \end{aligned}$$
(3)

The common practice is to store each hashed value, e.g., \(\min (\pi ({S_1}))\) and \(\min (\pi ({S_2}))\), using 40 bits [2] or 64 bits [41]. The cost of storage and computation becomes formidable in large-scale applications [45]. To quantify the combined effect of estimation variance, storage space and sample size, the storage factor for minwise hashing with 64 bits per hashed value can be constructed as follows:

$$\begin{aligned} M(R)= & {} 64 \times Var({{\hat{R}}_M}) \times k,\nonumber \\= & {} \displaystyle 64R(1-R). \end{aligned}$$
(4)

However, it is worth noting that the storage factor cannot be regarded as a measure of precision. Moreover, minwise hashing requires 64mk bits of storage, where m is the size of the data set and 64 is the number of bits per hashed value.
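A minimal sketch of Eqs. (1)-(3) follows; the permutations are simulated explicitly with NumPy, an illustrative simplification (production systems typically use universal hash functions rather than storing full permutations).

```python
import numpy as np

def minwise_estimate(S1, S2, D, k, seed=0):
    """Estimate the resemblance R (Eq. 1) by the collision rate of minima (Eqs. 2-3)."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.fromiter(S1, dtype=int), np.fromiter(S2, dtype=int)
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)                 # one random permutation of Omega
        if pi[s1].min() == pi[s2].min():        # min(pi(S1)) == min(pi(S2))
            matches += 1
    return matches / k                          # unbiased estimator R_hat_M

S1, S2 = {1, 3, 5, 7, 9}, {3, 5, 7, 11}
print(len(S1 & S2) / len(S1 | S2))              # exact resemblance: 3/6 = 0.5
print(minwise_estimate(S1, S2, D=16, k=500))    # should be close to 0.5
```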

2.3 b-Bit Minwise Hashing

Computing resemblances in this way is costly in time, storage space and energy consumption for large-scale applications. The b-bit minwise hashing algorithm [33–36] therefore recommends storing only the lowest b bits, instead of 40 bits [2, 3] or 64 bits [41], of each hashed value. By storing only b bits, b-bit minwise hashing gains substantial advantages in storage space and computation speed.

Given two sets \({S_1}, {S_2} \subseteq \Omega = \{ 0,1, \ldots , D-1\}\), define the minimum hashed values of \({S_1}\) and \({S_2}\) under a random permutation \({\pi }\) as \({{Z_1}} = \min ({\pi } ({S_1}))\) and \({Z_2} = \min ({\pi } ({S_2}))\); let \({Z_1^{(b)}}\) (\({Z_2^{(b)}}\)) denote the lowest b bits of the hashed value \({Z_1}\) (\({Z_2}\)), and \({e_{1,i}}\) (\({e_{2,i}}\)) the i-th lowest bit of \({Z_1^{(b)}}\) (\({Z_2^{(b)}}\)). Theorem 1 [33] provides the main probability formulations. Its proof assumes that D is large and that many random permutations are used, which is invariably satisfied in practice.

Theorem 1

Assume D is large.

$$\begin{aligned}&{P_b} =\Pr \left( {Z_1^{(b)}} = {Z_2^{(b)}}\right) \nonumber \\&= \Pr \left( \prod \limits _{i = 1}^b {1\{ {e_{1,i}}={e_{2,i}}\} = 1}\right) \nonumber \\&= {C_{1,b}} + (1 - {C_{2,b}})R, \end{aligned}$$
(5)
$$\begin{aligned}&{r_1} = \frac{{{f_1}}}{D}, {r_2} = \frac{{{f_2}}}{D}, {f_1} = \left| {{S_1}} \right| , {f_2} = \left| {{S_2}} \right| , \end{aligned}$$
(6)
$$\begin{aligned}&{C_{1,b}} = {A_{1,b}}\frac{{{r_2}}}{{{r_1} + {r_2}}} + {A_{2,b}}\frac{{{r_1}}}{{{r_1} + {r_2}}}, \end{aligned}$$
(7)
$$\begin{aligned}&{C_{2,b}} = {A_{1,b}}\frac{{{r_1}}}{{{r_1} + {r_2}}} + {A_{2,b}}\frac{{{r_2}}}{{{r_1} + {r_2}}}, \end{aligned}$$
(8)
$$\begin{aligned}&{A_{1,b}} = \frac{{{r_1}{{[1 - {r_1}]}^{{2^b} - 1}}}}{{1 - {{[1 - {r_1}]}^{{2^b}}}}}, \end{aligned}$$
(9)
$$\begin{aligned}&{A_{2,b}} = \frac{{{r_2}{{[1 - {r_2}]}^{{2^b} - 1}}}}{{1 - {{[1 - {r_2}]}^{{2^b}}}}}. \end{aligned}$$
(10)

For a fixed \(r_{j}\)(where \(j \in \left\{ {1,2} \right\} \) ), \(A_{j,b}\) is a monotonically decreasing function of \(b = 1,2,3 \ldots \) .

For a fixed b, \(A_{j,b}\) is a monotonically decreasing function of \({r_j} \in [0,1]\) reaching a limit: \(\mathop {\mathrm{{lim}}}\limits _{{r_j} \rightarrow 0} {{{A}}_{{{j,b}}}} = \frac{1}{2^b}\).

From Theorem 1, for a fixed b the desired probability (5) is determined by R and the ratios \({r_1} = \displaystyle \frac{{{f_1}}}{D}\) and \({r_2} = \displaystyle \frac{{{f_2}}}{D}\). Meanwhile, \(A_{j,b}\) converges rapidly to zero as b increases and is already close to zero when \(b \ge 32\). If \(R = 1\), then \({r_1} = {r_2}\) and \({C_{1,b}} = {C_{2,b}}\), so \({P_b} = 1\).

Theorem 1 suggests an unbiased estimator \(\hat{R}_b\) for R:

$$\begin{aligned} {\hat{R}_b} = \frac{{{{\hat{P}}_b} - {C_{1,b}}}}{{1 - {C_{2,b}}}}, {\hat{P}_b} = \frac{1}{k}\sum \limits _{j = 1}^k {\left\{ \prod \limits _{i = 1}^b {1\left\{ {e_{1,i,{\pi _j}}} = {e_{2,i,{\pi _j}}}\right\} = 1}\right\} }, \end{aligned}$$
(11)

where \({e_{1,i,{\pi _j}}}\)(\({e_{2,i,{\pi _j}}}\)) is i-th lowest bit of \({Z_1}\) (\({Z_2}\)) under the permutation \(\pi _j\). The variance is

$$\begin{aligned} Var({{\hat{R}}_b}) = \displaystyle \frac{1}{k}\frac{{\left[ {C_{1,b}} + \left( 1 - {C_{2,b}}\right) R\right] \left[ 1 - {C_{1,b}} - \left( 1 - {C_{2,b}}\right) R\right] }}{{{{\left( 1 - {C_{2,b}}\right) }^2}}} \end{aligned}$$
(12)

From (3) and (11), the variance of \(\hat{R} _b\) converges to that of \(\hat{R}_M\) for large b, namely \(\mathop {\mathrm{{lim}}}\limits _{b \rightarrow \infty } {{Var}}(\hat{R}_b) = {{Var}}(\hat{R}_M)\). In practice, \(Var(\hat{R}_b)\) and \(Var(\hat{R}_M)\) are numerically indistinguishable when b reaches 64 bits. Intuitively, by (12), at the same sample size k the estimation variance increases when fewer bits per sample are used, so the accuracy degrades; hence k must be increased to maintain the same accuracy. In brief, at a fixed storage budget, b-bit minwise hashing can improve accuracy while significantly reducing the storage and computational requirements.

With the same sample size k, the space required to store each sample decreases as b decreases. Unfortunately, the estimation variance (12) then increases according to the b-bit minwise hashing theory. The storage factor \(B(b;R,{r_1},{r_2})\) [33–35] is proposed to measure this variance-space trade-off accurately:

$$\begin{aligned} B(b;R,{r_1},{r_2})= & {} b \times Var\left( {{\hat{R}}_b}\right) \times k,\nonumber \\= & {} \displaystyle \frac{{b\left[ {C_{1,b}} + \left( 1 - {C_{2,b}}\right) R\right] \left[ 1 - {C_{1,b}} - \left( 1 - {C_{2,b}}\right) R\right] }}{{{{\left( 1 - {C_{2,b}}\right) }^2}}}. \end{aligned}$$
(13)

A lower \(B(b;R,{r_1},{r_2})\) is better. The ratio \( \displaystyle \frac{{B({b_1};R,{r_1},{r_2})}}{{B({b_2};R,{r_1},{r_2})}}\) measures the improvement of using \(b = {b_2}\) over using \(b = {b_1}\). Moreover, b-bit minwise hashing requires bmk bits of storage, where m is the size of the data set.
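The sketch below implements the unbiased estimator of Eq. (11) with the correction terms \(C_{1,b}\) and \(C_{2,b}\) from Theorem 1, again simulating the k permutations directly (an illustrative simplification; function and variable names are assumptions of this sketch).

```python
import numpy as np

def A_term(r, b):
    # A_{j,b} of Theorem 1
    return r * (1 - r) ** (2**b - 1) / (1 - (1 - r) ** (2**b))

def bbit_estimate(S1, S2, D, k, b, seed=0):
    """b-bit minwise hashing estimator R_hat_b (Eq. 11)."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.fromiter(S1, dtype=int), np.fromiter(S2, dtype=int)
    mask = (1 << b) - 1                               # keep only the lowest b bits
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)
        if (pi[s1].min() & mask) == (pi[s2].min() & mask):
            matches += 1
    P_hat = matches / k                               # empirical version of P_b (Eq. 5)
    r1, r2 = len(S1) / D, len(S2) / D
    A1, A2 = A_term(r1, b), A_term(r2, b)
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)              # C_{1,b} (Eq. 7)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)              # C_{2,b} (Eq. 8)
    return (P_hat - C1) / (1 - C2)                    # R_hat_b (Eq. 11)

print(bbit_estimate({1, 3, 5, 7, 9}, {3, 5, 7, 11}, D=16, k=500, b=2))
```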

2.4 Connected Bit Minwise Hashing

Building on the b-bit minwise hashing theory, connected bit minwise hashing [38] connects the bits obtained by the b-bit minwise hashing algorithm. As an efficient and feasible method for similarity estimation, it greatly reduces the number of comparisons and thus improves the efficiency of similarity estimation with only a minor loss of accuracy. Furthermore, the connected bits are easy to construct, and the resulting performance gains have far-reaching practical significance in large-scale data environments.

Again, consider two sets \({S_1},{S_2} \subseteq \Omega = \{ 0,1,2,\ldots ,D-1\} \) and a random permutation group \(\varvec{\pi }\), where \(\varvec{\pi }=\{\pi _1,\pi _2,\ldots , \pi _k\}\), \(\pi _j: \Omega \rightarrow \Omega \) and \(j\in \{1,2,\ldots ,k\}\). Define the minimum hashed value under \(\varvec{\pi }\) to be \(\varvec{Z_h} = \min (\varvec{\pi }({S_h}))\); \(\varvec{Z_h^{(b)}}\) the lowest b bits of each hashed value in \(\varvec{Z_h}\); and \({e_{h,i,{\pi _j}}}\) the i-th lowest bit of \(\varvec{Z_h^{(b)}}\) under the permutation \(\pi _j\), where \(h\in \{1,2\}\). The sample size k corresponds to k independent random permutations. For the connected bits b and the connected number n, define the variables:

$$\begin{aligned} {x_1}= & {} {e_{1,1,{\pi _1}}}{e_{1,2,{\pi _1}}} \ldots {e_{1,b,{\pi _1}}}{e_{1,1,{\pi _2}}}{e_{1,2,{\pi _2}}} \ldots {e_{1,b,{\pi _2}}}{e_{1,1,{\pi _3}}} \\&\cdots {e_{1,b,{\pi _3}}} \cdots {e_{1,1,{\pi _n}}} \cdots {e_{1,b,{\pi _n}}}, \\ {x_2}= & {} {e_{2,1,{\pi _1}}}{e_{2,2,{\pi _1}}} \cdots {e_{2,b,{\pi _1}}}{e_{2,1,{\pi _2}}}{e_{2,2,{\pi _2}}} \cdots {e_{2,b,{\pi _2}}}{e_{2,1,{\pi _3}}} \\&\cdots {e_{2,b,{\pi _3}}} \cdots {e_{2,1,{\pi _n}}} \cdots {e_{2,b,{\pi _n}}} \end{aligned}$$

We have \({x_1} = {x_2}\) if and only if \({e_{1,1,{\pi _j}}} \cdots {e_{1,b,{\pi _j}}} = {e_{2,1,{\pi _j}}} \cdots {e_{2,b,{\pi _j}}}\) for every \(\pi _j \in \varvec{\hat{\pi }}=\{\pi _1,\pi _2,\ldots ,\pi _n\}\). Note that the permutation group \(\varvec{\hat{\pi }}\) consists of n permutations selected from a k-size sequence of independent random permutations \(\varvec{\pi }\), retaining its original order (\(k\gg n\)).

Compute \({G_{b,n}} = P_r ({x_1} = {x_2})= P_b^n = {[{C_{1,b}} + (1 - {C_{2,b}})R]^n}\). Then, an unbiased estimator \(\hat{R}_{b,n}\) for R from k independent permutations can be obtained as follows:

$$\begin{aligned}&{\hat{R}_{b,n}}= \frac{{{\hat{G}_{b,n}}^{\frac{1}{n}} - {C_{1,b}}}}{{1 - {C_{2,b}}}}, \end{aligned}$$
(14)
$$\begin{aligned}&\hat{G}_{b,n}=\frac{\sum \limits _{j = 1}^{\left\lfloor {\frac{k}{n}} \right\rfloor } {\left\{ {\prod \limits _{i = 1}^n {1\left\{ \begin{array}{c} {e_{1,1,{\pi _{n(j - 1) + i}}}}...{e_{1,b,{\pi _{n(j - 1) + i}}}}\\ = {e_{2,1,{\pi _{n(j - 1) + i}}}}...{e_{2,b,{\pi _{n(j - 1) + i}}}} \end{array} \right\} =1}} \right\} }}{{\left\lfloor {\frac{k}{n}} \right\rfloor }}. \end{aligned}$$
(15)

When \(n=1\), namely \({G_{b,n}} = {G_{b,1}} = {P_b}\), the resemblance estimator \(\hat{R}_b\) of b-bit minwise hashing is recovered as a special case of \(\hat{R}_{b,n}\). Following the properties of the binomial distribution and the delta method [46] in statistics, the variance of \(\hat{R}_{b,n}\) is

$$\begin{aligned} Var(\hat{R}_{b,n}) = \frac{1}{k} \times \frac{{{G_{b,n}}\left( 1 - {G_{b,n}}\right) }}{{{{\left( {1 - {C_{2,b}}}\right) }^2} \times nG_{b,n}^{\frac{{2(n - 1)}}{n}}}}. \end{aligned}$$
(16)

It can be shown that the variance of connected bit minwise hashing is larger than that of b-bit minwise hashing, which slightly affects the accuracy. Fortunately, connected bit minwise hashing greatly reduces the number of comparisons and thereby improves the overall performance.

Similarly, the storage factor \(G(b,n;R,{r_1},{r_2})\) for connected bit minwise hashing [38] quantifies the variance-space trade-off as follows:

$$\begin{aligned} G(b,n;R,{r_1},{r_2})= & {} b \times n \times \left( \frac{k}{n}\right) \times Var(\hat{R}_{b,n}), \nonumber \\= & {} \displaystyle \frac{{b{G_{b,n}}\left( 1 - {G_{b,n}}\right) }}{{{{\left( 1 - {C_{2,b}}\right) }^2} \times nG_{b,n}^{\frac{{2(n - 1)}}{n}}}}. \end{aligned}$$
(17)

Again, a lower \(G(b,n;R,{r_1},{r_2})\) is better. Although connected bit minwise hashing requires the same storage space as b-bit minwise hashing, namely bmk bits where m is the size of the data set, the number of comparisons needed to estimate the resemblance is substantially reduced. However, it is worth noting that the storage factor only reflects the combined effect of estimation variance, storage space and sample size; it does not indicate precision.
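A sketch of the connected bit estimator of Eqs. (14)-(15): n consecutive b-bit values are compared as one block, and the group collision rate \(\hat{G}_{b,n}\) is inverted through \(G_{b,n}=P_b^n\). The correction terms of Theorem 1 are reused; function and parameter names are illustrative assumptions.

```python
import numpy as np

def connected_bit_estimate(S1, S2, D, k, b, n, seed=0):
    """Connected bit minwise hashing estimator R_hat_{b,n} (Eqs. 14-15)."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.fromiter(S1, dtype=int), np.fromiter(S2, dtype=int)
    mask = (1 << b) - 1
    z1, z2 = [], []
    for _ in range(k):
        pi = rng.permutation(D)
        z1.append(int(pi[s1].min()) & mask)           # lowest b bits of each minimum
        z2.append(int(pi[s2].min()) & mask)
    groups = k // n
    hits = sum(all(z1[g * n + i] == z2[g * n + i] for i in range(n))
               for g in range(groups))                # all n blocks in the group match
    G_hat = hits / groups                             # Eq. (15)
    r1, r2 = len(S1) / D, len(S2) / D
    A = lambda r: r * (1 - r) ** (2**b - 1) / (1 - (1 - r) ** (2**b))
    C1 = (A(r1) * r2 + A(r2) * r1) / (r1 + r2)
    C2 = (A(r1) * r1 + A(r2) * r2) / (r1 + r2)
    return (G_hat ** (1.0 / n) - C1) / (1 - C2)       # Eq. (14)

print(connected_bit_estimate({1, 3, 5, 7, 9}, {3, 5, 7, 11}, D=16, k=600, b=2, n=2))
```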

2.5 f-Fractional Bit Minwise Hashing

Although a higher integer bit value decreases the estimator variance and improves accuracy, it consumes more time. Furthermore, the integer bit determined by the b-bit minwise hashing method cannot meet fine-grained requirements on accuracy, computational efficiency and storage space. Thus, the f-fractional bit minwise hashing algorithm [39] was proposed to offer a wider range of choices along these dimensions. The key idea is continuous selectivity of the bit number: instead of a discrete integer value, the bit number is represented by a linear combination of bits obtained by the b-bit minwise hashing method. Correspondingly, the similarity can be estimated, as well as the optimal fractional bit that yields the minimum-variance estimator. The f-fractional bit minwise hashing algorithm not only enriches the theoretical framework of b-bit minwise hashing, but also satisfies diverse accuracy and storage requirements in practical systems.

Consider two sets \({S_1},{S_2} \subseteq \Omega = \{ 0,1,2,\cdots ,D-1\} \) and a random permutation group \(\varvec{\pi }\), where \(\varvec{\pi }=\{\pi _1,\pi _2,\cdots ,\pi _k\}\), \(\pi _j: \Omega \rightarrow \Omega \) and \(j\in \{1,2,\cdots ,k\}\). Define the minimum hashed value under \(\varvec{\pi }\) as \(\varvec{Z_h} = \min (\varvec{\pi }({S_h}))\in R^k\); the lowest \(b_l\) bits for each dimension of \(\varvec{Z_h}\) as \(\varvec{Z_h^{(b_l)}}\in R^k\); the i-th lowest bit of the j-th dimension for \(\varvec{Z_h^{(b_l)}}\) as \(e_{h,i,\pi _j}\), where \(h \in \{1,2\}\) and \(i \in \{1,2,\cdots ,b_l\}\).

Let \(b_1,b_2\) be integer bits with \(b_1 < b_2\) and let \(w_l\) be the proportion of \(b_l\) with \(w_l=\frac{k_l}{k}\), where \(l\in \{1,2\}\) and \(k_1+k_2=k\).

Define f:

$$\begin{aligned} {f=w_1b_1+w_2b_2, w_1+w_2=1,\quad b_1<b_2}. \end{aligned}$$
(18)

Define the j-th dimensional value of \(\varvec{Z_1^{(b_l)}}\) and \(\varvec{Z_2^{(b_l)}}\) as the following variables respectively: \({X_{1,j}}={e_{1,1,\pi _j}}{e_{1,2,\pi _j}} \cdots {e_{1,b_l,\pi _j}}\) and \({X_{2,j}}={e_{2,1,\pi _j}}{e_{2,2,\pi _j}} \cdots {e_{2,b_l,\pi _j}}\), where \(b_l\in \{b_1,b_2\}\) and \({\pi _j}\in \{\pi _1,\pi _2,\cdots ,\pi _k\}\).

Then

$$\begin{aligned}&P_f = \displaystyle P_r(X_{1,j}=X_{2,j}) \nonumber \\&= \displaystyle P_r(b=b_1) P_r\left( {\prod \limits _{i = 1}^{b_1} {1\left\{ {e_{1,i,\pi _j}=e_{2,i,\pi _j}} \right\} = 1} }\right) \nonumber \\&+ \displaystyle P_r(b=b_2) P_r\left( {\prod \limits _{i = 1}^{b_2} {1\left\{ {e_{1,i,\pi _j}=e_{2,i,\pi _j}} \right\} = 1} }\right) \nonumber \\&= \displaystyle w_1\left[ C_{1,b_1}+(1-C_{2,b_1})R_f\right] +w_2\left[ C_{1,b_2}+(1-C_{2,b_2})R_f\right] . \end{aligned}$$
(19)

The unbiased estimator \({\hat{R}_f}\) for \(R_f\) from k independent permutations is:

$$\begin{aligned}&{\hat{R}_f}=\frac{{\hat{P}_f}-\left( w_1C_{1,b_1}+w_2C_{1,b_2}\right) }{1-\left( w_1C_{2,b_1}+w_2C_{2,b_2}\right) }, \end{aligned}$$
(20)
$$\begin{aligned}&\hat{P_f}=\frac{1}{k}(\sum \limits _{j = 1}^{w_1k} {\left\{ {\prod \limits _{i = 1}^{b_1} {1\left\{ {e_{1,i,{\pi _j}}=e_{2,i,{\pi _j}}} \right\} = 1} } \right\} }\nonumber \\&+\sum \limits _{j = 1}^{w_2k} {\left\{ {\prod \limits _{i = 1}^{b_2} {1\left\{ {e_{1,i,{\pi _{w_1k+j}}}=e_{2,i,{\pi _{w_1k+j}}}} \right\} = 1} } \right\} }). \end{aligned}$$
(21)

Note that \(\hat{R_b}\) and \(\hat{P_b}\) are special cases of \(\hat{R_f}\) and \(\hat{P_f}\), obtained with \(w_1=0\), \(w_2=1\) (so that \(f=b_2\)) or with \(w_1=1\), \(w_2=0\) (so that \(f=b_1\)).

Following the properties of the binomial distribution and the delta method [46] in statistics, the variance of \(\hat{R}_{f}\) is

$$\begin{aligned} Var(\hat{R}_{f}) = \frac{1}{k} \times \frac{{w_1}^2{P_{b_1}}(1 - {P_{b_1}})+{w_2}^2{P_{b_2}}(1 - {P_{b_2}})}{{{\left[ 1 - {(w_1C_{2,b_1}+w_2C_{2,b_2}})\right] ^2}}}, \end{aligned}$$
(22)

where \(P_{b_1}=C_{1,b_1}+(1-C_{2,b_1})R_f\) and \(P_{b_2}=C_{1,b_2}+(1-C_{2,b_2})R_f\).

The variance decreases with a larger fractional bit f and lies between the variances of its neighboring integer bits. To satisfy various accuracy and storage space requirements, diverse combinations of \(b_1\) and \(b_2\), with corresponding weights \(w_1\) and \(w_2\), can be formed to constitute the fractional bit f. Similarly, the storage factor \(F(w_1,w_2,b_1,b_2;R,{r_1},{r_2})\) for f-fractional bit minwise hashing [47] can be written as:

$$\begin{aligned} F(w_1,w_2,b_1,b_2;R,{r_1},{r_2})= & {} f \times k \times Var(\hat{R}_{f})\nonumber \\= & {} \displaystyle {(w_1b_1+w_2b_2)}\nonumber \\&\times \frac{{w_1}^2{P_{b_1}}(1 - {P_{b_1}})+{w_2}^2{P_{b_2}}(1 - {P_{b_2}})}{{{\left[ 1 - {(w_1C_{2,b_1}+w_2C_{2,b_2})}\right] ^2}}}. \end{aligned}$$
(23)

According to the accuracy and storage requirements, an appropriate fractional bit f can be selected. The variance is a decreasing function of the bit number, which simultaneously determines the size of the storage space. If f satisfies \({b_1<f<b_2}\), then \({Var(b_1)>Var(f)>Var(b_2)}\) and \(storage(b_1)<storage(f)<storage(b_2)\).

In fact, there exist multiple combinations for any fixed fractional bit f. From (18), \(w_1=\frac{b_2-f}{b_2-b_1}\) and \(w_2=\frac{f-b_1}{b_2-b_1}\). For example, when \(f=2.5\), one may choose \(w_1=0.5\), \(w_2=0.5\), \(b_1=2\), \(b_2=3\), or \(w_1=0.75\), \(w_2=0.25\), \(b_1=2\), \(b_2=4\). For any given integer bits \(b_1\) and \(b_2\) and \(f=f_0\) with \(1\le b_1<f_0<b_2\le 32\), write \(Var(\hat{R}_f)=Var(b_1,b_2,f_0)\). Because the integer values \(b_1\) and \(b_2\) are discrete and bounded, the number of combinations for a fractional bit f is finite. Thus, the variance under each admissible pair \(b_1\) and \(b_2\) can be calculated.

Although various combinations can be made for a fixed fractional bit f, the one that yields the minimum-variance estimator is optimal. In this way, the theoretical framework of minwise hashing algorithms is further developed, with significant implications in large-scale data environments.

For (20), if \(r_1=\frac{f_1}{D}=r_2=\frac{f_2}{D}\), then \(C_{1,b_1}=C_{2,b_1}=C_{b_1}\) and \(C_{1,b_2}=C_{2,b_2}=C_{b_2}\). In particular, \(C_{b_1}\) and \(C_{b_2}\) can be approximated as zero if \(b_1\) and \(b_2\) are large enough. In this case, the variance of \(\hat{R}_{f_{opt}}\) reduces to

$$\begin{aligned} Var\left( \hat{R}_{f_{opt}}\right) =\displaystyle \frac{R_f(1-R_f)\left[ (b_2-f)^2+(f-b_1)^2\right] }{k(b_2-b_1)^2}. \end{aligned}$$
(24)

Setting the partial derivatives of (24) with respect to \(b_1\), \(b_2\) and f to zero gives \(b_1=f\), \(b_2=f\) and \(f=\frac{b_1+b_2}{2}\). Because \(b_1\) and \(b_2\) are integer bits, f is a fractional bit and \(1\le b_1<f<b_2\le 32\), the optimal fractional bit is achieved with \(b_1=\lfloor f\rfloor \), \(b_2=\lceil f\rceil \), \(w_1=\frac{\lceil f\rceil -f}{\lceil f\rceil -\lfloor f \rfloor }\) and \(w_2=\frac{f-\lfloor f\rfloor }{\lceil f\rceil -\lfloor f \rfloor }\), which yield the minimum estimator variance \(Var(\lfloor f\rfloor ,\lceil f\rceil ,f)\). Thus, for a given fractional bit f, the formula of the optimal fractional bit is \(f_{opt}=\frac{\lceil f\rceil -f}{\lceil f\rceil -\lfloor f \rfloor }\times \lfloor f\rfloor + \frac{f-\lfloor f\rfloor }{\lceil f\rceil -\lfloor f \rfloor } \times \lceil f\rceil \). Moreover, the storage factor of \(f_{opt}\) is

$$\begin{aligned} F_{opt}(f;R_f)= & {} f_{opt} \times k \times Var\left( \hat{R}_{f_{opt}}\right) ,\nonumber \\= & {} \displaystyle \frac{f_{opt}\,R_f(1-R_f)\left[ (\lceil f\rceil -f)^2+(f-\lfloor f\rfloor )^2\right] }{(\lceil f\rceil -\lfloor f\rfloor )^2}. \end{aligned}$$
(25)

Besides, f-fractional bit minwise hashing and the optimal fractional bit minwise hashing require fmk and \(f_{opt}mk\) bits of storage, respectively, where m is the size of the data set.
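To make Sect. 2.5 concrete, the sketch below first computes the optimal decomposition of a target fractional bit f into \(b_1=\lfloor f\rfloor\), \(b_2=\lceil f\rceil\) with weights \(w_1, w_2\), and then evaluates the estimator of Eqs. (20)-(21) on precomputed hashed minima; the function names and the way the minima are supplied are assumptions of this illustration.

```python
import math

def optimal_fraction(f):
    """Optimal decomposition of a fractional bit f: b1 = floor(f), b2 = ceil(f)."""
    b1, b2 = math.floor(f), math.ceil(f)
    if b1 == b2:                                    # f is already an integer bit
        return b1, b2, 1.0, 0.0
    w1 = (b2 - f) / (b2 - b1)                       # share of permutations stored at b1 bits
    w2 = (f - b1) / (b2 - b1)
    return b1, b2, w1, w2

def fractional_estimate(z1, z2, f, r1, r2):
    """R_hat_f of Eqs. (20)-(21); z1, z2 are length-k lists of hashed minima."""
    k = len(z1)
    b1, b2, w1, w2 = optimal_fraction(f)
    k1 = int(round(w1 * k))                         # first k1 permutations use b1 bits
    def C_pair(b):
        A = lambda r: r * (1 - r) ** (2**b - 1) / (1 - (1 - r) ** (2**b))
        return ((A(r1) * r2 + A(r2) * r1) / (r1 + r2),
                (A(r1) * r1 + A(r2) * r2) / (r1 + r2))
    m1, m2 = (1 << b1) - 1, (1 << b2) - 1
    hits = sum((z1[j] & m1) == (z2[j] & m1) for j in range(k1)) \
         + sum((z1[j] & m2) == (z2[j] & m2) for j in range(k1, k))
    P_hat = hits / k                                # Eq. (21)
    (C1b1, C2b1), (C1b2, C2b2) = C_pair(b1), C_pair(b2)
    return ((P_hat - (w1 * C1b1 + w2 * C1b2))
            / (1 - (w1 * C2b1 + w2 * C2b2)))        # Eq. (20)
```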

2.6 One Permutation Hashing

For similarity retrieval [1], computing the similarity between sets is a core task. Algorithms such as minwise hashing [2, 3], b-bit minwise hashing [33–36], connected bit minwise hashing [38] and f-fractional bit minwise hashing [39] apply k independent random permutations to the entire data set, leading to expensive costs in time, storage space and energy consumption for similarity computation. In contrast, one permutation hashing [40] utilizes merely one permutation, effectively sampling without replacement, and preserves the matrix sparsity. By doing so, it not only better preserves the structure of the original data set, but also avoids the expensive preprocessing cost.

First, the shingled binary data vector of each sample is viewed as a set consisting of the locations of its nonzero elements. Consider sets \(S_i \subseteq \Omega =\{0,1,2,\ldots ,D-1\}\), where D is the size of the space. The one permutation hashing algorithm permutes each set once and converts it into a binary vector in which a 1 represents the new location of a nonzero element. The D-dimensional binary vector is then divided into r bins, and the location of the smallest nonzero element in each bin is stored for each data vector.

For the theoretical analysis, define the number of “jointly empty bins” and the number of “matched bins”, respectively, as:

$$\begin{aligned} N_{emp}=\sum _{j=1}^k I_{emp,j}, \quad N_{mat}=\sum _{j=1}^k I_{mat,j}, \end{aligned}$$
(26)

where \(I_{emp,j}\) and \(I_{mat,j}\) are defined for the j-th bin, as

$$\begin{aligned} I_{emp,j}= & {} \left\{ \begin{array}{ll} 1 &{} \text {if both } \pi (S_1) \text { and } \pi (S_2) \text { are empty in the } j\text {-th bin},\\ 0 &{} \text {otherwise,} \end{array}\right. \end{aligned}$$
(27)
$$\begin{aligned} I_{mat,j}= & {} \left\{ \begin{array}{ll} 1 &{} \text {if both } \pi (S_1) \text { and } \pi (S_2) \text { are non-empty and the smallest element}\\ &{} \text {of } \pi (S_1) \text { matches the smallest element of } \pi (S_2) \text { in the } j\text {-th bin},\\ 0 &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(28)

Recall the notation: \(f_1=|S_1|\), \(f_2=|S_2|\), \(a=|S_1\cap S_2|\). We also use \(f=|S_1\cup S_2|=f_1+f_2-a\).

Theorem 2

\(\hat{R}_{mat}=\frac{N_{mat}}{k-N_{emp}}\),    \(E(\hat{R}_{mat})=R\),

\(Var\left( \hat{R}_{mat}\right) =R(1-R)\left( E\left( \frac{1}{k-N_{emp}}\right) \left( 1+\frac{1}{f-1}\right) -\frac{1}{f-1}\right) \),

\(E(\frac{1}{k-N_{emp}})=\sum _{j=0}^{k-1}\frac{Pr(N_{emp}=j)}{k-j}\ge \frac{1}{k-E(N_{emp})}\).

The fact that \(E(\hat{R}_{mat})=R\) may seem surprising, as ratio estimators are in general not unbiased. Note that \(k-N_{emp}> 0\), because we assume the original data vectors are not completely empty (all-zero). As expected, when \(k\ll f = f_1 + f_2-a\), \(N_{emp}\) is essentially zero and hence \(Var(\hat{R}_{mat})\approx \frac{R(1-R)}{k}\). In fact, \(Var(\hat{R}_{mat})\) is slightly smaller than \(\frac{R(1-R)}{k}\), especially for large k.

It is probably not surprising that the one permutation scheme (slightly) outperforms the original k-permutation scheme, at merely 1/k of the preprocessing cost, because one permutation hashing, being a “sampling-without-replacement” scheme, provides a better strategy for matrix sparsification.

The hashed values are aligned identically since they are generated from the same permutation. From the perspective of energy consumption, the number of permutations is reduced from k to just one, which is much more computationally efficient and makes the storage practical. Moreover, since the scheme converts many nonzero elements into zero without destroying the original data set, it provides a relatively satisfactory strategy for sparsity. Furthermore, two strategies are adopted for dealing with the empty bins, i.e., the Zero Coding Scheme and the m-Permutation Scheme.
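A minimal sketch of the one permutation scheme and the estimator of Theorem 2 follows: a single permutation is applied, the permuted universe is split into k equal bins (assuming, for simplicity, that D is divisible by k), and \(\hat{R}_{mat}=N_{mat}/(k-N_{emp})\) is returned. Names and the binning convention are illustrative.

```python
import numpy as np

def one_perm_estimate(S1, S2, D, k, seed=0):
    """One permutation hashing estimator R_hat_mat = N_mat / (k - N_emp) (Theorem 2)."""
    rng = np.random.default_rng(seed)
    pi = rng.permutation(D)                          # a single permutation of Omega
    h1 = np.sort(pi[np.fromiter(S1, dtype=int)])     # permuted locations of S1's elements
    h2 = np.sort(pi[np.fromiter(S2, dtype=int)])
    bin_size = D // k                                # assume D divisible by k for simplicity
    n_mat = n_emp = 0
    for j in range(k):
        lo, hi = j * bin_size, (j + 1) * bin_size
        b1 = h1[(h1 >= lo) & (h1 < hi)]              # hashed values of S1 in the j-th bin
        b2 = h2[(h2 >= lo) & (h2 < hi)]
        if b1.size == 0 and b2.size == 0:
            n_emp += 1                               # jointly empty bin (I_emp)
        elif b1.size and b2.size and b1.min() == b2.min():
            n_mat += 1                               # matched bin (I_mat)
    return n_mat / (k - n_emp)

print(one_perm_estimate({1, 3, 5, 7, 9}, {3, 5, 7, 11}, D=16, k=4))
```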

3 Applications of Minwise Hashing Algorithms

In this section, we show how the various minwise hashing algorithms can be combined with linear support vector machines (SVM) for large-scale document classification [48]. Both theoretical analysis and experimental results demonstrate that such combinations achieve substantial advantages in accuracy, efficiency and energy consumption.

Document similarity detection technology [49–52] is an important topic in information processing; it is a powerful tool for protecting authors' intellectual property and improving the efficiency of information retrieval. Generally, the similarity computation problem is transformed into finding sets with a relatively large intersection via the technique known as “shingling”. Representing documents as w-shingle sets produces lower-dimensional data sets. The value of w depends on the length of typical documents and the size of the typical character set, with \(w\ge 5\) in prior studies [2, 41]. Since word frequency distributions within documents approximately follow a power law [53] and most shingle terms occur rarely, it is reasonable and adequate to view w-shingle data sets as 0/1 (absence/presence) vectors in a high-dimensional space. Compared with decimal data, binary-quantized data performs well in experiments. In practice, the problem of insufficient memory capacity often arises; thus, minwise hashing algorithms are beneficial for representing the original data compactly.
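As a small illustration of the shingling step, the snippet below builds character-level w-shingle sets (an illustrative simplification; word-level shingles are handled analogously) and computes their exact resemblance, which the minwise-type algorithms above then approximate.

```python
def char_shingles(text, w=5):
    """Represent a document as a set of character w-shingles (illustrative choice)."""
    text = " ".join(text.lower().split())            # normalize case and whitespace
    return {text[i:i + w] for i in range(len(text) - w + 1)}

doc1 = "minwise hashing estimates set resemblance"
doc2 = "minwise hashing estimates the resemblance of sets"
S1, S2 = char_shingles(doc1), char_shingles(doc2)
print(len(S1 & S2) / len(S1 | S2))                   # exact resemblance of the shingle sets
```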

Connected bit minwise hashing [38], f-fractional bit minwise hashing [39] and one permutation hashing [40] extend b-bit minwise hashing [33–36] to improve the efficiency of similarity estimation without a notable loss of accuracy. The hashed bits are easy to construct, and the resulting performance gains have strong practical significance in environments with massive amounts of data.

This part compares, in practice, the integration of linear SVM with b-bit minwise hashing [35, 36], connected bit minwise hashing [43], f-fractional bit minwise hashing [42] and one permutation hashing [40]. In the theoretical analysis, the positive definiteness of the resemblance matrices generated by these minwise hashing algorithms guarantees that the nonlinear SVM problem [54] can be converted into a linear SVM for large-scale classification. This property allows the kernel matrices to be decomposed into inner products of matrices, which is the foundation of the integration, as shown in Theorems 5, 6 and 7. By solving the resulting large-scale linear SVM with representative software packages such as LIBLINEAR [55], SVM\(^{perf}\) [56], SGD [57, 58] and Pegasos [59], these combinations yield good performance and possess their own merits.

3.1 Integrating Linear SVM with Connected Bit Minwise Hashing Kernel

The connected bit minwise hashing algorithm improves the efficiency of similarity estimation because the number of comparisons is greatly reduced without a notable loss of accuracy. The positive definiteness of the resemblance matrices generated by connected bit minwise hashing and optimal fractional bit minwise hashing serves as the theoretical foundation for integrating linear SVM with them.

Definition 3

[36] A symmetric \(m \times m\) matrix K satisfying \(\sum \limits _{i,j} {{c_i}{c_j}} {K_{ij}} \ge 0\) for all real vectors c is called positive definite (PD).

Lemma 4

[36] Consider m sets \({S_1},\ldots , {S_m} \subseteq \Omega = \{ 0,1,\ldots , D-1\}\). Apply one permutation \(\pi \) to each set. Define \({Z_i} = \min (\pi ({S_i}))\) and let \(Z_i^{(b)}\) be the lowest b bits of \({Z_i}\). The following three matrices are PD.

  1. The resemblance matrix \(R \in {R^{m \times m}}\), whose (i, j)-th entry is the resemblance between set \({S_i}\) and set \({S_j}\): \({R_{ij}} = \displaystyle \frac{{|{S_i} \cap {S_j}|}}{{|{S_i} \cup {S_j}|}} = \frac{{|{S_i} \cap {S_j}|}}{{|{S_i}| + |{S_j}| - |{S_i} \cap {S_j}|}}\).

  2. The minwise hashing matrix \(M \in {R^{m \times m}}\): \({M_{ij}} = 1\{ {Z_i} = {Z_j}\}\).

  3. The b-bit minwise hashing matrix \({M^{(b)}} \in {R^{m \times m}}\): \(M_{ij}^{(b)} = 1\{ Z_i^{(b)} = Z_j^{(b)}\} \).

Consequently, consider k independent permutations and denote \(M_{(s)}^{(b)}\) the b-bit minwise hashing matrix generated by the s-th permutation. Then, the summation \(\displaystyle \sum _{s = 1}^k {M_{(s)}^{(b)}} \) is also PD.

Theorem 5

[43] Consider m sets \({S_1},\ldots ,{S_m} \subseteq \Omega = \{ 0,1,\ldots , D - 1\} \). Apply one permutation group \(\varvec{\pi }\) of size n to each set \(S_i\). Define the minimum value under \(\varvec{\pi }\) as \(\varvec{Z_i} = \min (\varvec{\pi } ({S_i})) =\{\min (\pi _1(S_i)),\min (\pi _2(S_i)),\ldots ,\min (\pi _n(S_i))\}\); let \(\varvec{Z_i^{(b)}}\) be the lowest b bits of each dimension of \(\varvec{Z_i}\), with \(\varvec{Z_i}, \varvec{Z_i^{(b)}} \in R^n\); and let \({Z_i}^{(n,b)}\in R^1\), of length nb, be derived by sequentially connecting the n dimensions of \(\varvec{Z_i^{(b)}}\), where n is the number of connected bits. The matrix generated by connected bit minwise hashing is PD.

Instead of just one permutation group \(\varvec{\pi }\) consisting of n permutations, we use k (\(k \gg n\)) permutations, which yields \(\displaystyle \left\lfloor {k/n} \right\rfloor \) connected bit minwise hashing matrices. Define \({M_{(s)}^{(n,b)}}\) as the connected bit minwise hashing matrix generated by the s-th permutation group of size n, where \(s = 1,2,\ldots ,\displaystyle \left\lfloor {k/n} \right\rfloor \). The summation \(\displaystyle \sum \limits _{s = 1}^{\left\lfloor {k/n} \right\rfloor } {{M_{(s)}^{(n,b)}}} \) is still PD, since \({c^{\top }}\Big [\sum \limits _{s = 1}^{\left\lfloor {k/n} \right\rfloor } {{M_{(s)}^{(n,b)}}\Big ]c = \sum \limits _{s = 1}^{\left\lfloor {k/n} \right\rfloor } {{c^{\top }}{M_{(s)}^{(n,b)}}} } c \ge 0\) for an arbitrary vector c, because each \({M_{(s)}^{(n,b)}}\) is PD.

At first sight, the positive definiteness of \({M_{(s)}^{(n,b)}}\) does not by itself enable efficient linear SVM training, since the hashing is a nonlinear operation. However, a concise strategy is to construct a matrix \(B_s\) that decomposes the resemblance matrix as an inner product, \({M_{(s)}^{(n,b)}} = {B_s^{\top }}B_s\), where \(B_s\) has dimensions \({2^{nb}} \times {2^{nb}}\). Connecting the b-bit minwise hashing values is highly efficient. Meanwhile, the decomposition of the resemblance matrix can be taken as a transformation of the kernel matrix of the SVM model.

Consider a data set \(\left\{ {({x_i},{y_i})} \right\} _{i = 1}^m\), where \({x_i} \in {R^D}\) is a binary data vector and \(y_i \in \{-1,1\}\). Apply k random permutations to each feature vector \(x_i\) and store the lowest b bits of each hashed value. Then, connect every n successive b-bit values to obtain a new data set using mbk bits in total. At run-time, expand each new data point into a \(\left\lfloor {k/n} \right\rfloor \)-dimensional vector and then into a \({2^{nb}} \times \left\lfloor {k/n} \right\rfloor \)-length vector with exactly \(\left\lfloor {k/n} \right\rfloor \) 1’s.

For example, suppose \(k=6\) and the original hashed values are \(\left\{ {4,3,3,3,1,1} \right\} \), whose binary digits are \(\left\{ {100,011,011,011,001,001} \right\} \). Consider \(b=1\); the binary digits are stored as \(\left\{ {0,1,1,1,1,1} \right\} \). Then, set the connected bit number \(n=2\) and connect every 2 consecutive bits of the stored digits, giving \(\left\{ {01,11,11} \right\} \), i.e., \(\left\{ {1,3,3} \right\} \) in decimal. At run-time, expand it into a 12-length (\({2^{nb}} \times \left\lfloor {k/n} \right\rfloor = {2^{2 \times 1}} \times \left\lfloor {6/2} \right\rfloor = 12\)) vector, namely \(\{0,0,1,0,1,0,0,0,1,0,0,0\}\), which is a new feature vector applicable to the “LIBLINEAR” solver. The expansion follows directly from Theorem 5.
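The run-time expansion in this example can be reproduced with a few lines of code; the function below is an illustrative sketch (the one-hot position convention follows the worked example above), producing exactly the 12-length vector shown.

```python
def expand_connected_bits(hashed, b, n):
    """Expand connected b-bit minwise hashed values into the sparse 0/1 feature
    vector fed to the linear SVM solver (convention as in the example above)."""
    low = [v & ((1 << b) - 1) for v in hashed]        # keep the lowest b bits
    groups = len(low) // n
    block = 2 ** (n * b)                              # length of one one-hot block
    vec = [0] * (block * groups)
    for g in range(groups):
        code = 0
        for i in range(n):                            # connect n consecutive b-bit values
            code = (code << b) | low[g * n + i]
        vec[g * block + (block - 1 - code)] = 1       # one-hot position inside the block
    return vec

print(expand_connected_bits([4, 3, 3, 3, 1, 1], b=1, n=2))
# -> [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0], matching the worked example
```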

3.2 Integrating Linear SVM with F-Fractional Bit Minwise Hashing Kernel

The positive definiteness of the resemblance matrices generated by f-fractional bit minwise hashing and optimal fractional bit minwise hashing has been proved, and it serves as the theoretical foundation for integrating linear SVM with them.

Theorem 6

[42] Consider m sets \({S_1},\ldots ,{S_m} \subseteq \Omega = \{ 0,1,\ldots , D - 1\} \) and a permutation group \(\varvec{\pi }\) \((\varvec{\pi }=\{\pi _1,\pi _2, \ldots , \pi _k\})\). Apply the permutation group \(\varvec{\pi }\) to each set. Define the minimum value under the permutation group \(\varvec{\pi }\) as \(\varvec{Z_h} = \min (\varvec{\pi }({S_h}))\); let \(\varvec{Z_h^{(b_l)}}\) denote the lowest \(b_l\) bits of each dimension of \(\varvec{Z_h}\), with \(\varvec{Z_h}\), \(\varvec{Z_h^{(b_l)}}\in R^k\), where \(h \in \{1,\ldots ,m\}\) and \(b_l \in \{b_1,b_2\}\). Define \(f=w_1b_1+w_2b_2\) with \(w_1+w_2=1\); let \(b_1, b_2\) be integer bits with \(b_1 < b_2\), and let \(w_l\) be the proportion of \(b_l\) with \(w_l=\frac{k_l}{k}\), where \(l\in \{1,2\}\). Retaining the order of the original permutation group, \(k_1\) corresponds to the former \(w_1\) fraction of the permutations and \(k_2\) to the latter \(w_2\) fraction. Define \(\varvec{Z_{(h;1,w_1k)}^{(b_l)}} \in R^{w_1k}\) as the former \(w_1k\) dimensions of the k-dimensional vector \(\varvec{Z_h^{(b_l)}}\), and \(\varvec{Z_{(h;w_1k+1,k)}^{(b_l)}} \in R^{w_2k}\) as the latter \(w_2k\) dimensions. Let \(b_1=\lfloor f\rfloor \), \(b_2=\lceil f\rceil \), \(w_1=\frac{\lceil f\rceil -f}{\lceil f\rceil -\lfloor f \rfloor }\) and \(w_2=\frac{f-\lfloor f\rfloor }{\lceil f\rceil -\lfloor f \rfloor }\); the formula of the optimal fractional bit combination is then \(f_{opt}=\frac{\lceil f\rceil -f}{\lceil f\rceil -\lfloor f \rfloor }\times \lfloor f\rfloor + \frac{f-\lfloor f\rfloor }{\lceil f\rceil -\lfloor f \rfloor } \times \lceil f\rceil \). The following two matrices are PD:

  1. The f-fractional minwise hashing matrix \(M^{(f)}\in R^{m\times m}\), whose (i, j)-th entry is the resemblance between the sets \(S_i\) and \(S_j\): \( M_{ij}^{(f)}=1\{ \varvec{Z_{(i;1,w_1k)}^{(b_1)}} =\varvec{Z_{(j;1,w_1k)}^{(b_1)}}\}\times 1\{ \varvec{Z_{(i;w_1k+1,k)}^{(b_2)}} =\varvec{Z_{(j;w_1k+1,k)}^{(b_2)}}\}.\)

  2. The optimal fractional minwise hashing matrix \(M^{(f_{opt})}\in R^{m\times m}\), whose (i, j)-th entry is the resemblance between the sets \(S_i\) and \(S_j\): \( M_{ij}^{(f_{opt})}=1\{ \varvec{Z_{(i;1,w_1k)}^{(\lfloor f \rfloor )}} =\varvec{Z_{(j;1,w_1k)}^{(\lfloor f \rfloor )}}\}\times 1\{ \varvec{Z_{(i;w_1k+1,k)}^{(\lceil f\rceil )}} =\varvec{Z_{(j;w_1k+1,k)}^{(\lceil f\rceil )}}\}\), where \(w_1=\frac{\lceil f\rceil -f}{\lceil f\rceil -\lfloor f \rfloor }\).

However, the PD property of \(M^{(f)}\) and \(M^{(f_{opt})}\) alone does not enable efficient linear SVM training, since the hashing is a nonlinear operation. Nevertheless, a concise strategy can be used to construct matrices B and \(B_{opt}\) satisfying \(M^{(f)}=B^{\top } B\) and \(M^{(f_{opt})}= B_{opt}^{\top } B_{opt}\).

Consider a data set \(\left\{ {({x_i},{y_i})} \right\} _{i = 1}^m\), where \({x_i} \in {R^D}\) is a D-dimensional binary data vector and \(y_i \in \{-1,1\}\). For a given fractional bit f, apply k random permutations to each feature vector \(x_i\) and store the lowest \(b_1\) bits as \(\varvec{Z^{(b_1)}}\) and the lowest \(b_2\) bits as \(\varvec{Z^{(b_2)}}\) of each binary hashed value. Then, select the former \(w_1k\) dimensions of \(\varvec{Z^{(b_1)}}\) and the latter \(w_2k\) dimensions of \(\varvec{Z^{(b_2)}}\) and combine them into a new k-dimensional vector retaining their original order. In this way, a new binary data set is obtained using fkm bits, where \(f=w_1b_1+w_2b_2\). At run-time, convert each new binary data point into a k-dimensional vector in decimals and expand it into a \({(2^{b_1}w_1k+2^{b_2}w_2k)}\)-length vector with exactly k 1’s.

For example, suppose \(k=6\) and the original hashed values are \(\left\{ {4,3,3,3,1,1} \right\} \), whose binary digits are \(\{100,011,011,011,\) \(001,001\}\). Consider \(f=2.5\) with its optimal combination \(f_{opt}=\frac{\lceil f\rceil -f}{\lceil f\rceil -\lfloor f \rfloor }\times \lfloor f\rfloor + \frac{f-\lfloor f\rfloor }{\lceil f\rceil -\lfloor f \rfloor } \times \lceil f\rceil =\frac{1}{2}\times 2+\frac{1}{2}\times 3\). The binary digits are then stored as \(\left\{ {00,11,11,011,001,001} \right\} \), corresponding to \(\{0,3,3,3,1,1\}\) in decimals. At run-time, expand it into a 36-length \((2^{b_1}w_1k+2^{b_2}w_2k={2^{2}\times \frac{1}{2}\times 6+2^{3}\times \frac{1}{2}\times 6}=36)\) vector, namely \(\{0,0,0,1, 1,0,0,0, 1,0,0,0, 0,0,0,0,1,0,0,0, 0,0,0,0,0,0,1,0, 0,0,0,0,0,0,1,0\}\), which is a new feature vector applicable to the “LIBLINEAR” solver. The expansion follows directly from Theorem 6.

3.3 Integrating Linear SVM with One Permutation Hashing Kernel

Based on the assumption that the shingled data vectors are binary, relatively sparse and high-dimensional, the resemblance matrix generated by one permutation hashing can be regarded as the kernel matrix of SVM for large-scale text classification. The positive definiteness of the resemblance matrix provides a solid theoretical foundation for the integration. According to the superior decomposability of the positive definite resemblance matrix, the desired performance can be achieved.

Theorem 7

Consider m sets \({S_1},\ldots ,{S_m} \subseteq \Omega = \{ 0,1,\ldots , D - 1\} \). Apply the one permutation hashing scheme and suppose the space \(\Omega \) is divided evenly into r bins. Assume the number of zero elements is small compared to r, and let “*” denote an empty bin handled by the “zero coding” strategy. The one permutation hashing matrix \(M^{(o)}\in R^{m\times m}\) is PD, where the (x, y)-th entry is the resemblance between the sets \(S_x\) and \(S_y\).

Although \(M^{(o)}\) is a nonlinear operation corresponding to the kernel matrix, its positive definiteness provides a concise strategy for constructing a matrix B satisfying \(M^{(o)}=B^{\top } B\), which justifies the transformation from a nonlinear operation into a linear one. Thus, a model analogous to standard linear SVM can be formed and solved by “LIBLINEAR”.

Consider a data set \(\left\{ {({x_i},{y_i})} \right\} _{i = 1}^m\), where the data vector \({x_i} \in {R^D}\) and \(y_i \in \{-1,1\}\). Apply a random permutation to each feature vector \(x_i\) and view the result as a binary vector, where “1” marks the new location of an existing element. Divide the D-dimensional space evenly into r bins. In each bin of one data vector, select the location of the smallest nonzero element and represent it in binary. Then, store the lowest b bits of each binary value to obtain a new data set using rbm bits. At run-time, convert each new binary value into an r-dimensional vector in decimals and expand it into a \({2^{b}r}\)-length vector with at most r 1’s. “*” denotes an empty bin processed by the “zero coding” strategy.

For example, suppose \(D=25\) and \(r=5\). The original hashed values under the permutation \(\pi \) are \(\{{2,4,7,10,12,17,19}\}\). Take the smallest element of each bin and store them as \(\{ {2,7,10,17,*}\}\), whose binary digits are \(\{10,111,1010,10001,*\}\). Select \(b=2\); the binary digits are then stored as \(\{ 10,11,10,01,*\}\), corresponding to \(\{2,3,2,1,*\}\) in decimals. At run-time, expand them into a 20-length \(({2^{b}}r=2^{2}\times 5=20)\) vector using the “zero coding” strategy, namely \(\frac{1}{\sqrt{5-1}}\{0,1,0,0, 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,0\}\), and regard it as a new feature vector for the linear SVM solver. The expansion is based on the positive definiteness of the one permutation hashing matrix in Theorem 7. Moreover, binary data performs well compared with the original decimal data in the experiments. Therefore, the goal of compressing the original data set for linear SVM training is achieved.
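The expansion with the zero coding strategy can likewise be sketched in a few lines; the function below reproduces the 20-length vector of the example (None marks an empty bin, and the \(1/\sqrt{r-N_{emp}}\) normalization follows the zero coding scheme). The helper name and input format are illustrative assumptions.

```python
import math

def expand_one_perm(bin_minima, b, r):
    """Expand one permutation hashing bin minima into the zero-coded feature vector."""
    n_emp = sum(v is None for v in bin_minima)        # number of empty bins
    scale = 1.0 / math.sqrt(r - n_emp)                # zero coding normalization
    block = 2 ** b
    vec = [0.0] * (block * r)
    for j, v in enumerate(bin_minima):
        if v is None:
            continue                                  # empty bin: its block stays all-zero
        code = v & (block - 1)                        # lowest b bits of the stored minimum
        vec[j * block + (block - 1 - code)] = scale   # one-hot position inside bin j's block
    return vec

print(expand_one_perm([2, 7, 10, 17, None], b=2, r=5))
# -> 1/sqrt(4) placed at the positions shown in the worked example above
```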

3.4 Experimental Comparisons of the Integrations with Minwise Hashing Algorithms

In this subsection, we experimentally compare the integration of linear SVM with b-bit minwise hashing, connected bit minwise hashing, f-fractional bit minwise hashing and one permutation hashing on the large-scale data set webspam, which contains 350,000 samples with 16,609,143 features each. We split the set into 80% for training and the remaining 20% for testing, and repeat every experiment 50 times. LIBLINEAR is chosen as the workhorse, and all experiments are conducted on a workstation with an Intel(R) Xeon(R) CPU (E5-2630@2.30GHz) and 32GB RAM under the CentOS 7 system. The penalty parameter C ranges from 0.01 to 10, and each sample is normalized to a unit vector. Besides, b-bit minwise hashing SVM (BBMH SVM for short) uses \(k\in \{100,300,600,900\}\) and \(b\in \{2,4,6,8\}\); connected bit minwise hashing SVM (CBMH SVM for short) uses \(k\in \{100,300,600,900\}\), \(b\in \{2,4,6,8\}\) and \(n\in \{1,2\}\); f-fractional bit minwise hashing SVM (FBMH SVM for short) uses \(k\in \{100,200,300,400,500\}\) and f varying from 1 to 8 in steps of 0.25; and one permutation hashing SVM (One-perm SVM for short) uses \(r\in \{128,256,512,1024\}\) and \(b\in \{2,4,6,8\}\) (r is the number of bins and b is the number of bits). The best experimental results in terms of accuracy (%) and time (seconds) under the current experimental setup are highlighted in bold.

Table 1 Experimental results accuracy (time) of the BBMH SVM and CBMH SVM algorithms on webspam
Table 2 Experimental results accuracy (time) of the FBMH SVM algorithm on webspam
Table 3 Experimental results accuracy (time) of the One-perm SVM algorithm on webspam

The experimental results in Table 1 illustrate that the CBMH SVM algorithm reduces the processing time with a minor loss of accuracy compared with the BBMH SVM algorithm, so the CBMH SVM and BBMH SVM algorithms each have their own advantages. The experimental results in Table 2 show that the FBMH SVM algorithm with the optimal combination of fractional bits yields better performance with less time consumption. From the results in Table 3, the data compression by the one permutation hashing scheme significantly reduces the storage space and data loading time in large-scale classification. Owing to the better preservation of the original data structure, the One-perm SVM algorithm can be applied on machines with low configurations while keeping experimental accuracy similar to that of the original data set.

In summary, one can select the appropriate processing algorithm according to the characteristics of the data set and the system requirements.

4 Extensions and Variants of Minwise Hashing Algorithms

For similarity retrieval [1], computing the similarity between sets is a core task. The minwise hashing algorithm [2, 3, 60] stores the data set compactly and efficiently computes the distance that characterizes the similarity between data representations. The b-bit minwise hashing algorithm [33–36, 61] reduces the bits of traditional minwise hashing from 64 to b and saves both storage space and computing time. Connected bit minwise hashing [38] and f-fractional bit minwise hashing [39] improve on b-bit minwise hashing in time consumption and accuracy. However, these algorithms require applying k independent random permutations to the entire data set, leading to expensive costs in time, storage space and energy consumption for similarity computation, whereas one permutation hashing [40] uses a single permutation, effectively sampling without replacement, and preserves the matrix sparsity. Recently, many extensions and variants of minwise hashing algorithms have been proposed [62].

Li [63] developed 0-bit consistent weighted sampling (CWS) for efficiently estimating the min-max kernel, which is a generalization of the resemblance kernel originally designed for binary data. Because the estimator of 0-bit CWS constitutes a positive definite kernel, the method can be naturally applied to large-scale data mining problems. By feeding the sampled data from 0-bit CWS to a highly efficient linear classifier, a nonlinear classifier based on the min-max kernel can be trained effectively and approximately. An extensive classification study using kernel machines shows that the min-max kernel often provides an effective measure of similarity for nonnegative data. Although the min-max kernel is nonlinear and might be difficult to use in large-scale industrial applications, 0-bit CWS is a simplification of the original CWS method that builds linear classifiers to approximate min-max kernel classifiers.

The authors of [64] focused on a simple 2-bit coding scheme and developed accurate nonlinear estimators of data similarity based on it. In near neighbor search, a crucial step is to compute or estimate data similarities once a set of candidate data points has been identified by hash table techniques. The 2-bit coding scheme appears to be overall a good choice for building hash tables in near neighbor search and for developing accurate nonlinear estimators.

Shrivastava and Li [65–67] proposed asymmetric minwise hashing (MH-ALSH) to provide a solution for estimating set resemblance. The new scheme utilizes asymmetric transformations to cancel the bias of traditional minwise hashing towards smaller sets, making the final “collision probability” monotonic in the inner product. Theoretical comparisons show that MH-ALSH is provably better than traditional minwise hashing and other recently proposed hashing algorithms for retrieval with binary inner products.

Weighted minwise hashing (WMH) [68] is one of the fundamental subroutines required by many celebrated approximation algorithms and is commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck of these algorithms is the computation of multiple (typically a few hundred to thousands of) independent hashes of the data. Exact weighted minwise hashing removed this expensive barrier with an expected constant amortized time algorithm that needs only a few bits of storage per hash value.

The query complexity of locality-sensitive hashing (LSH) is dominated by the number of hash evaluations, and this number grows with the data size. The work in [69] presented a hashing technique that generates all the necessary hash evaluations for similarity search using one single permutation. The key of the proposed hash function is a “rotation” scheme that densifies the sparse sketches of one permutation hashing in an unbiased fashion, thereby maintaining the LSH property.

The work in [70] studied large-scale regression analysis where both the number of variables and the number of observations may be large, on the order of millions or more. To make progress, one must seek a compromise between statistical and computational efficiency. For this large-scale problem, the proposed min-wise hash random-sign mapping (MRS mapping) is a dimensionality reduction technique for sparse binary design matrices. It allows the construction of variable importance measures and is more amenable to statistical analysis. For both linear and logistic models, finite-sample bounds were given on the prediction error of procedures that perform regression in the new lower-dimensional space after applying MRS mapping.

In [71], Li et al. developed a parallelization scheme using GPUs, which reduced the processing time. Reducing the preprocessing time is highly beneficial in practice, for example for duplicate web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers (when the test data are not preprocessed). The identification of compound-protein interactions plays a key role in drug development, toward the discovery of new drug leads and new therapeutic protein targets; the paper [72] developed a novel chemogenomic method for scalable prediction of compound-protein interactions from heterogeneous biological data using minwise hashing. In [73], twisted tabulation was invented to yield Chernoff-style concentration bounds and high-probability amortized performance bounds for linear probing, and to reduce variance when minwise hashing is used for similarity estimation.

5 Remarks and Future Directions

Minwise hashing schemes can improve computational efficiency and save storage space without a notable loss of accuracy. This paper has offered a systematic review of minwise hashing algorithms and their variants, mainly covering five basic algorithms: minwise hashing, b-bit minwise hashing, connected bit minwise hashing, f-fractional bit minwise hashing and one permutation hashing. Building on these five algorithms, extensions and variants of minwise hashing have been presented. Some of these algorithms have already been used in real-life applications, such as large-scale regression and classification and the scalable prediction of compound-protein interactions. Researchers in data mining, especially those working on SVMs, can benefit from this survey in better understanding the essence of minwise hashing algorithms. Furthermore, their limitations, major opportunities and challenges, as well as potentially important research directions, have been pointed out.

As we can see, the extensions and variants of minwise hashing algorithms are mostly based on data-independent hashing functions, so there is considerable room for developing methods based on data-dependent hashing functions. Such functions should retain the same desirable properties, including good generalization, scalability, simple and easy implementation, algorithmic robustness, and locality sensitivity. From this perspective, we believe that minwise hashing algorithms can produce better results and are worth further consideration.