
12.1 Introduction

Many machine learning applications are faced with very large and high-dimensional datasets, resulting in challenges in scaling up training algorithms and storing the data [1]. Hashing algorithms such as minwise hashing [2, 3] and random projections [4, 5] reduce storage requirements and improve computational efficiency without compromising estimation accuracy. b-bit minwise hashing (bBMWH) [6, 7] is a recent advance for efficiently computing (in both time and space) resemblances among extremely high-dimensional binary vectors. bBMWH can be seamlessly integrated [1] with linear support vector machine and logistic regression solvers.

In traditional statistical theory, no other estimator is uniformly better than the observed average. The paradoxical element of James–Stein estimation is that it contradicts this elementary rule whenever there are three or more sets of data, even when the sets are completely unrelated [8, 9]. For example, estimates of the average price of HDB flats in Singapore, the chance of rain in London, and the average height of Americans can be combined to obtain estimates that are better, in terms of mean squared error, than estimates computed individually. When first proposed, the James–Stein estimator seemed counterintuitive and illogical. However, it has been proven to have lower mean squared error than the traditional maximum likelihood estimator when there are at least three parameters of interest [8, 9].

12.2 Hypothesis

In this study, we hypothesized that adding James–Stein estimation to bBMWH improves the precision, recall, and F1-score and decreases the mean square error of the estimate from the hashing algorithm.

12.3 Materials and Methods

We briefly review the following: James–Stein estimation [9], minwise hashing [2, 3], and b-bit minwise hashing [7].

12.3.1 James–Stein Estimation

Given a random vector \({{\varvec{z}}} \sim {\mathcal{N}}_N \left( {{{\varvec{\mu}}}, I} \right)\), the James–Stein estimator is defined to be

$$ {\hat{\varvec{\mu }}}^{\left( {{{\varvec{JS}}}} \right)} = \left( {1 - \frac{N - 2}{S}} \right) {{\varvec{z}}} $$
(12.1)

where \(S = \left\| {{\varvec{z}}} \right\|^2\).

\(N\) is the number of true means we want to estimate across datasets.

For \(N = 3\), \({{\varvec{\mu}}}\) could be a vector containing the true average price of HDB flats in Singapore, the true chance of rain in London, and the true average height of Americans. Given some observations, we want to estimate \({{\varvec{\mu}}}\) with \({\hat{\varvec{\mu }}}\).

The maximum likelihood estimator (MLE) for \({{\varvec{\mu}}}\), denoted \({\hat{\varvec{\mu }}}^{\left( {{{\varvec{MLE}}}} \right)}\), maximizes the likelihood function under an assumed statistical model, so that the observed data are most probable. The likelihood of an N-variate normal distribution has a closed form and can be maximized, for example, by using numerical methods such as the Newton–Raphson method to find the roots of its derivative; for the model above, the MLE is simply \({\hat{\varvec{\mu }}}^{\left( {{{\varvec{MLE}}}} \right)} = {{\varvec{z}}}\).

The following theorem is taken from [9] and restated here:

Theorem 1

For \(N \ge 3\), the James–Stein estimator dominates the MLE \({\hat{\varvec{\mu }}}^{\left( {{{\varvec{MLE}}}} \right)}\) in terms of expected total squared error; that is,

$$ {{\varvec{E}}}_{{\varvec{\mu}}} \left\{ {\left\| {{\hat{\varvec{\mu }}}^{\left( {{{\varvec{JS}}}} \right)} - {{\varvec{\mu}}}} \right\|^2 } \right\} < {{\varvec{E}}}_{{\varvec{\mu}}} \left\{ {\left\| {{\hat{\varvec{\mu }}}^{\left( {{{\varvec{MLE}}}} \right)} - {{\varvec{\mu}}}} \right\|^2 } \right\} $$
(12.2)

for every choice of \({{\varvec{\mu}}}\).
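
As an illustration of Eq. (12.1), the following is a minimal Python sketch of the shrinkage step; the function name js_estimate and the example values are our own illustration, not taken from [9].

```python
# A minimal sketch of the James-Stein shrinkage in Eq. (12.1); the function
# name js_estimate and the example values are illustrative, not from [9].
import numpy as np

def js_estimate(z):
    """Shrink the observed vector z toward 0 as in Eq. (12.1)."""
    z = np.asarray(z, dtype=float)
    N = z.size
    if N < 3:
        raise ValueError("James-Stein shrinkage requires N >= 3")
    S = np.sum(z ** 2)                   # S = ||z||^2
    return (1.0 - (N - 2) / S) * z

# Example: three unrelated observations shrunk jointly toward 0.
print(js_estimate([2.0, -1.0, 0.5]))    # -> roughly [1.62, -0.81, 0.40]
```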

12.3.2 Minwise Hashing

Computing the size of set intersections is a fundamental problem in information retrieval, databases, and machine learning. Given two sets, \(S_1\) and \(S_2\), where

$$ S_1 , S_2 \subseteq {\Omega } = \left\{ {0, 1, 2, \ldots , D - 1} \right\}, $$

a basic task is to compute the joint size \(a = \left| {S_1 \cap S_2 } \right|\), which measures the (un-normalized) similarity between \(S_1\) and \(S_2\).

The Jaccard similarity or resemblance, denoted by R, provides a normalized similarity measure:

$$ R = \frac{{\left| {S_1 \cap S_2 } \right|}}{{\left| {S_1 \cup S_2 } \right|}} = \frac{a}{f_1 + f_2 - a }\quad {\text{where}}\,f_1 = \left| {S_1 } \right|, f_2 = \left| {S_2 } \right| $$

Computation of all pairwise resemblances takes \({\mathcal{O}}\left( {N^2 D} \right)\) time, as one would need to iterate over all \(\left( {\begin{array}{*{20}c} N \\ 2 \\ \end{array} } \right)\) pairs of vectors and for each pair of vectors, over all \(D\) elements in the set.

In most applications, \(D\) is large enough to make direct computation infeasible.
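
To make the cost concrete, the following is a brute-force Python sketch of the exact computation, assuming the \(N\) sets are stored as rows of a binary NumPy matrix of width \(D\); it is the \({\mathcal{O}}(N^2 D)\) baseline described above, not an optimized implementation.

```python
# A brute-force sketch of the exact pairwise resemblance computation described
# above; it assumes the N sets are rows of a binary NumPy matrix X of width D.
# This is the O(N^2 D) baseline, not an optimized implementation.
import numpy as np

def exact_resemblances(X):
    """Return the (N, N) matrix of Jaccard resemblances R between rows of X."""
    X = np.asarray(X, dtype=bool)
    N = X.shape[0]
    R = np.eye(N)
    for i in range(N):
        for j in range(i + 1, N):
            a = np.sum(X[i] & X[j])          # a = |S1 ∩ S2|
            union = np.sum(X[i] | X[j])      # |S1 ∪ S2| = f1 + f2 - a
            R[i, j] = R[j, i] = a / union if union else 0.0
    return R
```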

The original minwise hashing method [2, 3] has become a standard technique for estimating set similarity (e.g., resemblance). We briefly restate the algorithm here as follows:

Suppose a random permutation \(\pi\) is performed on \({\Omega }\), i.e.,

$$ \pi {:}\, {\Omega } \to {\Omega },\quad {\text{where}}\,{ } {\Omega } = \left\{ {0, 1, \ldots , D - 1} \right\} $$

A simple probability argument shows that

$$ {\bf{Pr}}\left( {\min \left( {\pi \left( {S_1 } \right)} \right) = \min \left( {\pi \left( {S_2 } \right)} \right)} \right) = \frac{{\left| {S_1 \cap S_2 } \right|}}{{\left| {S_1 \cup S_2 } \right|}} = R $$
(12.3)

After \( k\) minwise independent permutations, \(\pi_1 ,\pi_2 , \ldots , \pi_{k } ,\) one can estimate \(R\) without bias, as a binomial probability,

$$ \hat{R}_M = \frac{1}{k}\mathop \sum \limits_{j = 1}^k 1\left\{ {\min \left( {\pi_j \left( {S_1 } \right)} \right) = \min \left( {\pi_j \left( {S_2 } \right)} \right)} \right\} $$
(12.4)
$$ {\text{Var}}\left( {\hat{R}_M } \right) = \frac{1}{k}R\left( {1 - R} \right). $$
(12.5)

This reduces the time complexity of the pairwise comparisons to \({\mathcal{O}}\left( {N^2 k} \right)\), where \(k\) is the number of permutations (typically \(k \ll D\)), trading some estimation accuracy for a large reduction in computation time.
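
A minimal Python sketch of this estimator is given below, assuming the sets are rows of a binary matrix over \({\Omega} = \{0, \ldots, D-1\}\) and that every set is non-empty; the function names minhash_signatures and estimate_R are illustrative.

```python
# A minimal sketch of minwise hashing (Eqs. 12.3-12.5), assuming the sets are
# rows of a binary matrix over Omega = {0, ..., D-1} and every set is non-empty;
# the function names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def minhash_signatures(X, k):
    """Return an (N, k) matrix whose (i, j) entry is min(pi_j(S_i))."""
    X = np.asarray(X, dtype=bool)
    N, D = X.shape
    sigs = np.empty((N, k), dtype=np.int64)
    for j in range(k):
        perm = rng.permutation(D)            # random permutation pi_j of Omega
        for i in range(N):
            sigs[i, j] = perm[X[i]].min()    # minimum of the permuted elements of S_i
    return sigs

def estimate_R(sig1, sig2):
    """Unbiased estimate R_hat_M: fraction of permutations with matching minima (Eq. 12.4)."""
    return np.mean(sig1 == sig2)
```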

12.3.3 b-Bit Minwise Hashing

By only storing the lowest b bits of each (minwise) hashed value (e.g., b = 1 or 2), b-bit minwise hashing can gain substantial advantages in terms of computational efficiency and storage space [7].

The following theorem is taken from the paper on b-bit minwise hashing by Li and Konig [7], which we restate here as follows:

Theorem 2

Define the minimum values under \(\pi\) to be \(z_1\) and \(z_2\):

$$ z_1 = \min \left( {\pi \left( {S_1 } \right)} \right),\quad z_2 = \min \left( {\pi \left( {S_2 } \right)} \right). $$

Define \(e_{1,i} \) and \(e_{2,i}\) to be the \( i\) th lowest bit of \(z_1 \) and \(z_2\), respectively.

$$ E_b = {{\varvec{Pr}}}\left( {\mathop \prod \limits_{i = 1}^b 1\left\{ {e_{1,i} = e_{2,i} } \right\} = 1} \right). $$
(12.6)

Assuming D is large,

$$ {{\varvec{Pr}}} \left( {\mathop \prod \limits_{i = 1}^b 1\left\{ {e_{1,i} = e_{2,i} } \right\} = 1} \right) = C_{1,b} + \left( {1 - C_{2,b} } \right)R $$
(12.7)

where

$$ r_1 = \frac{f_1 }{D} , r_2 = \frac{f_2 }{D} , $$
(12.8)
$$ C_{1,b} = A_{1,b} \frac{r_2 }{{r_1 + r_2 }} + A_{2,b} \frac{r_1 }{{r_1 + r_2 }}, $$
(12.9)
$$ C_{2,b} = A_{1,b} \frac{r_1 }{{r_1 + r_2 }} + A_{2,b} \frac{r_2 }{{r_1 + r_2 }} $$
(12.10)
$$ A_{1,b} = \frac{{r_1 \left[ {1 - r_1 } \right]^{2^b - 1} }}{{1 - \left[ {1 - r_1 } \right]^{2^b } }}, A_{2,b} = \frac{{r_2 \left[ {1 - r_2 } \right]^{2^b - 1} }}{{1 - \left[ {1 - r_2 } \right]^{2^b } }} $$
(12.11)

\(\hat{R}_b\) is an unbiased estimator of \(R\):

$$ \hat{R}_b = \frac{{\hat{E}_b - C_{1,b} }}{{1 - C_{2,b} }}, $$
(12.12)
$$ \hat{E}_b = \frac{1}{k}\mathop \sum \limits_{j = 1}^k 1\left\{ {\mathop \prod \limits_{i = 1}^b 1\left\{ {e_{1,i,\pi_j } = e_{2,i,\pi_j } } \right\} = 1} \right\}, $$
(12.13)

where \(e_{1,i, \pi_j }\) and \(e_{2,i, \pi_j }\) denote the \(i\)th lowest bits of \(z_1\) and \(z_2\) under the permutation \(\pi_j\), respectively.

Following the properties of the binomial distribution, we obtain

$$ \begin{aligned} {\text{Var}} \left( {\hat{R}_b } \right) & = \frac{{{\text{Var}}\left( {\hat{E}_b } \right)}}{{\left[ {1 - C_{2,b} } \right]^2 }} = \frac{1}{k} \frac{{ E_b \left( {1 - E_b } \right)}}{{\left[ {1 - C_{2,b} } \right]^2 }} \\ & = \frac{1}{k}\frac{{\left[ {C_{1,b} + \left( {1 - C_{2,b} } \right)R} \right]\left[ {1 - C_{1,b} - \left( {1 - C_{2,b} } \right)R} \right]}}{{\left[ {1 - C_{2,b} } \right]^2 }} \\ \end{aligned} $$
(12.14)
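
The following Python sketch assembles Eqs. (12.8)–(12.13) into an estimate of R from two minwise signatures; the function name bbit_estimate_R and its arguments are illustrative, and the signatures are assumed to be the integer minima under the k permutations (for example, two rows of the signature matrix from the earlier minwise hashing sketch).

```python
# A sketch that assembles Eqs. (12.8)-(12.13) into the estimator R_hat_b; the
# function name and arguments are illustrative. z1 and z2 are assumed to be
# length-k arrays of minwise minima under pi_1, ..., pi_k.
import numpy as np

def bbit_estimate_R(z1, z2, b, f1, f2, D):
    r1, r2 = f1 / D, f2 / D                                            # Eq. (12.8)
    A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))    # Eq. (12.11)
    A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)                     # Eq. (12.9)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)                     # Eq. (12.10)

    mask = (1 << b) - 1                                  # keep only the lowest b bits
    E_hat = np.mean((np.asarray(z1) & mask) == (np.asarray(z2) & mask))  # Eq. (12.13)
    return (E_hat - C1) / (1 - C2)                                     # Eq. (12.12)
```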

12.3.4 Experiment

We used Python 3.7.10 with vectorization to implement bBMWH and James–Stein estimation, and to plot the graphs of our results. We computed the precision, recall, F1-score, and MSE at various R0 values, using bBMWH with b = 1, 2, 3, 4 bits, with and without James–Stein estimation, as well as the original minwise hashing. We aimed to determine the smallest number of bits b that saves storage space and improves computational efficiency while maintaining good levels of precision and recall.

Our experiment adopted a methodology similar to that of Experiment 3 in the landmark b-bit minwise hashing paper by Li and Konig [7].

The word dataset used is the collection of the first 1000 documents (499,500 pairs) from the Bag of Words Dataset (KOS) in the UCI Machine Learning Repository [10].

We represented the i-th document as a binary vector Xi of size w, the total number of distinct words in the dataset; for this dataset, w = 6906. The j-th element of Xi is 1 if the j-th word occurs in document i and 0 otherwise (see the sketch after Table 12.1).

We then computed the true pairwise resemblances for all documents in the dataset using the binary vectors and counted the number of pairs with R ≥ R0 (Table 12.1).

Table 12.1 Document pairs in the dataset with R ≥ R0
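
A sketch of how the binary document vectors and true resemblances could be constructed is shown below; it assumes the standard UCI docword.kos.txt format (three header lines followed by "docID wordID count" triples with 1-based IDs), and the file path and variable names are illustrative, not taken from [7] or [10].

```python
# A sketch of building the binary document vectors X_i and the true pairwise
# resemblances; the file path and variable names are illustrative, and we
# assume the standard UCI docword.kos.txt format.
import numpy as np

n_docs = 1000                                     # first 1000 documents
with open("docword.kos.txt") as f:
    f.readline()                                  # total number of documents (unused)
    w = int(f.readline())                         # vocabulary size (w = 6906 for KOS)
    f.readline()                                  # total number of non-zeros (unused)
    X = np.zeros((n_docs, w), dtype=bool)
    for line in f:
        doc_id, word_id, _count = map(int, line.split())
        if doc_id <= n_docs:
            X[doc_id - 1, word_id - 1] = True     # presence/absence only

# Exact pairwise resemblances, vectorized: R = a / (f1 + f2 - a).
Xf = X.astype(float)
inter = Xf @ Xf.T                                 # a = |S_i ∩ S_j|
sizes = Xf.sum(axis=1)                            # f_i = |S_i|
R_true = inter / (sizes[:, None] + sizes[None, :] - inter)

iu = np.triu_indices(n_docs, k=1)                 # the 499,500 unordered pairs
print(np.sum(R_true[iu] >= 0.3))                  # count of pairs with R >= R0
```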

We conducted our experiment for R0 ∈ {0.3, 0.4, 0.5, 0.6} to cover the range used in the abovementioned experiment. We ran bBMWH to compute estimates of the pairwise resemblances between the vectors in X, represented as a square matrix. We then took the upper triangular portion of this matrix and flattened it to obtain a vector res of \(\binom{N}{2}\) elements, representing the list of pairwise resemblance estimates. We then ran James–Stein estimation on this vector, which shrank the estimates toward 0, to obtain another vector jsres, as sketched below.
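
The flattening and shrinkage step could look as follows, assuming R_est holds the square matrix of bBMWH resemblance estimates and js_estimate is the earlier James–Stein sketch (both names are ours).

```python
# A sketch of the flattening and shrinkage step, assuming R_est is the square
# matrix of bBMWH resemblance estimates and js_estimate is the earlier
# James-Stein sketch (both names are ours).
import numpy as np

iu = np.triu_indices(n_docs, k=1)   # indices of the C(N, 2) = 499,500 pairs
res = R_est[iu]                     # flattened vector of estimated resemblances
jsres = js_estimate(res)            # shrink the estimates toward 0 (Eq. 12.1)
```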

We compared these estimates against the true resemblances by treating pairs whose estimated resemblance was at least R0 as predicted positives and computing the precision and recall of the estimates in identifying pairs with true R ≥ R0.

Using the precision and recall, we calculated the F1-score. We also calculated the mean squared error (MSE) of these estimates with the true resemblance. This was done for k = 500 permutations and averaged over 100 iterations.
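
A sketch of the evaluation at a given threshold R0 is shown below; the variable names res, jsres, R_true, and iu refer to the earlier sketches and, like the function evaluate, are our own illustration rather than part of [7].

```python
# A sketch of the evaluation at a threshold R0; res, jsres, R_true, and iu
# refer to the earlier sketches, and all names are illustrative.
import numpy as np

def evaluate(estimates, truth, R0):
    pred = estimates >= R0                        # pairs predicted to satisfy R >= R0
    actual = truth >= R0                          # pairs that truly satisfy R >= R0
    tp = np.sum(pred & actual)
    precision = tp / np.sum(pred) if np.sum(pred) else 0.0
    recall = tp / np.sum(actual) if np.sum(actual) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    mse = np.mean((estimates - truth) ** 2)       # MSE against the true resemblances
    return precision, recall, f1, mse

R_true_vec = R_true[iu]                           # true resemblances, same pair order
for name, est in [("bBMWH", res), ("bBMWH + JS", jsres)]:
    print(name, evaluate(est, R_true_vec, R0=0.3))
```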

12.4 Results and Discussion

We present here the findings for R0 = 0.30, which are the most significant.

All experiments were done for k = 500 permutations and averaged over 100 iterations.

We plotted all graphs with error bars to represent the variance in our obtained values instead of relying on point estimates.

12.4.1 Precision

For b ≤ 2, the precision for both bBMWH and bBMWH combined with James–Stein estimation was low at less than 0.2. This agrees with previous research [7] where using b = 1 bit per hashed value yields a sufficiently high precision only when R0 ≥ 0.5.

At b = 3, the precision for bBMWH increased to 0.8 even for a small k = 100. Adding James–Stein estimation to bBMWH further increased the precision.

At b = 4, the precision for bBMWH increased to 0.9 for k < 100. Adding James–Stein estimation to bBMWH further increased the precision to near 1.0 (Fig. 12.1).

Fig. 12.1 Precision of bBMWH with and without James–Stein estimation with R0 = 0.30 for b = 1, 2, 3, 4 and original minwise hashing

12.4.2 Recall

A high recall of more than 0.9 was obtained for bBMWH alone and bBMWH with James–Stein estimation at the lowest bit value of b = 1. This result is in accordance with previous research [7], where recall values are all very high even when very few bits are used and R0 is low. The recall values are all very high and do not clearly differentiate bBMWH with James–Stein estimation from bBMWH alone (except at low values of k).

However, for b = 4, the recall decreased to 0.8 when James–Stein estimation was added; the increase in precision from James–Stein estimation comes at the cost of a decrease in recall (Fig. 12.2).

Fig. 12.2 Recall of bBMWH with and without James–Stein estimation with R0 = 0.30 for b = 1, 2, 3, 4 and original minwise hashing

Adding James–Stein estimation to bBMWH would be useful for applications where precision needs to be optimized over recall, for example, in spam detection where important emails wrongly classified as spam will not be seen by users.

12.4.3 Mean Square Error

MSE is not affected by the chosen threshold value R0 and is therefore the same for all R0 values.

Adding James–Stein estimation to bBMWH decreased the MSE for all b values. At small numbers of permutations k, where the MSE is higher, adding James–Stein estimation to bBMWH was especially useful in reducing the MSE compared to bBMWH alone (Fig. 12.3).

Fig. 12.3 MSE of bBMWH with and without James–Stein estimation for b = 1, 2, 3, 4

bBMWH (for small values of b) does require more permutations than the original minwise hashing; for example, using b = 1 requires increasing k by a factor of about 3 when the resemblance threshold is R0 = 0.5. In the context of machine learning with b-bit minwise hashing, k has to be fairly large for some datasets, e.g., k = 500 or even more, and this can be expensive [11]. This is because machine learning algorithms use all similarities, not just the highly similar pairs.

Our results have potential applications in machine learning algorithms by achieving a low MSE without a need to increase the number of permutations k, thus saving on computational time and cost.

12.5 Conclusion and Implications

To our knowledge, we are the first to study the effect of adding James–Stein estimation to the b-bit minwise hashing algorithm.

At a low number of bits, b = 4, the precision for bBMWH was high at 0.9 for a small sample size of k < 100 permutations. Adding James–Stein estimation to bBMWH further increased the precision to near 1.0 and decreased the recall to 0.8. For b = 3, James–Stein estimation increased the precision without significantly decreasing the recall. Adding James–Stein estimation to bBMWH would therefore be useful for applications where precision needs to be prioritized over recall, for example, in spam detection.

Detecting similar pairs of documents represented in a Bag of Words format is useful for search engines seeking to omit duplicate results in searches. For search engines, precision is important as a webpage wrongly marked as duplicate and omitted from search results will experience a decrease in site traffic, and users would be unable to obtain a complete search result. On the other hand, a small drop in recall that results in some duplicate pages not being omitted from search results will not significantly impact users’ experience. Thus, improving the precision of classification of similar pairs while sacrificing slight recall is useful.

Adding James–Stein estimation to bBMWH decreased the MSE for all b values in the experiment. At small values of k, where MSE is typically higher, adding James–Stein estimation to bBMWH was especially useful in reducing the MSE. Our results have potential applications in machine learning algorithms by achieving a low MSE without a need to increase the number of permutations k, thus saving on computational time and cost.