Abstract
b-bit minwise hashing (bBMWH) is an efficient hashing algorithm used in machine learning. The James–Stein (JS) estimator paradoxically produces a lower mean square error (MSE) than the traditional maximum likelihood estimator. Using 1000 documents from the Bag of Words Dataset (KOS) in the UCI Machine Learning Repository, we computed the pairwise resemblance for all documents in the dataset. We compared the performance of bBMWH with b from 1 to 4 bits, with and without JS estimation, by calculating the precision, recall, F1-score, and MSE in classifying pairs with resemblance ≥ R0, for R0 from 0.30 to 0.60. Our results for R0 = 0.30 demonstrated that for b = 4 with JS estimation, the precision was high at 0.9 for a small sample size k < 100 and was maximized at 1.0 for higher k, while recall decreased to 0.8. For b = 3, JS estimation improved precision without a significant drop in recall. JS estimation decreased the MSE of bBMWH for all b values investigated, especially for small k where the MSE is higher. Our findings may be useful when precision is prioritized over recall, e.g., in spam detection. In cases where we want to estimate pairwise resemblances for machine learning, bBMWH with JS estimation requires a smaller k to achieve the same MSE as bBMWH alone, thus saving computational time and storage space.
12.1 Introduction
Many machine learning applications are faced with very large and high-dimensional datasets, resulting in challenges in scaling up training algorithms and storing the data [1]. Hashing algorithms such as minwise hashing [2, 3] and random projections [4, 5] reduce storage requirements and improve computational efficiency, without compromising estimation accuracy. b-bit minwise hashing (bBMWH) [6, 7] is a recent advance for efficiently (in both time and space) computing resemblances among extremely high-dimensional binary vectors. bBMWH can be seamlessly integrated [1] with linear support vector machine and logistic regression solvers.
In traditional statistical theory, no estimator is uniformly better than the observed average. The paradoxical element of James–Stein estimation is that it contradicts this elemental law whenever there are three or more sets of data, even when the sets are completely unrelated [8, 9]. For example, the unrelated estimates of the average price of HDB flats in Singapore, the chance of rain in London, and the average height of Americans can be combined to obtain an estimate that is better, in terms of mean squared error, than computing the estimates individually. When first proposed, the James–Stein estimator seemed counterintuitive and illogical. However, it has been proven to have a lower mean squared error than the traditional maximum likelihood estimator whenever there are at least three parameters of interest [8, 9].
12.2 Hypothesis
In this study, we hypothesized that adding James–Stein estimation to bBMWH improves the precision, recall, and F1-score and decreases the mean square error of the estimate from the hashing algorithm.
12.3 Materials and Methods
We briefly review the following: James–Stein estimation [9], minwise hashing [2, 3], and b-bit minwise hashing [7].
12.3.1 James–Stein Estimation
Given a random vector \(\boldsymbol{z} \sim \mathcal{N}_N \left( \boldsymbol{\mu}, I \right)\), the James–Stein estimator is defined to be

$$\hat{\boldsymbol{\mu}}^{(JS)} = \left( 1 - \frac{N - 2}{S} \right) \boldsymbol{z},$$

where \(S = \left\Vert \boldsymbol{z} \right\Vert^2\) and \(N\) is the number of true means we want to estimate across datasets.
For \(N = 3\), \(\boldsymbol{\mu}\) could be a vector containing the true average price of HDB flats in Singapore, the true chance of rain in London, and the true average height of Americans. Given some observations, we want to estimate \(\boldsymbol{\mu}\) with \(\hat{\boldsymbol{\mu}}\).
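As a concrete illustration, the estimator above can be sketched in a few lines of NumPy (the input vector below is illustrative, not data from this study):

```python
import numpy as np

def james_stein(z):
    """James-Stein estimate of the mean vector, shrinking z toward the origin.

    Assumes z ~ N_N(mu, I) with N >= 3, following Efron [9].
    """
    z = np.asarray(z, dtype=float)
    N = z.size
    S = np.sum(z ** 2)                  # S = ||z||^2
    return (1.0 - (N - 2) / S) * z      # shrink every component toward 0

# Example: three unrelated observations are shrunk jointly.
z = np.array([3.0, -2.0, 4.0])
print(james_stein(z))                   # each entry scaled by 1 - 1/29
```

Note that all components share one shrinkage factor; this joint treatment is exactly what allows the combined estimate to beat the componentwise averages.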
The maximum likelihood estimator (MLE) for \(\boldsymbol{\mu}\), \(\hat{\boldsymbol{\mu}}^{(\mathrm{MLE})}\), maximizes a likelihood function under an assumed statistical model, so that the observed data is most probable. The likelihood of an N-variate normal distribution has a closed form and can thus be maximized by numerical methods such as the Newton–Raphson method, applied to find the roots of its derivative.
The following theorem is taken from [9] and restated here:
Theorem 1
For \(N \ge 3\), the James–Stein estimator dominates the MLE \(\hat{\boldsymbol{\mu}}^{(\mathrm{MLE})}\) in terms of expected total squared error, that is,

$$E_{\boldsymbol{\mu}} \left\{ \left\Vert \hat{\boldsymbol{\mu}}^{(JS)} - \boldsymbol{\mu} \right\Vert^2 \right\} < E_{\boldsymbol{\mu}} \left\{ \left\Vert \hat{\boldsymbol{\mu}}^{(\mathrm{MLE})} - \boldsymbol{\mu} \right\Vert^2 \right\}$$

for every choice of \(\boldsymbol{\mu}\).
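Theorem 1 can be checked empirically. The following minimal simulation (the dimension, trial count, and true mean vector are illustrative choices) compares the total squared error of the two estimators over repeated draws:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 2000                    # illustrative dimension and trial count
mu = rng.normal(size=N)                 # an arbitrary true mean vector

mse_mle, mse_js = 0.0, 0.0
for _ in range(trials):
    z = mu + rng.normal(size=N)         # z ~ N_N(mu, I); the MLE is z itself
    S = np.sum(z ** 2)
    js = (1 - (N - 2) / S) * z          # James-Stein shrinkage toward 0
    mse_mle += np.sum((z - mu) ** 2)
    mse_js += np.sum((js - mu) ** 2)

print(mse_js < mse_mle)                 # JS achieves lower total squared error
```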
12.3.2 Minwise Hashing
Computing the size of set intersections is a fundamental problem in information retrieval, databases, and machine learning. Given two sets, \(S_1\) and \(S_2\), where

$$S_1, S_2 \subseteq \Omega = \left\{ 0, 1, \ldots, D - 1 \right\},$$

a basic task is to compute the joint size \(a = \left| S_1 \cap S_2 \right|\), which measures the (un-normalized) similarity between \(S_1\) and \(S_2\).
The Jaccard similarity or resemblance, denoted by \(R\), provides a normalized similarity measure:

$$R = \frac{\left| S_1 \cap S_2 \right|}{\left| S_1 \cup S_2 \right|} = \frac{a}{f_1 + f_2 - a}, \quad \text{where } f_1 = \left| S_1 \right|, \ f_2 = \left| S_2 \right|.$$
Computation of all pairwise resemblances takes \({\mathcal{O}}\left( {N^2 D} \right)\) time, as one would need to iterate over all \(\left( {\begin{array}{*{20}c} N \\ 2 \\ \end{array} } \right)\) pairs of vectors and for each pair of vectors, over all \(D\) elements in the set.
In most cases, \(D\) is sufficiently big to make direct computation infeasible.
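For a single pair of sets, the resemblance itself is cheap to compute; the infeasibility comes from the number of pairs. A minimal sketch with illustrative sets:

```python
def resemblance(s1, s2):
    """Exact Jaccard resemblance R = a / (f1 + f2 - a) of two sets."""
    a = len(s1 & s2)                     # joint size a = |S1 n S2|
    return a / (len(s1) + len(s2) - a)   # |S1 u S2| = f1 + f2 - a

S1 = {1, 3, 5, 7}
S2 = {3, 5, 7, 9, 11}
print(resemblance(S1, S2))   # 3 shared out of 6 distinct elements -> 0.5
```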
The original minwise hashing method [2, 3] has become a standard technique for estimating set similarity (e.g., resemblance). We briefly restate the algorithm here as follows:
Suppose a random permutation \(\pi\) is performed on \(\Omega\), i.e., \(\pi : \Omega \rightarrow \Omega\).

A simple probability argument shows that

$$\Pr \left( \min \left( \pi \left( S_1 \right) \right) = \min \left( \pi \left( S_2 \right) \right) \right) = \frac{\left| S_1 \cap S_2 \right|}{\left| S_1 \cup S_2 \right|} = R.$$

After \(k\) minwise independent permutations, \(\pi_1, \pi_2, \ldots, \pi_k\), one can estimate \(R\) without bias, as a binomial probability:

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1 \left\{ \min \left( \pi_j \left( S_1 \right) \right) = \min \left( \pi_j \left( S_2 \right) \right) \right\}.$$
This reduces the time complexity to \({\mathcal{O}}\left( N^2 k \right)\), where k is the number of permutations, trading some accuracy for a shorter running time.
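A minimal sketch of minwise estimation, averaging the indicator of matching minima over k permutations (the universe size D and the sets are illustrative):

```python
import numpy as np

def minhash_resemblance(s1, s2, k, D, seed=0):
    """Estimate R by averaging min-hash collisions over k random permutations."""
    rng = np.random.default_rng(seed)
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)              # random permutation of Omega = {0..D-1}
        z1 = min(pi[j] for j in s1)          # minimum of the permuted set S1
        z2 = min(pi[j] for j in s2)          # minimum of the permuted set S2
        matches += int(z1 == z2)             # the minima collide with probability R
    return matches / k

S1, S2 = {1, 3, 5, 7}, {3, 5, 7, 9, 11}      # true R = 3/6 = 0.5
print(minhash_resemblance(S1, S2, k=500, D=16))
```

With k = 500 permutations, the estimate lands close to the true value 0.5; the binomial standard error is \(\sqrt{R(1-R)/k} \approx 0.022\) here.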
12.3.3 b-Bit Minwise Hashing
By only storing the lowest b bits of each (minwise) hashed value (e.g., b = 1 or 2), b-bit minwise hashing can gain substantial advantages in terms of computational efficiency and storage space [7].
The following theorem is taken from the paper on b-bit minwise hashing by Li and König [7] and restated here:
Theorem 2
Define the minimum values under \(\pi\) to be \(z_1\) and \(z_2\):

$$z_1 = \min \left( \pi \left( S_1 \right) \right), \quad z_2 = \min \left( \pi \left( S_2 \right) \right).$$

Define \(e_{1,i}\) and \(e_{2,i}\) to be the \(i\)th lowest bit of \(z_1\) and \(z_2\), respectively.

Assuming D is large,

$$P_b = \Pr \left( \prod_{i=1}^{b} 1 \left\{ e_{1,i} = e_{2,i} \right\} = 1 \right) = C_{1,b} + \left( 1 - C_{2,b} \right) R,$$

where

$$r_1 = \frac{f_1}{D}, \quad r_2 = \frac{f_2}{D},$$

$$A_{1,b} = \frac{r_1 \left( 1 - r_1 \right)^{2^b - 1}}{1 - \left( 1 - r_1 \right)^{2^b}}, \quad A_{2,b} = \frac{r_2 \left( 1 - r_2 \right)^{2^b - 1}}{1 - \left( 1 - r_2 \right)^{2^b}},$$

$$C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2}, \quad C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2}.$$

\(\hat{R}_b\) is an unbiased estimator of \(R\) (as \(D \to \infty\)):

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \quad \hat{P}_b = \frac{1}{k} \sum_{j=1}^{k} \prod_{i=1}^{b} 1 \left\{ e_{1,i,\pi_j} = e_{2,i,\pi_j} \right\},$$

where \(e_{1,i,\pi_j}\) and \(e_{2,i,\pi_j}\) denote the \(i\)th lowest bit of \(z_1\) and \(z_2\) under the permutation \(\pi_j\), respectively.

From the properties of the binomial distribution, we obtain the variance

$$\mathrm{Var} \left( \hat{R}_b \right) = \frac{\mathrm{Var} \left( \hat{P}_b \right)}{\left( 1 - C_{2,b} \right)^2} = \frac{P_b \left( 1 - P_b \right)}{k \left( 1 - C_{2,b} \right)^2}.$$
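A minimal sketch combining Theorem 2 with the estimator \(\hat{R}_b\), using illustrative sets and parameters (the closed-form constants follow the large-D assumption of the theorem):

```python
import numpy as np

def bbit_minhash_estimate(s1, s2, b, k, D, seed=0):
    """b-bit minwise estimate R_hat_b = (P_hat_b - C_1b) / (1 - C_2b), per Li & Konig [7]."""
    rng = np.random.default_rng(seed)
    idx1 = np.fromiter(s1, dtype=int)
    idx2 = np.fromiter(s2, dtype=int)
    mask = (1 << b) - 1                          # keep only the lowest b bits
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)
        z1, z2 = pi[idx1].min(), pi[idx2].min()  # minimum hashed values
        matches += int((z1 & mask) == (z2 & mask))
    P_hat = matches / k                          # empirical match probability

    # Closed-form constants from Theorem 2 (assuming D is large).
    r1, r2 = len(s1) / D, len(s2) / D
    A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
    A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    return (P_hat - C1) / (1 - C2)

S1, S2 = set(range(100)), set(range(50, 150))    # true R = 50/150 = 1/3
est = bbit_minhash_estimate(S1, S2, b=4, k=2000, D=1000)
```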
12.3.4 Experiment
We used Python 3.7.10 with vectorization to implement bBMWH and James–Stein estimation. We also used it to plot our graphs of the results. We computed the precision, recall, F1-score, and MSE at various R0 values, using bBMWH with b = 1,2,3,4 bits with and without James–Stein estimation and also the original minwise hashing. We aimed to determine the smallest bit possible to save storage space and improve computational efficiency, while maintaining good levels of precision and recall.
Our experiment adopted a methodology similar to that of Experiment 3 in the landmark b-bit minwise hashing paper by Li and König [7].
The dataset used is the collection of the first 1000 documents (499,500 pairs) from the Bag of Words Dataset (KOS) in the UCI Machine Learning Repository [10].
We represented the i-th document as a binary vector Xi of size w, where w = 6906 is the total number of distinct words in the dataset. The j-th element of Xi is 1 if the j-th word occurs in the document and 0 otherwise.
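A minimal sketch of this representation, using a hypothetical toy corpus in place of the KOS documents:

```python
import numpy as np

# Toy corpus standing in for the KOS documents (purely illustrative).
docs = [["hello", "world"],
        ["hello", "there", "world"],
        ["foo", "bar"]]

vocab = sorted({word for doc in docs for word in doc})   # w distinct words
index = {word: j for j, word in enumerate(vocab)}

# X[i, j] = 1 if word j occurs in document i, else 0.
X = np.zeros((len(docs), len(vocab)), dtype=np.uint8)
for i, doc in enumerate(docs):
    for word in doc:
        X[i, index[word]] = 1

print(X.shape)   # (number of documents, w)
```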
We then computed the true pairwise resemblances for all documents in the dataset using the binary vectors and counted the number of pairs with R ≥ R0 (Table 12.1).
We conducted our experiment for R0 ∈ {0.3, 0.4, 0.5, 0.6} to represent the range covered in the abovementioned experiment. We ran bBMWH to compute the estimates of pairwise resemblances between vectors in X, represented as a square matrix. We then took the upper triangular portion of this matrix and flattened it to get a vector res of \(\binom{N}{2}\) elements, representing the list of pairwise resemblances. We then ran James–Stein estimation on this vector, which shrank the results toward 0, to obtain another vector jsres.
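A minimal sketch of this step (the function name and the toy matrix are illustrative; the shrinkage uses the James–Stein factor with N equal to the number of pairs):

```python
import numpy as np

def shrink_resemblances(res_matrix):
    """Flatten the upper triangle of a resemblance matrix, then JS-shrink toward 0."""
    n = res_matrix.shape[0]
    iu = np.triu_indices(n, k=1)               # strictly upper-triangular indices
    res = res_matrix[iu]                       # n*(n-1)/2 pairwise estimates
    S = np.sum(res ** 2)
    jsres = (1.0 - (res.size - 2) / S) * res   # James-Stein shrinkage toward 0
    return res, jsres

# Illustrative 3x3 symmetric matrix of estimated resemblances.
M = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.9],
              [0.3, 0.9, 1.0]])
res, jsres = shrink_resemblances(M)
```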
We compared these estimates to the true resemblances by selecting all elements with R ≥ R0 and computing the precision and recall of these estimates in identifying pairs with R ≥ R0.
Using the precision and recall, we calculated the F1-score. We also calculated the mean squared error (MSE) of these estimates with the true resemblance. This was done for k = 500 permutations and averaged over 100 iterations.
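A minimal sketch of the metric computation (function names and sample values are illustrative):

```python
import numpy as np

def classification_metrics(est, truth, R0):
    """Precision, recall, F1 and MSE of estimated resemblances vs. the true values."""
    pred = est >= R0                          # pairs classified as similar
    actual = truth >= R0                      # pairs that truly are similar
    tp = np.sum(pred & actual)
    precision = tp / max(np.sum(pred), 1)
    recall = tp / max(np.sum(actual), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    mse = np.mean((est - truth) ** 2)
    return precision, recall, f1, mse

est = np.array([0.35, 0.25, 0.55])            # illustrative estimates
truth = np.array([0.40, 0.35, 0.50])          # illustrative true resemblances
p, r, f1, mse = classification_metrics(est, truth, R0=0.30)
```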
12.4 Results and Discussion
We present here our findings for R0 = 0.30, which are the most significant.
All experiments were done for k = 500 permutations and averaged over 100 iterations.
We plotted all graphs with error bars to represent the variance in our obtained values instead of relying on point estimates.
12.4.1 Precision
For b ≤ 2, the precision for both bBMWH and bBMWH combined with James–Stein estimation was low at less than 0.2. This agrees with previous research [7] where using b = 1 bit per hashed value yields a sufficiently high precision only when R0 ≥ 0.5.
At b = 3, the precision for bBMWH increased to 0.8 even for a small k = 100. Adding James–Stein estimation to bBMWH further increased the precision.
At b = 4, the precision for bBMWH increased to 0.9 for k < 100. Adding James–Stein estimation to bBMWH further increased the precision to near 1.0 (Fig. 12.1).
12.4.2 Recall
A high recall of more than 0.9 was obtained for bBMWH alone and bBMWH with James–Stein estimation at the lowest bit of b = 1. This result is in accordance with previous research [7], where recall values are all very high even when very few bits are used and R0 is low. The recall values are all very high and do not differentiate well between bBMWH with or without James–Stein estimation (except for low values of k).
However, for b = 4, the recall value decreased to 0.8 when James–Stein estimation was added. An increase in precision with James–Stein estimation results in a decrease in recall (Fig. 12.2).
Adding James–Stein estimation to bBMWH would be useful for applications where precision needs to be optimized over recall, for example, in spam detection where important emails wrongly classified as spam will not be seen by users.
12.4.3 Mean Square Error
The MSE does not depend on the chosen threshold value R0 and is therefore the same for all R0 values.
Adding James–Stein estimation to bBMWH decreased the MSE for all b values. At small sample sizes of k, where MSE is higher, adding James–Stein estimation to bBMWH was especially useful in reducing the MSE as compared to bBMWH alone (Fig. 12.3).
bBMWH (for small values of b) does require more permutations than the original minwise hashing; for example, using b = 1 requires increasing k by a factor of 3 when the resemblance threshold is R0 = 0.5. In the context of machine learning and b-bit minwise hashing, for some datasets k has to be fairly large, e.g., k = 500 or even more, and this can be expensive [11]. This is because machine learning algorithms use all similarities, not just highly similar pairs.
Our results have potential applications in machine learning algorithms by achieving a low MSE without a need to increase the number of permutations k, thus saving on computational time and cost.
12.5 Conclusion and Implications
To our knowledge, we are the first to study the effect of adding James–Stein estimation to the b-bit minwise hashing algorithm.
At a low b-bit value of b = 4, the precision for bBMWH was high at 0.9 for a small sample size k < 100. Adding James–Stein estimation to bBMWH further increased the precision to 1.0 and decreased the recall to 0.8. For b = 3, James–Stein estimation increased the precision while not significantly decreasing the recall value. Adding James–Stein estimation to bBMWH would be useful for applications where precision needs to be optimized over recall, for example, in spam detection.
Detecting similar pairs of documents represented in a Bag of Words format is useful for search engines seeking to omit duplicate results in searches. For search engines, precision is important as a webpage wrongly marked as duplicate and omitted from search results will experience a decrease in site traffic, and users would be unable to obtain a complete search result. On the other hand, a small drop in recall that results in some duplicate pages not being omitted from search results will not significantly impact users’ experience. Thus, improving the precision of classification of similar pairs while sacrificing slight recall is useful.
Adding James–Stein estimation to bBMWH decreased the MSE for all b values in the experiment. At small values of k, where MSE is typically higher, adding James–Stein estimation to bBMWH was especially useful in reducing the MSE. Our results have potential applications in machine learning algorithms by achieving a low MSE without a need to increase the number of permutations k, thus saving on computational time and cost.
References
Li, P., Shrivastava, A., Moore, J., & König, A. C. (2011). Hashing algorithms for large-scale learning. In Proceedings of the 24th international conference on neural information processing systems (pp. 2672–2680), December 2011.
Broder, A. Z. (1997). On the resemblance and containment of documents. In Proceedings of the compression and complexity of sequences (pp. 21–29), June 1997.
Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8–13), 1157–1166.
Kang, K., & Hooker, G. (2017). Random projections with control variates. In Proceedings of 6th international conference on pattern recognition application methods (ICPRAM) (pp. 138–147).
Kang, K. (2017). Random projections with Bayesian priors. In Proceedings of national CCF conference on natural language processing and Chinese computing (NLPCC), Dalian, China, pp. 170–182, Nov 2017.
Li, P., & König, A. C. (2011). Theory and applications of b-Bit minwise hashing. Communications of The ACM—CACM, 54(8), 101–109.
Li, P., & König, A. C. (2010). b-bit minwise hashing. In Proceedings of 10th international conference on WWW, Raleigh, NC, pp. 671–680, Apr 2010.
Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. Scientific American, 236(5), 119–127.
Efron, B. (2010). Empirical Bayes and the James–Stein estimator. In Large-scale inference: Empirical Bayes methods for estimation, testing, and prediction (Institute of Mathematical Statistics Monographs) (pp. 1–14). Cambridge: Cambridge University Press.
UCI Machine Learning Repository Centre for Machine Learning and Intelligent Systems. Bag of Words Dataset (KOS). https://archive.ics.uci.edu/ml/datasets/bag+of+words. Irvine, CA: University of California, School of Information and Computer Science. Last retrieved, 2 January 2021.
Li, P., Shrivastava, A., & König, A. C. (2013). b-Bit minwise hashing in practice. In Proceedings of 5th Asia-Pacific symposium on internetware (no. 13, pp. 1–10), Oct 2013.
Cite this paper
Toh, J.E.D., Kan, R.X.M., Kang, K. (2022). Applying James–Stein Estimation to b-Bit Minwise Hashing. In: Guo, H., Ren, H., Wang, V., Chekole, E.G., Lakshmanan, U. (eds) IRC-SET 2021. Springer, Singapore. https://doi.org/10.1007/978-981-16-9869-9_12