1 Introduction

K-nearest neighbors searching (KNNS) aims to find the K closest objects to a query object. It is a fundamental problem in clustering analysis [5, 8, 9, 10], classification [35, 36] and outlier detection [15], and it has wide applications in text information retrieval, search engines, content-based information query, duplicate detection, etc.

The brute-force method for KNNS first computes the distances between the query point and every other point and then selects the K points with the minimum distances. However, its computational cost is high, especially for big data.
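For reference, a minimal brute-force KNNS sketch in Python with NumPy (illustrative only; the function name brute_force_knn and the random data are ours, not part of the paper):

```python
import numpy as np

def brute_force_knn(data, query, K):
    """Return the indices of the K points in `data` closest to `query`.

    data:  (n, d) array of points
    query: (d,) query point
    """
    # Euclidean distance from the query to every point
    dists = np.linalg.norm(data - query, axis=1)
    # Indices of the K smallest distances
    return np.argsort(dists)[:K]

# Example: 10000 random 100-dimensional points
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 100))
q = rng.normal(size=100)
print(brute_force_knn(X, q, K=5))
```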

In order to overcome this shortcoming, tree-based indexes such as the KD-tree [29], M-tree [11] and cover tree [2] are used to accelerate KNNS. In [6], the authors propose a fast exact KNNS method based on a semi-convex hull tree and GPU. Nevertheless, these tree-based methods employ bisection techniques to construct the tree structure, and this partitioning may not be suitable for high-dimensional data, because the internal nodes of high-dimensional data [22, 23] overlap heavily with other nodes.

Some researchers construct flat indexes instead of tree structures by using clustering algorithms to search K-nearest neighbors. In [32], the authors propose the KMKNN algorithm for KNNS. It first divides the data into clusters with the k-means algorithm, then selects the K-nearest neighbors from the clusters closest to the query object, and avoids unnecessary distance computations by using the triangle inequality. However, the clusters produced by k-means may be of poor quality, and the KNNS result of KMKNN degrades accordingly. In [1], the authors present a KNNS method based on various-widths clustering, called kNNVWC. It produces compact and well separated clusters for high-dimensional data by using a novel partitioning clustering algorithm, which improves the performance of KNNS.

Since the curse of dimensionality makes exact nearest neighbors searching nearly infeasible, researchers have proposed approximate nearest neighbor searching algorithms [24]. The approximate strategy is a good way to balance accuracy and efficiency. In [7], FLANN with the priority k-means tree, an approximate KNNS method, is introduced to improve DBSCAN. Locality sensitive hashing (LSH)-based methods are approximate KNNS algorithms that are often used for high-dimensional data. LSH constructs hash tables by employing a family of hash functions to project similar objects into the same bucket and dissimilar objects into different buckets. When searching the K-nearest neighbors of a query point, an LSH-based method only needs to compute the distances between the query point and the points in the same bucket as the query point, which greatly reduces the search range. The hash functions depend only on the distance measure between objects, which alleviates the "curse of dimensionality".

Among the existing LSH-based KNNS methods, p-stable LSH [13] is a popular one. It works directly on points in Euclidean space. The authors have proved that if the data satisfy the "bounded growth" property, p-stable LSH finds the exact K-nearest neighbors in O(log n) time. However, to maintain high accuracy, p-stable LSH requires a large amount of memory to store the hash tables. To solve this problem, Lu et al. [21] present a randomness-based LSH for KNNS, called RLSH. When searching K-nearest neighbors, it selects only one hash table, instead of all hash tables, to project the query object, and then selects candidate objects from the bucket that the query object belongs to. It has been shown that RLSH improves the quality of the candidates even when only a few hash tables are used. However, for these methods, the parameters used to construct the LSH index, such as the number of hash tables L and the number of hash functions in each hash table k, affect the result of KNNS.

To solve the above mentioned problem, we propose a robust LSH algorithm for KNNS, called Robust-LSH. It is inspired by the observation that objects close to the query object have a high collision probability in many hash tables, that is, the query object and its nearest neighbors appear together frequently in many hash tables. We first find the objects that frequently appear with the query object, which we call candidate seeds. Then, objects that frequently collide with the candidate seeds are added to the candidate set. Robust-LSH can use fewer hash tables to search K-nearest neighbors with high quality and less query time. Experiments comparing with KD-tree, p-stable LSH and RLSH illustrate the advantages of the proposed algorithm.

The rest of this paper is organized as follows. Section 2 reviews the related work about LSH. Section 3 introduces p-stable LSH. After that, we present the proposed algorithm Robust-LSH in Section 4. Experiments and analysis are shown in Section 5. Section 6 concludes this work and discusses future work.

2 Related work

In this section, we introduce KNNS methods based on indexing data structures and on LSH.

To search K-nearest neighbors, several indexing data structures have been proposed, such as the KD-tree, M-tree and cover tree, to name a few. These tree-based algorithms build the tree by employing a binary partition strategy so that similar objects fall in the same leaf node. The tree-based index is built by recursively partitioning the data space until all the data are contained in the nodes. However, the binary partitioning method may not be appropriate for high-dimensional data [1]. Consequently, these tree-based KNNS methods fail to process high-dimensional data, because they do not scale well with increasing dimension [33].

In order to quickly obtain the K-nearest neighbors of high-dimensional and large-scale data, approximate KNNS methods have been proposed, among which LSH is the most notable [19]. It was first proposed by Indyk et al. [16, 17]. Its main idea is to use hash functions to map the data into hash buckets so that similar objects have a high probability of falling into the same bucket. The hash functions depend on the distance measure, such as the l1 distance [17], lp distance [13], Jaccard similarity [3] and kernel functions [19]. An LSH for angles (i.e., cosine similarity in Euclidean space), called SIMHASH, is presented by Charikar [4]. The authors in [25] propose an entropy-based LSH to reduce the space consumption of SIMHASH. In [12], the authors use randomized Hadamard transforms in a non-linear setting to quickly estimate the Euclidean distance for high-dimensional data; although the space requirement is reduced, the query time increases. On the basis of stable distributions for lp norms, Datar et al. [13] developed p-stable LSH.

In order to maintain high recall and precision for KNNS, LSH-based methods must store enough hash tables. However, as the scale of data grows, the hash tables take up more space, which degrades the efficiency of searching K-nearest neighbors. Therefore, improved methods have been proposed, such as multi-probe strategies, binary-code storage, and distributed LSH methods.

Multi-probe LSH [26] is one of the methods employing a multi-probe strategy. It increases the number of candidate points for a query point by looking up multiple hash buckets that are more likely to contain the K-nearest neighbors, so that the recall of K-nearest neighbors searching is improved and the space and time cost is reduced. Gu et al. [14] improve the probe sequence by using a new probability model and a query-adaptive algorithm. Entropy-based LSH [25] also utilizes a multi-probe strategy. RLSH (randomness-based locality-sensitive hashing) [21] randomly selects one hash table, instead of all hash tables, to project the query object and reconstructs the candidate points for searching K-nearest neighbors. The multi-probe strategy-based algorithms reduce the memory consumption while keeping comparable recall and precision.

There are also methods that convert each point into a compact binary code to reduce memory consumption, such as spectral hashing [20, 34], semi-supervised hashing [30] and self-supervised hashing [31]. Spectral hashing uses a spectral graph partitioning method to learn the binary codes and shows promising performance. Wang et al. [30] propose a semi-supervised hashing method. It tries to minimize the empirical error over the labeled data and maximize the variance and independence of hash bits over the labeled and unlabeled data, so that efficient hash codes are obtained. In [31], the authors greatly reduce the memory demand with a self-supervised hashing method.

Some researchers focus on developing distributed or parallel LSH algorithms. Koga et al. [18] implement LSH in a distributed way. In [37], the authors propose a Hadoop and collision-counting based KNN algorithm, called H-c2KNN. The algorithm uses the MapReduce framework to realize KNNS for high-dimensional data and fully exploits the co-occurrence information of LSH. Some algorithms have also been proposed to solve the problem of parameter setting when constructing the LSH index. Slaney et al. [28] use two histograms to optimize the parameters.

Locality-sensitive hashing methods provide an index for searching the K-nearest neighbors of large-scale and high-dimensional data. However, they require large space and time costs when searching nearest neighbors among the candidate points, and the difficulty of setting appropriate parameters affects their performance. We propose a robust LSH algorithm for KNNS, which is robust to the parameter setting and reduces the space requirement and time cost.

3 The p-stable LSH method

LSH transforms points from the original space into a new space. If two points are close to each other in the original space, there is a large probability that they are mapped into the same bucket, while two points that are far from each other have a small probability of being mapped to the same bucket. Thus, after LSH maps the original data into the new space, nearest neighbors often fall into the same bucket. Searching K-nearest neighbors over the whole data is then transformed into searching K-nearest neighbors over the small subset of data that falls into the same bucket as the query object after mapping. The p-stable LSH [13] is a widely used LSH method. It can be used directly for data in Euclidean space without any embedding.

This section introduces p-stable LSH in detail. First, it constructs multiple hash functions from a hash family based on a p-stable distribution. Then, only the objects that are projected into the same hash buckets as the query object are included in the query range. Finally, it computes the distances between the objects in the query range and the query object and obtains the K-nearest neighbors of the query object by sorting the distances.

In statistics, if any linear combination of independent and identically distributed random variables has the same distribution (up to scale) as the variables themselves, the random variables obey a stable distribution. Next, we introduce the p-stable distribution.

Definition 1

(p-stable distribution). For any n real numbers \(r_1, r_2, \ldots , r_n\) and n random variables \(X_1, X_2, \ldots , X_n\) that are independent and identically distributed with distribution D, if there exists \(p\ge 0\) such that the random variables \(\sum \nolimits _i r_i X_i\) and \((\sum \nolimits _i |r_i|^p)^{1/p} X\) have the same distribution, where X is a random variable with distribution D, then D is called a p-stable distribution.

It has been proved that stable distributions exist for any \(p \in (0,2]\). The Gaussian distribution \(D_G\) with density function \(g(x) = \frac{1}{\sqrt{2\pi }}e^{-x^2/2}\) is 2-stable. Therefore, p-stable LSH often uses the Gaussian distribution to generate the hash functions.
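As a numerical illustration of 2-stability (our own sanity check, not part of the original method), the following Python snippet compares samples of \(\sum \nolimits _i r_i X_i\) with samples of \((\sum \nolimits _i r_i^2)^{1/2} X\) for Gaussian \(X_i\) and X:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([0.5, -1.2, 2.0, 0.3])   # arbitrary real coefficients r_i
m = 100000                            # number of Monte Carlo samples

# Left-hand side: sum_i r_i * X_i with X_i ~ N(0, 1)
lhs = rng.normal(size=(m, r.size)) @ r

# Right-hand side: (sum_i |r_i|^2)^(1/2) * X with X ~ N(0, 1)
rhs = np.sqrt(np.sum(r ** 2)) * rng.normal(size=m)

# Both samples should have (approximately) the same mean and standard deviation
print(lhs.mean(), lhs.std())   # ~0, ~sqrt(sum r_i^2)
print(rhs.mean(), rhs.std())   # ~0, ~sqrt(sum r_i^2)
```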

Definition 2

(Locality sensitive hashing). Let p and q be any two points in the d-dimensional real data space S. A function family \(F=\{f: S\rightarrow U\}\) is a \((d_1, d_2, p_1, p_2)\)-sensitive hash family if it satisfies the following conditions: when the distance between p and q in the original space S is less than \(d_1\), the probability of a collision between p and q in the new space U (i.e., \(f(p)=f(q)\)) is larger than \(p_1\); and when the distance between p and q in the original space S is larger than \(d_2\), the probability of a collision between p and q (i.e., \(f(p)=f(q)\)) is less than \(p_2\). Here f(p) denotes the hash value of p after being mapped by \(f\in F\).

It should be noted that the hash family F is useful only if \(p_1>p_2\) and \(d_1<d_2\), which means that two points whose distance is less than \(d_1\) are more likely to collide. LSH often concatenates multiple hash functions generated from a hash family to reduce false collisions.

For a given integer k, we obtain the concatenated function \(g(p)=(f_1(p), f_2(p), \ldots , f_k(p))\), where the k hash functions \(f_i\) are drawn from the hash family \(F=\{f: S\rightarrow U\}\), so that \(g: S\rightarrow U^k\). Therefore, when the distance between p and q is less than \(d_1\), the probability of collision between p and q in the new space (i.e., \(g(p)=g(q)\)) is larger than \(p_1^k\), and when the distance between p and q is larger than \(d_2\), the probability of collision between p and q (i.e., \(g(p)=g(q)\)) is less than \(p_2^k\).

Concatenating multiple hash functions decreases the collision probability of two points that are far apart, but at the same time it also reduces the collision probability of two close points. Therefore, multiple hash tables are required to solve this problem. We use L functions \(g_1, g_2, \ldots , g_L\), and each function \(g_i\) is the concatenation of k hash functions. Each hash function is computed as:

$$\begin{aligned} f_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{W} \right\rfloor \end{aligned}$$
(1)

In Eq. 1, v is an object in the d-dimensional real data space, a and b are the parameters of the hash function, and W is the width of the projection. a is a vector with the same dimension as v, generated from the Gaussian distribution. b is a real number representing the offset, generated from the uniform distribution on [0, W].
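For concreteness, a minimal Python sketch of Eq. 1 (the function name psl_hash and the parameter values are illustrative assumptions):

```python
import numpy as np

def psl_hash(v, a, b, W):
    """Compute one p-stable LSH value f_{a,b}(v) = floor((a.v + b) / W)."""
    return int(np.floor((np.dot(a, v) + b) / W))

rng = np.random.default_rng(0)
d, W = 100, 4.0
a = rng.normal(size=d)     # a ~ Gaussian (2-stable) distribution
b = rng.uniform(0, W)      # b ~ Uniform[0, W]
v = rng.normal(size=d)     # an example data point
print(psl_hash(v, a, b, W))
```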

The steps of searching K-nearest neighbors based on p-stable LSH mainly include constructing the hash tables and searching the nearest neighbors for the query point. When constructing one hash table, it first randomly generates k vectors a and offsets b. After that, each point is projected with Eq. 1 to obtain k hash values. Then, the k hash values are concatenated and converted into one integer. Within a hash table, points that are close to each other are assigned to the same bucket. The above steps are repeated L times to build L hash tables. The detailed process of constructing the LSH index is shown in Algorithm 1.

Figure 1 gives an example of a hash table. Figure 1(a) shows the original data and Figure 1(b) shows a hash table obtained by mapping the original data. We can see from the figure that points 1, 6, 7, 3, etc. are mapped to the same bucket because their distances are relatively small. Similarly, points 10, 13, 22 and 19 are assigned to the same bucket. However, no other points are assigned to the same buckets as points 26, 27 and 28.

When searching nearest neighbors, the p-stable LSH method uses the same hash functions in each hash table to compute the hash value of the query point and obtains the candidates by taking the union of the points having the same hash value as the query point over the L hash tables. Then, the distances between the candidates and the query point are calculated and sorted to get the K-nearest neighbors of the query point. The detailed procedure of searching K-nearest neighbors is described in Algorithm 2.

Fig. 1 An example of hash table. a The original data; b The corresponding hash table

From the above algorithms, we can see that the performance of p-stable LSH for searching K-nearest neighbors is influenced by three parameters: k, W, and L. A larger value of k requires more time for calculating hash values and decreases the probability that points that are far apart are assigned to the same bucket. The value of W affects the collision probability of any two points: a bigger value increases the probability, while a smaller value reduces it. The bigger the value of L, the higher the accuracy of K-nearest neighbors searching, but the longer it takes to construct the hash tables.

(Algorithm 1: constructing the LSH index)
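Since the original listing is not reproduced here, the following Python sketch shows one possible implementation of the hash-table construction described above (a minimal sketch under our own naming; build_lsh_index, its data layout, and the tuple bucket keys are assumptions, not the authors' code):

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(data, k, L, W, seed=0):
    """Build L hash tables, each concatenating k p-stable hash functions (Eq. 1).

    Returns (tables, params) where tables[t] maps a bucket key to a list of
    point indices and params[t] = (A, b) holds the hash parameters of table t.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    tables, params = [], []
    for _ in range(L):
        A = rng.normal(size=(k, d))      # k Gaussian projection vectors
        b = rng.uniform(0, W, size=k)    # k offsets in [0, W)
        keys = np.floor((data @ A.T + b) / W).astype(int)   # (n, k) hash values
        table = defaultdict(list)
        for i in range(n):
            table[tuple(keys[i])].append(i)   # concatenated k values form the bucket key
        tables.append(table)
        params.append((A, b))
    return tables, params
```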

The time consumption of p-stable LSH mainly consists of the time for constructing the hash tables, obtaining the candidates and computing the distances. Let \(T_p\) denote the time consumption of a single projection operation, \(T_u\) the time of a union operation, and \(T_d\) the time for calculating the distance between a pair of points. The time cost of constructing the hash tables is \(n*k*L*T_p\), where n is the data size. Obtaining the candidates includes mapping the query point and finding the union of points having the same hash value as the query point; its running time is \(k*L*T_p+L*T_u\). Assuming the number of candidates is \(n_c\), the time cost of computing distances is \(n_c*T_d\). Thus, the total time consumption of p-stable LSH is \(n*k*L*T_p+ k*L*T_p+L*T_u+ n_c*T_d\). The space consumption of p-stable LSH is mainly for storing the hash tables, which is \(n*L\).

(Algorithm 2: searching K-nearest neighbors with p-stable LSH)
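Correspondingly, one possible sketch of the p-stable LSH query procedure described above (again with assumed names; it builds on the build_lsh_index sketch):

```python
import numpy as np

def lsh_query(data, query, K, tables, params, W):
    """Approximate K-nearest neighbors of `query` with p-stable LSH.

    Collects the union of the query's buckets over all L tables, then ranks
    the candidates by exact Euclidean distance.
    """
    candidates = set()
    for table, (A, b) in zip(tables, params):
        key = tuple(np.floor((A @ query + b) / W).astype(int))
        candidates.update(table.get(key, []))   # union over the L hash tables
    if not candidates:
        return []
    cand = np.fromiter(candidates, dtype=int)
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dists)[:K]].tolist()
```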

4 The proposed method

The p-stable LSH algorithm has to set three parameters k, L and W to construct the LSH index. To keep the accuracy of searching K-nearest neighbors high, it has to construct multiple hash tables to obtain enough candidates, which requires a huge space consumption. Besides, p-stable LSH always selects points having the same hash value as the query point, which limits the diversity of the candidates. To solve these problems, we propose a robust LSH algorithm, called Robust-LSH. It is inspired by the observation that points close to the query point collide more often across different hash tables. Based on this idea, we improve the search strategy for KNNS. The proposed algorithm can obtain the same accuracy with fewer hash tables, which reduces the space consumption.

4.1 Robust-LSH method

The proposed method first constructs hash tables like p-stable LSH and then uses a different strategy for KNNS. In the KNNS stage, the p-stable LSH method only takes advantage of the points having the same hash value as the query point. Considering that points that frequently appear together with the query point in multiple hash tables are valuable for searching K-nearest neighbors, we improve the diversity of the candidates by employing these points, which we call candidate seeds in this paper, and use them to obtain a high-quality candidate set for searching K-nearest neighbors.

Definition 3

(Candidate seeds). For a query point q, its candidate seeds are the points that have the same hash value as q in multiple hash tables.

To find the candidate seeds of the query point, we repeatedly select hash tables at random to compute the hash value of the query point, and the points having the same hash value as the query point are intersected across the selected tables. The process terminates when the number of points in the intersection is less than K (K is the number of nearest neighbors). The detailed procedure for finding the candidate seeds is shown in Algorithm 3.

After obtaining the candidate seeds, we find the neighbors of each candidate seed. These neighbors are then combined into a candidate set for searching the K-nearest neighbors of the query point. Next, we give the definitions of the neighbors of a candidate seed and the candidates of a query point.

(Algorithm 3: finding the candidate seeds)
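A possible Python realization of the candidate-seed search described above (a sketch of our reading of Algorithm 3; the exact stopping rule and all names are our assumptions, and it builds on the index sketches in Section 3):

```python
import numpy as np

def find_candidate_seeds(query, K, tables, params, W, seed=0):
    """Intersect the query's buckets over randomly chosen hash tables until
    the intersection would fall below K points; return the surviving indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(tables))   # random order of hash tables
    seeds = None
    for t in order:
        A, b = params[t]
        key = tuple(np.floor((A @ query + b) / W).astype(int))
        bucket = set(tables[t].get(key, []))
        new_seeds = bucket if seeds is None else seeds & bucket
        if len(new_seeds) < K:             # stop before the seed set gets too small
            break
        seeds = new_seeds
    return list(seeds) if seeds else []
```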

Definition 4

(The neighbors of a candidate seed). For a candidate seed \(S_i\), its neighbors are the points having the same hash value as \(S_i\) in a randomly selected hash table \(H_r\), denoted as \(SCS_i\).

Definition 5

(Candidates of a query point). For a query point q, its candidates are the union of the neighbors of all candidate seeds, denoted as \(Candidates(q) = \mathop \cup \limits _{i = 1}^{n_s} SCS_i\), where \(n_s\) is the number of candidate seeds.

The proposed algorithm includes two steps: constructing the hash tables and searching the K-nearest neighbors of the query points. Our work mainly improves the second step. Candidate seeds are points that frequently appear together with the query point, and they are valuable for obtaining the K-nearest neighbors of the query point. Therefore, making full use of the candidate seeds improves the diversity of the candidates, so that we can use fewer hash tables to obtain more valuable candidates for KNNS. The process of Robust-LSH for KNNS is detailed in Algorithm 4.

(Algorithm 4: Robust-LSH for KNNS)
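Putting the pieces together, a sketch of the full Robust-LSH query as we understand it (robust_lsh_query and the per-seed random table choice are our assumptions; it reuses the helper sketches above):

```python
import numpy as np

def robust_lsh_query(data, query, K, tables, params, W, seed=0):
    """Robust-LSH KNNS sketch: build the candidate set from the neighbors of
    the candidate seeds, then rank candidates by exact Euclidean distance."""
    rng = np.random.default_rng(seed)
    seeds = find_candidate_seeds(query, K, tables, params, W, seed)
    candidates = set(seeds)
    for s in seeds:
        t = rng.integers(len(tables))        # a random hash table H_r per seed
        A, b = params[t]
        key = tuple(np.floor((A @ data[s] + b) / W).astype(int))
        candidates.update(tables[t].get(key, []))   # SCS_i: the seed's bucket in H_r
    if not candidates:
        return []
    cand = np.fromiter(candidates, dtype=int)
    dists = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dists)[:K]].tolist()
```

As a usage note, under the settings of Section 5 one would call build_lsh_index once (e.g. k=5, W=4, L=10) and then robust_lsh_query for each query point with K=100.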

4.2 Complexity analysis

For the proposed algorithm, the running time is mainly spent on constructing the hash tables, obtaining the candidates, and computing the distances between the query point and the candidates. Let \(T_p\) denote the time consumption of a single projection operation, \(T_i\) the upper bound of the cost of an intersection operation, \(T_u\) the cost of a union operation, and \(T_d\) the time for calculating the distance between a pair of points. When constructing the hash tables, each point needs to be projected \(k*L\) times, so the time consumption of constructing the hash tables is \(n*k*L*T_p\), where n is the number of objects in the data. To obtain the candidates of the query point, we need to randomly select several hash tables to project the query point and obtain the candidate seeds through intersection operations. If the number of selected hash tables is \(L'\) (\(L'<L\)), the time consumption for obtaining the candidate seeds is \(k*L'*T_p+L'*T_i\). If the number of candidate seeds is \(n_s\) (\(n_s\ll n\)), the upper bound of the time required for the union operations is \(n_s*T_u\). Thus the running time of obtaining the candidates of the query point is \(k*L'*T_p+L'*T_i+ n_s*T_u\). The time for calculating the distances is \(n_c*T_d\), where \(n_c\) (\(n_c\ll n\)) is the number of candidates. Thus, the total time consumption of Robust-LSH is \(n*k*L*T_p+ k*L'*T_p+L'*T_i+ n_s*T_u+ n_c*T_d\). The space of the proposed algorithm is mainly used for storing the hash tables, so its space complexity is \(n*L\).

Compared with p-stable LSH, although Robust-LSH requires more time to find the candidate seeds, it only needs to build a few hash tables to maintain the accuracy of K-nearest neighbors searching.

5 Experiments and analysis

The performance of the proposed algorithm Robust-LSH is evaluated by comparing with KD-tree, p-stable LSH and RLSH on synthetic and real datasets. The KD-tree method is a typical tree-based nearest neighbors searching algorithm; it constructs a tree structure for exact nearest neighbors searching. The p-stable LSH is a popular LSH method. RLSH is an improved p-stable LSH algorithm.

First, we demonstrate the robustness of our method to the parameters used when constructing the LSH hash tables. Then, we explore the influence of data size and dimension on the algorithms. Finally, we use two real datasets to illustrate the robustness of Robust-LSH.

We quantitatively evaluate the quality of KNNS using accuracy (ACC). Given a query point q, A(i) denotes the actual i-th nearest neighbor of q and R(i) denotes the i-th nearest neighbor of q obtained by the algorithm. ACC is calculated by Eq. 2.

$$\begin{aligned} ACC(q) = \frac{{\sum \nolimits _{i = 1}^K {\chi (A(i) - R(i))} }}{K} \end{aligned}$$
(2)

In the above equation, \(\chi (x) = \begin{cases} 1, & x = 0 \\ 0, & x \ne 0 \end{cases}\). ACC is the proportion of correctly found neighbors among the actual neighbors. The range of ACC is [0, 1]; when ACC equals 1, all the correct K nearest neighbors are found by the algorithm. We run the experiments on a PC with an Intel Core i7 1.8 GHz CPU, 8 GB RAM, Windows 10 and MATLAB R2013a.
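For clarity, Eq. 2 as a small Python helper (the name acc is ours; it assumes both neighbor lists are ordered by increasing distance):

```python
def acc(actual, returned, K):
    """ACC of Eq. 2: fraction of positions i where the i-th returned neighbor
    equals the i-th actual neighbor (both lists ordered by distance)."""
    return sum(1 for i in range(K) if actual[i] == returned[i]) / K

# Example: 3 of the first 5 positions match
print(acc([4, 9, 2, 7, 1], [4, 9, 5, 7, 3], K=5))   # 0.6
```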

5.1 The influence of hash tables L

A Gaussian distribution is used to randomly generate a dataset containing 20000 objects, each with 100 dimensions. 20 objects are selected as the query points, and the 100-nearest neighbors of each query point are used to compute the ACC of the algorithms.

When constructing the LSH index, we have to set three parameters: k, W, and L. To explore the influence of L on p-stable LSH, RLSH and Robust-LSH, we fix k and W and vary the number of hash tables L from 5 to 50. According to [13], W is fixed to 4 and k is set to 5.

The ACC and running time are used to evaluate the performance of the algorithms. Since RLSH and Robust-LSH randomly select hash tables to construct the candidates for the query point, each query (containing 20 query points) is repeated 10 times. Table 1 shows the mean and standard deviation of ACC and running time over the 10 experiments. The average ACC and running time are shown in Figure 2.

The standard deviations in Table 1 illustrate that the ACC and running time of p-stable LSH and Robust-LSH are stable. From Figure 2(a), we can see that for p-stable LSH, the ACC score increases with the number of hash tables when the number of hash tables is less than 25. In particular, when there are only a few hash tables, its ACC score is far below that of RLSH and Robust-LSH. For RLSH and Robust-LSH, even when only a few hash tables are used, their ACC scores are higher than 0.9. When the number of hash tables is greater than 10, the ACC score of Robust-LSH stays at 1, which shows that Robust-LSH is robust to the parameter L. This means that we can use fewer hash tables to get good results, and fewer hash tables means less storage space. From Figure 2(b), we can see that Robust-LSH takes a little longer than p-stable LSH and far less time than RLSH. Therefore, our algorithm Robust-LSH is much better than p-stable LSH and RLSH in terms of storage space, query accuracy and running time.

Table 1 The ACC and running time of algorithms with different L settings
Fig. 2 The impact of the number of hash tables on ACC and running time for p-stable LSH, RLSH and Robust-LSH. a The ACC with the increase of hash tables. b The running time with the increase of hash tables

5.2 The influence of data size and dimension

We also compare our algorithm with KD-tree and p-stable LSH on Gaussian-distributed datasets to illustrate the robustness of the proposed algorithm to the dimension and size of the data. The KD-tree algorithm is a typical tree-based method for searching K-nearest neighbors. When experimenting on large-scale data, since RLSH takes a long time to obtain results, we only select the p-stable LSH method among the LSH-based methods for comparison. As in the first experiment, we repeat each query 10 times; each time we find the 100 nearest neighbors of 20 query points, and we use the average ACC and running time over the 10 runs as the evaluation indexes for the performance of the algorithms.

First, we randomly generate Gaussian-distributed data of size 100000, with the dimension gradually increasing from 100 to 1000. From the first experiment, we learn that setting L to 10 not only saves storage space but also achieves a high ACC value for the proposed method. Therefore, in this experiment, we fix L to 10. The parameter k is also set to 5. Since the data scale is much larger than in the first experiment, we increase W to 10. Table 2 lists the mean and standard deviation of ACC and running time. Figure 3 shows the average ACC and running time as the dimension of the data increases.

The standard deviations of Robust-LSH in Table 2 are small, which shows that the results of the proposed algorithm are stable. From Figure 3(a), we find that as the dimension increases, the ACC scores of KD-tree and our method stay close to 1, but the ACC of p-stable LSH decreases from 0.996 to 0.0667. KD-tree is an exact method for searching K-nearest neighbors, while our method is approximate; this result shows that our method obtains almost as accurate K-nearest neighbors as KD-tree. Figure 3(b) shows that the running time of KD-tree increases much faster than that of p-stable LSH and our method as the dimension increases. Therefore, considering both ACC and running time, Robust-LSH is rarely affected by the dimension.

Table 2 The ACC and time of algorithms with different dimensions
Fig. 3 The impact of dimensions on ACC and running time for KD-tree, p-stable LSH and Robust-LSH. a The ACC with the increase of dimensions. b The running time with the increase of dimensions

Then, we randomly generate 1000-dimensional Gaussian-distributed data, with the size of the data ranging from 10000 to 100000. As in the above experiment, the three parameters L, k and W are set to 10, 5, and 10, respectively. Table 3 shows the mean and standard deviation of ACC and running time for the different sizes. Figure 4 shows the average ACC and running time as the size of the data increases.

Figure 4(a) displays the ACC scores of the algorithms as the data size increases and Figure 4(b) displays the running time of the algorithms. It can be seen that the ACC scores of p-stable LSH stay below 0.1, which shows that p-stable LSH is not effective for these data. Since KD-tree is an exact method, its ACC scores remain at 1. The ACC scores of Robust-LSH are always higher than 0.97 and close to 1, which illustrates its effectiveness when searching K-nearest neighbors. The running times shown in Figure 4(b) indicate that the running time of all three algorithms increases as the data size grows. However, the growth rate of KD-tree is much faster than that of p-stable LSH and Robust-LSH. Although Robust-LSH takes a little longer than p-stable LSH, its ACC scores are much higher. Hence, combining these two indicators, Robust-LSH outperforms KD-tree and p-stable LSH.

Table 3 The ACC and time of algorithms with different sizes
Fig. 4 The impact of data size on ACC and running time for KD-tree, p-stable LSH and Robust-LSH. a The ACC with the increase of data size. b The running time with the increase of data size

5.3 The experiment on real datasets

We further use two real datasets, Patches and MNIST, from [27] to illustrate the effectiveness of Robust-LSH. Patches consists of 59500 points, each with 400 dimensions. MNIST contains 60000 points, each with 784 dimensions. The compared algorithms include p-stable LSH, RLSH and KD-tree. When constructing the LSH index for p-stable LSH, RLSH and Robust-LSH, we set L to 10 as in the previous setting, roughly estimate W by dividing the range of the data by 4, and vary the number of hash functions k from 4 to 10. We still use 20 query points' 100 neighbors to compute the ACC scores of the algorithms. The average ACC and running time of 10 experiments are used as the evaluation indexes. The running time of p-stable LSH, RLSH and Robust-LSH includes the time for building the hash tables and searching the K-nearest neighbors. Table 4 lists the mean and standard deviation of ACC and running time for Patches, and Table 5 lists the mean and standard deviation of ACC and running time for MNIST.

The results in Tables 4 and 5 show that a smaller k value takes longer time but obtains a higher ACC score. The reason is that a smaller k value produces more candidates for searching K-nearest neighbors. With the increase of k, the ACC scores of the three LSH methods decrease. Specifically, for Patches, p-stable LSH decreases from 1 to 0.565, RLSH decreases from 1 to 0.926, and Robust-LSH decreases from 1 to 0.941; for MNIST, p-stable LSH decreases from 0.975 to 0.310, RLSH decreases from 1 to 0.825, and Robust-LSH decreases from 1 to 0.864. Clearly, the ACC scores of Robust-LSH drop the slowest compared with p-stable LSH and RLSH, which demonstrates that Robust-LSH is robust to the parameter k. Besides, regardless of the value of k, the proposed method obtains higher ACC scores than p-stable LSH and RLSH, and Robust-LSH's running time is far less than that of RLSH. The ACC scores of KD-tree are always 1, because it is an exact K-nearest neighbors searching method; however, its running time on these two real datasets is longer than that of Robust-LSH. Therefore, in terms of ACC and running time, the proposed method Robust-LSH is more effective than the other KNNS algorithms.

Table 4 The ACC and running time of algorithms on Patches
Table 5 The ACC and running time of algorithms on MNIST

6 Conclusions and future work

LSH is a popular approximate KNNS method for high-dimensional data. However, the search strategy of p-stable LSH is sensitive to the parameters used in constructing the LSH index. RLSH improves the query strategy by randomly selecting hash tables to project the query point, but it takes much longer to obtain a desired result. In this paper, we propose a novel strategy for searching K-nearest neighbors, called Robust-LSH, which is robust to the parameters of constructing the LSH index. It makes full use of points that frequently appear together with the query point to improve the diversity of the candidates, so that it can use fewer hash tables to obtain more valuable candidates for searching K-nearest neighbors. The experimental results illustrate the robustness and effectiveness of the proposed algorithm.

The proposed algorithm is robust to the number of hash tables, and we can use fewer hash tables to obtain high-quality candidates for KNNS. However, the number of candidates obtained by Robust-LSH is larger than that of p-stable LSH, so it takes a little longer time than p-stable LSH.

In the future, we will explore the parallel implementation of the algorithm to further improve the efficiency of Robust-LSH. At the same time, we will also explore its application in clustering analysis.