Abstract
This survey paper considers index structures for fast similarity search for objects represented by real-valued vectors. Index structures based on locality-sensitive hashing and their modifications are discussed. The ideas of specific algorithms, including the recently proposed ones, are stated. Their interrelations and some theoretical aspects are discussed.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
D. A. Rachkovskij, “Real-valued embeddings and sketches for fast distance and similarity estimation,” Cybernetics and Systems Analysis, Vol. 52, No. 6, 967–988 (2016).
D. A. Rachkovskij, “Binary vectors for fast distance and similarity estimation,” Cybernetics and Systems Analysis, Vol. 53, No. 1, 138–156 (2017).
D. A. Rachkovskij, “Distance-based index structures for fast similarity search,” Cybernetics and Systems Analysis, Vol. 53, No. 4, 636–658 (2017).
D. A. Rachkovskij, “Index structures for fast similarity search for binary vectors,” Cybernetics and Systems Analysis, Vol. 53, No. 5, 799–820 (2017).
C.Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval, Cambridge University Press, New York (2008).
R. Datta, D. Joshi, J. Li, and J. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Computing Surveys, Vol. 40, No. 2, 1–60 (2008).
Ì. Ì. Fouad, “Content-based search for image retrieval,” Int. J. Image, Graphics and Signal Processing, Vol. 5, No. 11, 46–52 (2013).
F. A. Khalifa, N. A. Semary, H. M. El-Sayed, and M. M. Hadhoud, “Local detectors and descriptors for object class recognition,” Int. J. of Intelligent Systems and Applications, Vol. 7, No. 10, 12–18 (2015).
A. Ziomek and M. Oszust, “Evaluation of interest point detectors in presence of noise,” Int. J. Intelligent Systems and Applications, Vol. 8, No. 3, 26–33 (2016).
S. Fortune, “Voronoi diagrams and Delaunay triangulations,” in: Handbook of Discrete and Computational Geometry, Chap. 27, 3rd Edition, CRC Press, Boca Raton, USA (2017), pp. 705–721.
S. Meiser, “Point location in arrangements of hyperplanes,” Inform. and Comput., Vol. 106, No. 2, 286–303 (1993).
A. Andoni and P. Indyk, “Nearest neighbors in high-dimensional spaces,” in: Handbook of Discrete and Computational Geometry, Chap. 43, 3rd Edition, CRC Press, Boca Raton, USA (2017), pp. 1133–1153.
R. Weber, H. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in: Proc. VLDB’98 (1998), pp. 194–205.
S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” Journal of the ACM, Vol. 45, No. 6, 891–923 (1998).
A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Communications of the ACM, Vol. 51. No. 1, 117–122 (2008).
S. Har-Peled, P. Indyk, and R. Motwani, “Approximate nearest neighbor: Towards removing the curse of dimensionality,” Theory Comput., Vol. 8, 321–350 (2012).
D. M. W. Powers, “Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation,” J. of Machine Learning Tech., Vol. 2, No. 1, 37–63 (2011).
R. Das, S. Thepade, and S. Ghosh, “Content based image recognition by information fusion with multiview features. I,” J. Information Technology and Computer Science, Vol. 7, No. 10, 61–73 (2015).
S. Ramaswamy and K. Rose, “Adaptive cluster distance bounding for high-dimensional indexing,” IEEE Trans. on KDE, Vol. 23, No. 6, 815–830 (2011).
M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE TPAMI, Vol. 36, No. 11, 2227–2240 (2014).
A. Shrivastava and P. Li, “Asymmetric minwise hashing for indexing binary inner products and set containment,” in: Proc. WWW’15 (2015), pp. 981–991.
M. Charikar, “Similarity estimation techniques from rounding algorithms,” in: Proc. STOC’02 (2002), pp. 380–388.
M. Aumuller, T. Christiani, R. Pagh, and F. Silvestr, Distance Sensitive Hashing. arXiv:1703.07867. 22 Mar 2017.
A. Andoni, M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing using stable distributions,” in: Nearest Neighbor Methods for Learning and Vision: Theory and Practice, MIT Press, Cambridge (2006), pp. 61–72.
N. Pham, “Hybrid LSH: Faster near neighbors reporting in high-dimensional space,” in: Proc. EDBT’17 (2017), pp. 454–457.
J. Wang, H. T. Shen, J. Song, and J. Ji, Hashing for Similarity Search: A Survey. arXiv:1408.2927. 13 Aug 2014.
J. Tang and Y. Tian, “A systematic review on minwise hashing algorithms,” Annals of Data Science, Vol. 3, No. 4, 445–468 (2016).
B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing,” IEEE Trans. PAMI, Vol. 34, No. 6, 1092–1104 (2012).
Y. Mu and S. Yan, “Non-metric locality sensitive hashing,” in: Proc. AAAI’10 (2010), pp. 539–544.
A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in: Proc. FOCS’06 (2006), pp. 459-468.
H. Jegou, L. Amsaleg, C. Schmid, and P. Gros, “Query-adaptive locality sensitive hashing,” in: Proc. ICASSP’08 (2008), pp. 825–828.
F. Chierichetti and R. Kumar, “Lsh-preserving functions and their applications,” J. ACM, Vol. 62, No. 5, 33:1–33:25 (2015).
F. Chierichetti, R. Kumar, A. Panconesi, and E. Terolli, “The distortion of locality sensitive hashing,” in: Proc. ITCS’17 (2017), p. 23.
A. Sokolov, “Investigation of accelerated search for close text sequences with the help of vector representations,” Cybernetics and Systems Analysis, Vol. 44, No. 4, 493–506 (2008).
A. Andoni, R. Krauthgamer, and I. P. Razenshteyn, “Sketching and embedding are equivalent for norms,” in: Proc. STOC’15 (2015), pp. 479–488.
A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in: Proc. NIPS’15 (2015), pp. 1225-1233.
K. Terasawa and Y. Tanaka, “Spherical lsh for approximate nearest neighbor search on unit hypersphere,” in: Proc. WADS’07 (2007), pp. 27–38.
K. Eshghi and S. Rajaram, “Locality sensitive hash functions based on concomitant rank order statistics,” in: Proc. KDD’08 (2008), pp. 221–229.
A. Andoni and I. Razenshteyn, “Optimal data-dependent hashing for approximate near neighbors,” in: Proc. STOC’15 (2015), pp. 793–801.
T. Laarhoven, “Hypercube LSH for approximate near neighbors,” in: Proc. MFCS’17 (2017).
C. Kennedy and R. Ward, “Fast cross-polytope locality-sensitive hashing,” in: Proc. ITCS’17 (2017).
R. Panigrahy, “Entropy based nearest neighbor search in high dimensions,” in: Proc. SODA’06 (2006), pp. 1186–1195.
Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe lsh: efficient indexing for high-dimensional similarity search,” in: Proc. VLDB’07 (2007), pp. 950–961.
A. Joly and O. Buisson, “A posteriori multi-probe locality sensitive hashing,” in: Proc. MM’08 (2008), pp. 209–218.
M. Slaney, Y. Lifshits, and J. He, “Optimal parameters for locality-sensitive hashing,” Proc. IEEE, Vol. 100, No. 9, 2604–2623 (2012).
M. Kapralov, “Smooth tradeoffs between insert and query complexity in nearest neighbor search,” in: Proc. PODS’15 (2015), pp. 329–342.
T. D. Ahle, M. Aumuller, and R. Pagh, “Parameter-free locality sensitive hashing for spherical range reporting,” in: Proc. SODA’17 (2017), pp. 239–256.
A. Pacuk, P. Sankowski, K. Wegrzycki, and P. Wygocki, “Locality-sensitive hashing without false negatives for lp,” in: Proc. COCOON’16 (2016), pp. 105–118.
P. Wygocki, On Fast Bounded Locality Sensitive Hashing. arXiv:1704.05902. 19 Apr 2017.
W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li, “Modeling lsh for performance tuning,” in: Proc. CIKM’08 (2008), pp. 669–678.
P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier, “Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm,” in: Proc. AofA’07 (2007), pp. 127–146.
A. Chakrabarti, V. Satuluri, A. Srivathsan, and S. Parthasarathy, “A Bayesian perspective on locality sensitive hashing with extensions for kernel methods,” ACM TKDD, Vol. 10, No. 2, 19:1–19:32 (2015).
M. Bawa, T. Condie, and P. Ganesan, “Lsh forest: Self-tuning indexes for similarity search,” in: Proc. WWW’05 (2005), pp. 651–660.
A. Andoni, I. Razenshteyn, and N. Shekel Nosatzki, “Lsh forest: Practical algorithms made theoretical,” in: Proc. SODA’17 (2017), pp. 67–78.
Y. Tao, K. Yi, C. Sheng, and P. Kalnis, “Efficient and accurate nearest neighbor and closest pair search in high dimensional space,” ACM TODS, Vol. 35, No. 3, 20:1–20:46 (2010).
J. K. Lawder and P. J. H. King, “Querying multi-dimensional data indexed using the Hilbert space filling curve,” ACM SIGMOD Record, Vol. 30, No. 1, 19–24 (2001).
D. Comer, “The ubiquitous B-tree,” ACM Comput. Surv., Vol. 11, 121–138 (1979).
Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen, “Sk-lsh: An efficient index structure for approximate nearest neighbor search,” in: Proc. VLDB Endowment, Vol. 7, No. 9, 745-756 (2014).
J. Chen, C. He, G. Hu, and J. Shao, “SELSH: A hashing scheme for approximate similarity search with early stop condition,” in: Proc. MMM’16, Vol. 2 (2016), pp. 104–115.
F. Hao, J. Daugman, and P. Zielinski, “A fast search algorithm for a large fuzzy database,” IEEE Trans. Information Forensics and Security, Vol. 3, No. 2, 203–212 (2008).
K. Ling and G. Wu, “Frequency based locality sensitive hashing,” in: Proc. ICMT’11 (2011), pp. 4929–4932.
J. Gan, J. Feng, Q. Fang, and W. Ng, “Locality-sensitive hashing scheme based on dynamic collision counting,” in: Proc. SIGMOD’12 (2012), pp. 541–552.
Y. Zheng, Q. Guo, A. K. H. Tung, and S. Wu, “LazyLSH: Approximate nearest neighbor search for multiple distance functions with a single index,” in Proc. SIGMOD’16 (2016), pp. 2023–2037.
Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng, “Query-aware locality-sensitive hashing for approximate nearest neighbor search,” Proc. VLDB Endowment, Vol 9, No. 1, 1–12 (2015).
X. Zhang, M. Wang, and J. Cui, “Efficient indexing of binary LSH for high dimensional nearest neighbor,” Neurocomputing, Vol. 213, 24–33 (2016).
M. Norouzi, A. Punjani, and D. J. Fleet, “Fast exact search in Hamming space with multi-index hashing,” IEEE Trans. PAMI, Vol. 36, No. 6, 1107–1119 (2014).
J. Gao, H. V. Jagadish, B. C. Ooi, and S. Wang, “Selective hashing: Closing the gap between radius search and k-NN search,” in: Proc. SIGKDD’15 (2015), pp. 349–358.
A. Andoni, T. Laarhoven, I. Razenshteyn, and E. Waingarten, “Optimal hashing-based time-space trade-offs for approximate near neighbors” in: Proc. SODA’17 (2017), pp. 47–66.
A. Becker, L. Ducas, N. Gama, and T. Laarhoven, “New directions in nearest neighbor searching with applications to lattice sieving,” in: Proc. SODA’16 (2016), pp. 10–24.
T. Christiani, “A framework for similarity search with space-time tradeoffs using locality-sensitive filtering,” in: Proc. SODA’17 (2017), pp. 31–46.
D. A. Rachkovskij, I. S. Misuno, and S. V. Slipchenko, “Randomized projective methods for construction of binary sparse vector representations,” Cybernetics and Systems Analysis, Vol. 48, No. 1, 146–156 (2012).
D. A. Rachkovskij, “Formation of similarity-reflecting binary vectors with random binary projections,” Cybernetics and Systems Analysis, Vol. 51, No. 2, 313–323 (2015).
R. Donaldson, A. Gupta, Y. Plan, and T. Reimer, Random Mappings Designed for Commercial Search Engines. arXiv:1507.05929. 21 Jul 2015.
S. Ferdowsi, S. Voloshynovskiy, D. Kostadinov, and T. Holotyak, “Fast content identification in highdimensional feature spaces using sparse ternary codes,” in: Proc. WIFS’16 (2016), pp. 1–6.
G. Valiant, “Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem,” J. ACM, Vol. 62, No. 2, 13:1–13:45 (2015).
H. L. Nguyen, Algorithms for High Dimensional Data, PhD Thesis, Princeton University (2014). URL: http://arks.princeton.edu/ark:/88435/dsp01b8515q61f.
A. Rahimi and B. Recht, “Random features for large-scale kernel machine,” in: Proc. NIPS’07 (2007), pp. 1177–1184.
R. O’Donnell, Y. Wu, and Y. Zhou, “Optimal lower bounds for locality sensitive hashing (except when q is tiny),” ACM TOCS, Vol. 6, No. 1, 5.1–5.13 (2014).
J. Wang, W. Liu, S. Kumar, and S.-F. Chang, “Learning to hash for indexing big data: A survey,” Proc. IEEE, Vol. 104, No. 1, 34–57 (2016).
J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A Survey on Learning to Hash,” IEEE Trans. PAMI. doi:https://doi.org/10.1109/TPAMI.2017.2699960.
L. Gao, J. Song, X. Liu, J. Shao, J. Liu, and J. Shao, “Learning in high-dimensional multimedia data: The state of the art,” Multimedia Systems, Vol. 23, No. 3, 303–313 (2017).
W. Mou and L. Wang, “A refined analysis of lsh for well-dispersed data points,” in: Proc. ANALCO’17 (2017), pp. 174–182.
A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn, “Beyond locality-sensitive hashing,” in: Proc. SODA’14 (2014), pp. 1018–1028.
A. Andoni and I. Razenshteyn, “Tight lower bounds for data-dependent locality-sensitive hashing,” in: Proc. SoCG’16 (2016), pp. 9:1–9:11.
V. I. Gritsenko, D. A. Rachkovskij, A. A. Frolov, R. Gayler, D. Kleyko, and E. Osipov, “Neural distributed autoassociative memories: A survey,” Cybernetics and Computer Engineering, No. 2 (188), 5–35 (2017).
Y. Wang, A. Shrivastava, and J. Ryu, FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search. arXiv:1709.01190. 4 Sep 2017.
A. Shrivastava, “Optimal densification for fast and accurate minwise hashing,” in: Proc. ICML’17 (2017), pp. 3154–3163.
Author information
Authors and Affiliations
Corresponding author
Additional information
Translated from Kibernetika i Sistemnyi Analiz, No. 1, January–February, 2018, pp. 168–183.
Rights and permissions
About this article
Cite this article
Rachkovskij, D.A. Index Structures for Fast Similarity Search for Real-Valued Vectors. I. Cybern Syst Anal 54, 152–164 (2018). https://doi.org/10.1007/s10559-018-0016-1
Received:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10559-018-0016-1