Learning to hash: forgiving hash functions and applications

Baluja, Shumeet; Covell, Michele

doi:10.1007/s10618-008-0096-z

Published: 16 May 2008

Data Mining and Knowledge Discovery, Volume 17, pages 402–430 (2008)

Abstract

The problem of efficiently finding similar items in a large corpus of high-dimensional data points arises in many real-world tasks, such as music, image, and video retrieval. Beyond the scaling difficulties that arise with lookups in large data sets, the complexity in these domains is exacerbated by an imprecise definition of similarity. In this paper, we describe a method to learn a similarity function from only weakly labeled positive examples. Once learned, this similarity function is used as the basis of a hash function to severely constrain the number of points considered for each lookup. Tested on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) are ever considered for each lookup. To increase efficiency, no comparisons in the original high-dimensional space of points are required. The performance far surpasses, in terms of both efficiency and accuracy, a state-of-the-art Locality-Sensitive-Hashing-based (LSH) technique for the same problem and data set.
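
The lookup scheme outlined in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, since the abstract gives no implementation details: random sign projections stand in for the learned similarity function, and the names make_sign_hash and HashIndex, the use of several hash tables, and the ranking by vote count are all hypothetical. The sketch only shows the property the abstract emphasizes: a query touches just the items that share a bin with it, and candidates are ranked by bin collisions, with no distance computed in the original high-dimensional space.

```python
import random
from collections import Counter, defaultdict


def make_sign_hash(dim, bits, rng):
    """Return a stand-in hash function: the sign pattern of `bits` random projections.

    Assumption: in the paper this role is played by a learned function;
    random hyperplanes are used here only to keep the sketch self-contained.
    """
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def h(x):
        return tuple(
            sum(p_i * x_i for p_i, x_i in zip(plane, x)) >= 0.0 for plane in planes
        )

    return h


class HashIndex:
    """Hypothetical index: one table per hash function, mapping bin key to item ids."""

    def __init__(self, hash_fns):
        self.hash_fns = hash_fns
        self.tables = [defaultdict(list) for _ in hash_fns]

    def add(self, item_id, x):
        # Store only the id in each bin; the vector itself is not kept.
        for h, table in zip(self.hash_fns, self.tables):
            table[h(x)].append(item_id)

    def lookup(self, x):
        # Count, per item, how many tables place it in the same bin as the query.
        votes = Counter()
        for h, table in zip(self.hash_fns, self.tables):
            votes.update(table.get(h(x), ()))
        return votes.most_common()


# Usage with synthetic data (illustrative values only).
rng = random.Random(0)
dim = 16
index = HashIndex([make_sign_hash(dim, 8, rng) for _ in range(4)])
points = {i: [rng.gauss(0.0, 1.0) for _ in range(dim)] for i in range(1000)}
for item_id, x in points.items():
    index.add(item_id, x)

# Only items sharing at least one bin with the query are returned.
candidates = index.lookup(points[42])
```

Because the index stores ids rather than vectors, the ranking depends only on bin collisions. How well this works depends entirely on how closely the hash function tracks the intended notion of similarity, which is the part the paper proposes to learn.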

Author information

Authors and Affiliations

Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA, 94043, USA
Shumeet Baluja & Michele Covell

Corresponding author

Correspondence to Shumeet Baluja.

Additional information

Responsible editor: Eamonn Keogh.

About this article

Cite this article

Baluja, S., Covell, M. Learning to hash: forgiving hash functions and applications. Data Min Knowl Disc 17, 402–430 (2008). https://doi.org/10.1007/s10618-008-0096-z

Received: 09 January 2008
Accepted: 23 April 2008
Published: 16 May 2008
Issue Date: December 2008
DOI: https://doi.org/10.1007/s10618-008-0096-z
