Abstract
The problem of efficiently finding similar items in a large corpus of high-dimensional data points arises in many real-world tasks, such as music, image, and video retrieval. Beyond the scaling difficulties that arise with lookups in large data sets, the complexity in these domains is exacerbated by an imprecise definition of similarity. In this paper, we describe a method to learn a similarity function from only weakly labeled positive examples. Once learned, this similarity function is used as the basis of a hash function to severely constrain the number of points considered for each lookup. Tested on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) are ever considered for each lookup. To increase efficiency, no comparisons in the original high-dimensional space of points are required. The performance far surpasses, in terms of both efficiency and accuracy, a state-of-the-art Locality-Sensitive-Hashing-based (LSH) technique for the same problem and data set.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Aucouturier J, Pachet F (2002) Music similarity measures: what’s the use? In: Proceedings of the 3rd international conference on music information retrieval
Baluja S (2007) Automated image orientation detection: a scalable boosting approach. Pattern Anal Appl 10(3): 247–263
Baluja S, Covell M (2006) Content fingerprinting with wavelets. In: Third European conference on visual media production (CVMP), pp 198–207
Baluja S, Covell M (2007) Learning forgiving hash functions: algorithms and large-scale tests. In: International joint conference on artificial intelligence
Bar-Hillel A, Hertz T, Shental N, Weinshall D (2003) Learning distance functions using equivalence relations. In: Proceedings of the twentieth international conference on machine learning
Brieman L (1996) Bagging predictors. Mach Learn 24(2): 123–140
Burges JC, Platt JC, Jana S (2003) Distortion discriminant analysis for audio fingerprinting. IEEE Trans Speech Audi Processing 11: 165–174
Bylander T, Tate L (2006) Using validation sets to avoid overfitting in AdaBoost. In: Proceedings of the 19th international Florida artificial intelligence research society conference, pp 544–549
Caruana R, Baluja S, Mitchell T (1996) Using the future to “sort out” the present: rankprop and multitask learning. Neural Inf Process Syst 8: 959–965
Chaudhuri S, Ganjam K, Ganti V, Motwani R (2003) Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data, pp 313–324
Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD, Yang C (2001) Finding interesting associations without support pruning. Knowl Data Eng 13(1): 64–78
Covell M, Baluja S (2007) Known-audio detection using waveprint: spectrogram fingerprinting by wavelet hashing. In: Proceedings of the international conference on acoustics, speech, and signal processing
Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of the thirteenth international conference on machine learning, pp 148–156
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the international conference on very large databases. Edinburgh, Scotland, UK, pp 518–529
Haitsma J, Kalker T (2002), A highly robust audio fingerprinting system. In: Proceedings of International conference on music information retrieval
Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor. IEEE PAMI, 18
Jacobs C, Finkelstein A, Salesin D (1995) Fast multiresolution image querying. In: Proceedings SIGGRAPH
Ke Y, Hoiem D, Sukthankar R (2005) Computer vision for music identification. In: Proceedings of computer vision and pattern recognition, pp 597–604
Pampalk E (2006) Computational models of music similarity and their application to music information retrieval. Doctoral Thesis, Vienna University of Technology, Austria, March 2006
Shakhnarovich G, Viola P, Darrell T (2003) Fast pose estimation with parameter sensitive hashing. In: Proceedings of the international conference on computer vision
Shazam Entertainment (2005). http://shazamentertainment.com
Tieu K, Viola P (2000) Boosting image retrieval. In: Proceedings of computer vision and pattern recognition
Tsang IW, Cheung P-M, Kwok JT (2005) Kernel relevant component analysis for distance metric learning. In: Proceedings of the 2005 IEEE international joint conference on neural networks, vol 2, pp 954–959
Viola P, Jones MJ (2001) Robust real-time object detection. In: Proceedings of the IEEE workshop on statistical and computational theories of vision
Wu J, Rehg J, Mullin M (2003) Learning a rare event detection cascade by direct feature selection. Adv Neural Inf Process Syst 16
Zhang L, Li M, Zhang H (2002) Boosting image orientation detection with indoor vs. outdoor classification. In: IEEE workshop on applications of computer vision
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Eamonn Keogh.
Rights and permissions
About this article
Cite this article
Baluja, S., Covell, M. Learning to hash: forgiving hash functions and applications. Data Min Knowl Disc 17, 402–430 (2008). https://doi.org/10.1007/s10618-008-0096-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-008-0096-z