Abstract
This paper introduces “class-aware similarity hashes” or “classprints,” which are an outgrowth of recent work on similarity hashing. The approach builds on the notion of context-based hashing to create a framework for identifying data types based on content and for building characteristic similarity hashes for individual data items that can be used for correlation. The principal benefits are that data classification can be fully automated and that a priori knowledge of the underlying data is not necessary beyond the availability of a suitable training set.
Copyright information
© 2008 IFIP International Federation for Information Processing
About this paper
Cite this paper
Roussev, V., Richard, G., Marziale, L. (2008). Class-Aware Similarity Hashing for Data Classification. In: Ray, I., Shenoi, S. (eds) Advances in Digital Forensics IV. DigitalForensics 2008. IFIP — The International Federation for Information Processing, vol 285. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84927-0_9
DOI: https://doi.org/10.1007/978-0-387-84927-0_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84926-3
Online ISBN: 978-0-387-84927-0
eBook Packages: Computer Science (R0)