Abstract
Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer “data loss prevention” (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while fewer than 1 in 100 of the alarms it raises is false).
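The two rates quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the standard definitions (false negative rate = FN / (TP + FN), false discovery rate = FP / (TP + FP)); the function names and the counts below are hypothetical and are not taken from the paper's experiments.

```python
def false_negative_rate(tp: int, fn: int) -> float:
    """Fraction of truly sensitive documents that the classifier missed."""
    total_sensitive = tp + fn
    if total_sensitive == 0:
        raise ValueError("no sensitive documents in the evaluation set")
    return fn / total_sensitive


def false_discovery_rate(tp: int, fp: int) -> float:
    """Fraction of documents flagged as sensitive that are not sensitive."""
    total_flagged = tp + fp
    if total_flagged == 0:
        raise ValueError("the classifier flagged no documents")
    return fp / total_flagged


# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fn, fp = 980, 20, 5

fnr = false_negative_rate(tp, fn)   # share of leaks that go undetected
fdr = false_discovery_rate(tp, fp)  # share of alarms that are false

print(f"false negative rate:  {fnr:.3%}")
print(f"false discovery rate: {fdr:.3%}")
```

With these illustrative counts both rates fall under the thresholds reported in the abstract (3.0% and 1.0% respectively). Note that the false discovery rate is measured over raised alarms, not over all documents scanned.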
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hart, M., Manadhata, P., Johnson, R. (2011). Text Classification for Data Loss Prevention. In: Fischer-Hübner, S., Hopper, N. (eds) Privacy Enhancing Technologies. PETS 2011. Lecture Notes in Computer Science, vol 6794. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22263-4_2
DOI: https://doi.org/10.1007/978-3-642-22263-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22262-7
Online ISBN: 978-3-642-22263-4
eBook Packages: Computer Science (R0)