Abstract
We present an approach to email filtering based on the suffix tree data structure. We develop a method for scoring emails using the suffix tree and test a number of scoring and score normalisation functions. Our results show that the character-level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy compared with currently popular methods such as naive Bayes. We believe the method can be extended to the classification of documents in other domains.
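The general idea of scoring an email against character-level class models can be sketched as follows. This is a minimal illustration under stated assumptions, not the method evaluated in the article: it uses a depth-limited suffix trie rather than a compressed suffix tree, and the names SuffixTrie, match_score, score and classify, the sibling-frequency scoring function, the length normalisation and the ratio threshold are all hypothetical choices made for this example.

```python
class SuffixTrie:
    """Depth-limited character suffix trie with per-node occurrence counts.

    Assumption: a plain trie stands in for a true (compressed) suffix tree.
    """

    def __init__(self, max_depth=8):
        self.max_depth = max_depth
        self.root = {"count": 0, "children": {}}

    def add(self, text):
        # Insert every suffix of the text, truncated to max_depth characters.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                node = node["children"].setdefault(
                    ch, {"count": 0, "children": {}}
                )
                node["count"] += 1

    def match_score(self, text, start):
        # Follow text[start:] down the trie until a character fails to match.
        # Each matched character contributes its relative frequency among its
        # siblings (an illustrative scoring function, chosen for simplicity).
        node = self.root
        score = 0.0
        for ch in text[start:start + self.max_depth]:
            siblings = node["children"]
            if ch not in siblings:
                break
            total = sum(child["count"] for child in siblings.values())
            node = siblings[ch]
            score += node["count"] / total
        return score

    def score(self, text):
        # Sum the match scores of all suffixes, normalised by message length.
        if not text:
            return 0.0
        total = sum(self.match_score(text, i) for i in range(len(text)))
        return total / len(text)


def classify(text, spam_trie, ham_trie, threshold=1.0):
    # Label the message as spam when the spam/ham score ratio exceeds the
    # threshold (the threshold value is an assumption for this sketch).
    spam = spam_trie.score(text)
    ham = ham_trie.score(text)
    if ham == 0.0:
        return "spam" if spam > 0.0 else "ham"
    return "spam" if spam / ham > threshold else "ham"


spam_trie = SuffixTrie()
for message in ["cheap pills buy now", "buy cheap meds now"]:
    spam_trie.add(message)

ham_trie = SuffixTrie()
for message in ["meeting at noon tomorrow", "see you at the meeting"]:
    ham_trie.add(message)

print(classify("buy cheap pills", spam_trie, ham_trie))
print(classify("meeting tomorrow", spam_trie, ham_trie))
```

Because the models work on raw characters, no tokenisation or stemming step is needed, and substrings that span word boundaries contribute to the score. Raising the threshold makes the sketch more reluctant to label a message as spam.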
Additional information
Editor: Tom Fawcett
About this article
Cite this article
Pampapathi, R., Mirkin, B. & Levene, M. A suffix tree approach to anti-spam email filtering. Mach Learn 65, 309–338 (2006). https://doi.org/10.1007/s10994-006-9505-y
DOI: https://doi.org/10.1007/s10994-006-9505-y