A suffix tree approach to anti-spam email filtering

Pampapathi, Rajesh; Mirkin, Boris; Levene, Mark

doi:10.1007/s10994-006-9505-y

Published: 27 July 2006

Volume 65, pages 309–338, (2006)

Machine Learning

Abstract

We present an approach to email filtering based on the suffix tree data structure. A method for scoring emails using the suffix tree is developed, and a number of scoring and score normalisation functions are tested. Our results show that the character-level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy when compared with currently popular methods such as naive Bayes. We believe the method can be extended to the classification of documents in other domains.
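
The idea of character-level scoring against a per-class tree can be illustrated with a minimal sketch. It is not the scoring or normalisation function evaluated in the article: it assumes a depth-limited suffix trie stored as nested dictionaries, ignores substring frequencies, and normalises by the largest attainable match length. The names build_trie, match_score and classify, the max_depth parameter and the toy training strings are all hypothetical.

```python
def build_trie(texts, max_depth=8):
    """Build a depth-limited suffix trie (nested dicts) from training texts.

    Assumption: a plain trie truncated at max_depth stands in for a true
    suffix tree; it keeps the example short but is not space-efficient.
    """
    root = {}
    for text in texts:
        for start in range(len(text)):
            node = root
            for ch in text[start:start + max_depth]:
                node = node.setdefault(ch, {})
    return root


def match_score(trie, text, max_depth=8):
    """Fraction of characters matched when every suffix of text is walked
    down the trie; 1.0 means every window of text occurs in the class."""
    if not text:
        return 0.0
    matched = 0
    possible = 0
    for start in range(len(text)):
        window = text[start:start + max_depth]
        possible += len(window)
        node = trie
        for ch in window:
            node = node.get(ch)
            if node is None:
                break  # this suffix diverges from the class here
            matched += 1
    return matched / possible


def classify(text, spam_trie, ham_trie):
    """Label text with the class whose trie it matches more closely."""
    spam = match_score(spam_trie, text)
    ham = match_score(ham_trie, text)
    return "spam" if spam > ham else "ham"


# Toy training data, invented for illustration only.
spam_trie = build_trie(["buy cheap pills now"])
ham_trie = build_trie(["meeting at noon tomorrow"])
print(classify("cheap pills", spam_trie, ham_trie))
print(classify("noon meeting", spam_trie, ham_trie))
```

Because matching is done on characters rather than word tokens, a message can still score against a class when only fragments of its words overlap with the training text. A realistic filter would also weight matches by frequency and choose a decision threshold rather than a direct comparison.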

Author information

Authors and Affiliations

School of Computer Science and Information Systems, Birkbeck College, University of London, London
Rajesh Pampapathi, Boris Mirkin & Mark Levene

Corresponding author

Correspondence to Rajesh Pampapathi.

Additional information

Editor: Tom Fawcett

About this article

Cite this article

Pampapathi, R., Mirkin, B. & Levene, M. A suffix tree approach to anti-spam email filtering. Mach Learn 65, 309–338 (2006). https://doi.org/10.1007/s10994-006-9505-y

Received: 03 March 2005
Revised: 01 June 2006
Accepted: 12 June 2006
Published: 27 July 2006
Issue Date: October 2006
DOI: https://doi.org/10.1007/s10994-006-9505-y
