1 Introduction

E-mail, or electronic mail, is a fast, effective, and inexpensive method of exchanging messages over the Internet. Whether it is a personal message from a family member, a company-wide message from the boss, researchers across continents sharing recent findings, or astronauts staying in touch with their families (via e-mail uplinks or IP phones), e-mail is a preferred means of communication. Used worldwide by 2.3 billion users at the time of writing, e-mail is projected to grow to 4.3 billion accounts by 2016 [1]. But the increasing dependence on e-mail has given rise to many problems caused by ‘illegitimate’ e-mails, i.e., spam. The Text Retrieval Conference (TREC) defines ‘spam’ as any unsolicited e-mail that is sent indiscriminately [2]. Spam e-mails are unsolicited, un-ratified, and usually mass mailed. Spam is a carrier of malware and drives the proliferation of unsolicited advertisements, fraud schemes, phishing messages, explicit content, promotions of causes, etc. On an organizational front, the effects of spam include: (i) annoyance to individual users, (ii) less reliable e-mail, (iii) loss of work productivity, (iv) misuse of network bandwidth, (v) wastage of file server storage space and computational power, (vi) spread of viruses, worms, and Trojan horses, and (vii) financial losses through phishing, denial of service (DoS), directory harvesting attacks, etc.

Figure 1 depicts the e-mail architecture and how e-mail works. Spam is a broad concept that is still not completely understood. In general, spam takes many forms: chat rooms are subject to chat spam, blogs are subject to blog spam (splogs), search engines are often misled by web spam (search engine spamming or spamdexing), and social systems are plagued by social spam. This paper focuses on ‘e-mail spam’ and its variants, not on ‘spam’ in general. Prior attempts have been made to review e-mail spam filtering using machine learning, the most notable being [2,3,4,5,6,7] and the most recent empirical studies being [8,9,10]. We extend earlier surveys by taking an updated set of works into account. We present a content analysis of the major spam-filtering surveys over the period 2004–2015. A significant amount of historical and recent literature, including gray literature, was studied to report recent advances and findings. We believe our survey is complementary in nature and provides an inclusive review of the state-of-the-art methods in content-based e-mail spam filtering. Our work addresses the following:

Fig. 1 The e-mail architecture

  • First, we perform an exploration of the major spam characteristics and discuss feature engineering for spam e-mails.

  • Second, we present a qualitative summary of major surveys on spam e-mails over the period 2004–2015 and a taxonomy of content-based approaches to e-mail spam filtering.

  • Third, the article reports on evaluation measures, benchmarks, and new findings, and suggests lines of future investigation for emerging spam types.

2 Feature Engineering

Feature selection is a key issue and has become the subject of much research. It has three main objectives: (i) enhancing the classifier’s predictive accuracy, (ii) building effective and economical classifiers, and (iii) obtaining a better understanding of the underlying process that generated the data. Dimensionality reduction and feature subset selection are the two preferred techniques for lowering the dimension of the feature set. Feature subset selection extracts a subset of the original attributes, whereas dimensionality reduction builds new features as linear combinations of the original ones.
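The contrast between the two techniques can be illustrated with a minimal sketch. It assumes a toy binary term-presence matrix and uses information gain as the ranking criterion for subset selection; both the data and the choice of criterion are illustrative assumptions, not something the surveyed works prescribe. The names `TERMS`, `X`, `y`, `select_top_k`, and `project`, as well as the projection matrix, are hypothetical.

```python
import math

# Toy binary term-presence matrix (assumed data): rows are e-mails, columns are terms.
TERMS = ["free", "meeting", "winner", "report"]
X = [
    [1, 0, 1, 0],  # spam
    [1, 0, 0, 0],  # spam
    [0, 1, 0, 1],  # ham
    [0, 1, 0, 0],  # ham
]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def information_gain(column, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(column):
        subset = [l for x, l in zip(column, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain


def select_top_k(X, y, k):
    """Feature subset selection: keep the indices of the k best original columns."""
    scores = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return sorted(ranked[:k])


def project(X, W):
    """Dimensionality reduction: each new feature is a linear combination
    (one row of W) of all original features."""
    return [[sum(x * w for x, w in zip(row, wrow)) for wrow in W] for row in X]


kept = select_top_k(X, y, 2)
print([TERMS[j] for j in kept])  # the surviving original terms

# Hypothetical projection: one "spam-like" and one "ham-like" component.
W = [[1, 0, 1, 0], [0, 1, 0, 1]]
print(project(X, W))
```

Subset selection keeps interpretable original terms, while the projected features mix several terms into each new dimension; in practice the projection matrix would be learned from data rather than written by hand as it is here.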

Table 1 presents a summary of feature extraction and selection in popular literature. This review article also examined a number of major earlier surveys on spam filtering over the period 2004–2015. Table 2 summarizes popular machine learning-based techniques, categorized by perspective (Algorithm, Architecture, Methods, and Trends).

Table 1 A summary of feature extraction and feature selection techniques in popular literature
Table 2 A summary of popular machine learning-based spam filtering attempts by authors according to perspective with their strengths and limitations

Articles classified under ‘Algorithm’ reflect research that focused on classification algorithms and their implementation and evaluation. Articles classified under ‘Architecture’ concentrated on the development of spam filtering infrastructures. Articles classified under ‘Methods’ study existing filtering methods, while ‘Trends’ covers discourses concentrating on emerging methods and the adaptation of spam filtering methods over time. The limitations listed in the last column for each article are those acknowledged by the authors themselves. Based on the different spam techniques and the methods researchers use to combat spam, a taxonomy of spam filtering techniques is presented in Fig. 2.

Fig. 2 A taxonomy of e-mail spam filtering techniques

3 Publicly Available Datasets

Most publicly available datasets are static; very few capture concept drift. Many authors construct their own image spam or phishing corpus. Table 3 lists public corpora used in spam filtering experiments, with associated information.

Table 3 Public corpora used in e-mail spam filtering experiments

4 Future Trends and Conclusion

Models built on old data become less accurate or inconsistent, making it imperative to rebuild the model (a phenomenon called virtual concept drift). Spam filtering is a dynamic problem that involves concept drift. While the understanding of what constitutes an unwanted message may remain the same, the statistical properties of spam e-mail change over time, since they are driven by spammers engaged in a never-ending arms race with spam filters. Another reason for concept drift could be the changing popularity of the products or scams promoted through spam. The dynamic nature of spam is one of its most testing aspects. An effective spam filter must be able to track target concept drift, adapt to it swiftly, and have a successful mechanism for identifying the drift or evolution in spam features.
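One simple way to identify drift, sketched here under stated assumptions, is to monitor the filter's error rate over a sliding window of recently labeled messages and flag drift when it rises noticeably above the rate observed when monitoring began. The class `DriftMonitor`, its `window` and `tolerance` parameters, and the thresholding rule are all hypothetical and chosen for illustration; the surveyed literature uses a variety of more principled drift detectors.

```python
from collections import deque


class DriftMonitor:
    """Flags possible concept drift when the recent error rate exceeds
    the baseline error rate by more than `tolerance` (assumed rule)."""

    def __init__(self, window=50, tolerance=0.10):
        self.window = window
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # 1 = misclassified, 0 = correct
        self.baseline = None                # error rate of the first full window

    def update(self, predicted, actual):
        """Record one labeled outcome; return True if drift is suspected."""
        self.recent.append(0 if predicted == actual else 1)
        if len(self.recent) < self.window:
            return False  # not enough evidence yet
        rate = sum(self.recent) / self.window
        if self.baseline is None:
            self.baseline = rate  # first full window defines "normal"
            return False
        return rate > self.baseline + self.tolerance


# Usage: feed (prediction, true label) pairs as user feedback arrives.
monitor = DriftMonitor(window=10, tolerance=0.2)
for _ in range(10):
    monitor.update(predicted=1, actual=1)      # filter is accurate at first
for _ in range(3):
    drift = monitor.update(predicted=0, actual=1)  # spam starts slipping through
print(drift)
```

When the monitor fires, a deployment would typically retrain or incrementally update the classifier on recent data and then reset the baseline; that step is omitted from the sketch.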

Content-based spam filtering systems, though widely adopted as a successful spam defense strategy, have unfortunately substituted the spam issue with a false positive one. Such systems achieve high accuracy, but at the cost of some false positives. False positives are more severe and expensive than spam. Reduction of false positives is another domain in e-mail spam analysis where much work needs to be done on leveraging existing algorithms. Future research must address the fact that the e-mail spam filtering problem is co-evolutionary, since spammers constantly attempt to outdo the advances in the predictive accuracy of classifiers.
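The gap between plain accuracy and the cost of false positives can be made concrete with a small sketch. It computes the false positive rate and a cost-weighted accuracy in which every legitimate message counts `lam` times as much as a spam message; this weighting scheme is one cost-sensitive measure used in the spam filtering literature, and the value `lam=9` and the toy labels below are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) with spam = 1 as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1  # legitimate mail wrongly blocked
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1  # spam that slipped through
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(y_true, y_pred):
    """Fraction of legitimate messages classified as spam."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


def weighted_accuracy(y_true, y_pred, lam=9):
    """Accuracy where each legitimate message is weighted `lam` times."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (lam * tn + tp) / (lam * (tn + fp) + (tp + fn))


# Toy labels (assumed): 4 spam and 6 legitimate messages.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(accuracy(y_true, y_pred))
print(false_positive_rate(y_true, y_pred))
print(weighted_accuracy(y_true, y_pred, lam=9))
```

With `lam=1` the weighted measure reduces to ordinary accuracy, so raising `lam` is a direct way to express how much more expensive a blocked legitimate message is than a delivered spam message.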

One of the biggest spam problems today, even as spam e-mail volumes associated with botnets recede, is snowshoe spam. Snowshoe spamming is a technique that uses multiple IP addresses, websites, and sub-networks to send spam so as to avoid detection by spam filters. Spammers distribute their spam load across a wide footprint of systems to keep from sinking, just as snowshoe wearers do. With many users today migrating to social networks as a means of communication, spammers are diversifying in order to stay in business.

E-mail prioritization is an urgent research area in which little work has been done. In addition to basic communication, e-mail systems are used for a wide variety of other tasks, such as business and personal communication, advertisements, reminders, task management, and cloud storage. There is a serious need to address the information overload issue by developing systems that can learn personal priorities from data and identify important e-mails for each user. Prioritizing e-mail according to its importance, or classifying e-mails into personalized folders as in [9, 11], is another desirable characteristic in a spam filter. Redirecting urgent messages to handheld devices could be another way of managing e-mails.

Fortunately, machine learning enables filters to learn and adapt to new threats and to react to the countermeasures adopted by spammers. No single anti-spam solution may be the right answer. A multi-faceted approach that combines legal, technical, and other solutions is likely to deal a death blow to such spam. As long as spam exists, it will continue to have adverse effects on the integrity of e-mail and on users’ perception of the effectiveness of spam filters. Overall, remarkable advances have been achieved and continue to be achieved; however, some of the outstanding problems in e-mail spam filtering highlighted above remain. Until further improvements in spam filtering are made, anti-spam research will remain an active research area.