1 Introduction

Digital products and services increasingly mediate human activities. With the advent of email communication, unsolicited emails have, in recent years, become a serious threat to global security and economy (Bergholz et al. 2010). The ease of communication via email has given rise to a vast number of issues involving the exploitation of technology to elicit personal and sensitive information. Identity theft, one of the most profitable crimes, is often employed by felons to lure unsuspecting online users into revealing confidential information such as social security numbers, account numbers, and passwords. Unsolicited emails disguised as coming from legitimate and reputable sources often attract innocent users to fraudulent sites and persuade them to disclose their sensitive information. As per the report by Kaspersky Lab, in the first quarter of 2019, such unwanted emails were responsible for \(55.97\%\) of email traffic (\(0.07\%\) more than in the fourth quarter of 2018). Unsolicited Bulk Emails (UBEs) can be broadly categorized into two distinct yet related categories: spam and phishing.

Spam emails are essentially UBEs that are sent without users’ consent, primarily for marketing purposes such as selling unlicensed medicines, illegal products, and pornography (Toolan and Carthy 2010). The growth of spam traffic is a worrisome issue as such emails consume a lot of network bandwidth, waste memory and time, and cause financial loss. Phishing emails, on the other hand, pose a much more serious threat: they aim to steal individuals’ confidential information such as bank details, social security numbers, and passwords. Most phishing attacks are focused on financial institutions (e.g., banks); however, attacks against government institutions, although not as targeted, cannot be overlooked (Bergholz et al. 2010). To understand the impact of phishing, consider pharming, a variant of phishing, where the attackers misdirect users to fraudulent sites through domain name server hijacking (Abu-Nimeh et al. 2007). The effect of spam and phishing on valid users is multi-fold:

  • Generally, UBEs promote products and services with little real value, pornography, get-rich-quick schemes, unlicensed medicines, dicey legal services, and potentially illegal offers and products.

  • UBEs often hijack real users’ identities to send spam to other users; for instance, losses from business email compromise scams such as email spoofing and domain spoofing amounted to almost \(\$1.3\) billion in 2018 (20,373 victims), nearly twice the 2017 figure (15,690 victims) (Bec scams trends and themes 2019).

  • Phishing, in particular, involves identity theft in forms such as financial identity theft, criminal identity theft, identity cloning, and business/commercial identity theft.

  • Mailing efficiency and recipient’s productivity are drastically affected by UBEs.

A study by the McKinsey Global Institute revealed that an average person spends \(28\%\) of the workweek (\(\approx\) 650 h a year) reading and responding to emails (Gang 2017). Additionally, research on SaneBox’s internal data revealed that only \(38\%\) of the emails on an average are relevant and important (Gang 2017), i.e., the time spent on relevant emails amounts to only \(\approx 11\%\) of the workweek. Furthermore, a study by the Danwood Group found that it takes an average of 64 seconds to recover from an email interruption and return to work at the rate before the interruption (Gang 2017)—adversely affecting the recipients’ productivity, especially in the case of irrelevant UBEs. Based on the Kaspersky Lab report, in 2015, the UBE share of email traffic fell below \(50\%\) for the first time since 2003 (\(\approx\) three to six million); the decline was attributed to the reduction (in billions of messages) of major botnets responsible for spam and phishing. Conversely, by the end of 2015, the UBE volume escalated again. Furthermore, the Kaspersky spam report revealed an increase in the presence of pernicious email attachments (e.g., malicious macros, malware, ransomware, and JavaScript) in spam email messages. By the end of March 2016, the UBE volume (\(\approx\) 22,890,956) had quadrupled in comparison with that witnessed in 2015. In 2017, the Internet Security Threat Report (ISTR) (Symantec 2018) estimated that the volume of spam emails had risen to an average of \(55\%\) (\(\approx 2\%\) more than that in 2015 (\(52.7\%\)) and 2016 (\(53.4\%\))). Clearly, spam and phishing rates are rapidly proliferating. The overall phishing rate in 2017, according to the ISTR (Symantec 2018), was nearly one in every 2,995 emails, while the number of Uniform Resource Locators (URLs) related to phishing rose by \(182.6\%\), accounting for \(5.8\%\) (one in every 224) of all malicious URLs.

Over the years, extensive research in this domain has revealed several plausible countermeasures to detect UBEs. Approaches such as secure email authentication entail high administrative overhead and hence are not commonly used. Machine learning and knowledge engineering are the two commonly used approaches in filtering UBEs. In knowledge engineering, UBEs are classified using a set of predefined rules; however, such approaches require constant rule updates to account for the dynamic nature of UBE attacks and often suffer from scalability issues. In machine learning approaches, the algorithm itself learns the classification rules from a training set; determining the email type through the analysis of email content and structure has emerged as the dominant strategy, owing to the success of AI-assisted approaches in UBE classification. This area of research is actively being developed to account for the dynamic nature of UBE attacks. Past works in the existing literature explore several informative features, and many machine learning algorithms have been developed and utilized to classify incoming mail into junk and non-junk categories (Toolan and Carthy 2010; Chandrasekaran et al. 2006; Toolan and Carthy 2009; Mohammad et al. 2015; Fette et al. 2007; Shams and Mercer 2013). Many leading internet service providers, including Yahoo Mail and Gmail, employ a combination of machine learning algorithms, such as neural networks, to handle the threat posed by UBEs effectively. Since machine learning models have the capacity to adapt to varying conditions, they not only filter junk emails using predefined rules but also generate new rules to adapt to the dynamic nature of UBE attacks. Despite the success, adaptability, and predictability of machine learning models, preprocessing, including feature extraction and selection, plays a critical role in the efficacy of the underlying UBE classification system (Turner et al. 1999; Michalski et al. 2013). Thus, there is a need to determine the most discriminative and informative feature subset that facilitates the classification of UBEs with a higher degree of confidence.

Due to the vast heterogeneity in the existing literature, there is no consensus on which features form the most informative and discriminative feature set. Moreover, to the best of our knowledge, only a few works have evaluated all the possible sets of features and provided insights on the importance of a feature concerning the classification of UBEs.Footnote 1 In this paper, we aim to provide an accessible tutorial to security analysts and scientists seeking to draw insights from the existing email data. First, we elucidate the extraction of vital and informative features from the email corpus (after extensive experimentation, we resorted to the features devised in the seminal work by Toolan and Carthy (2010) to achieve high performance in real-time). Then, we present six prolific and widely used feature selection (extraction) methods, including Variance-based filtering (LowVar), Correlation-based filtering (HighCorr), Feature Importance based filtering (FI), Minimum Redundancy Maximum Relevance (mRMR), and Principal Component Analysis (PCA),Footnote 2 to determine an optimal feature subspace that facilitates effective learnability and generalizability of the underlying machine learning models, thus impacting the predictability of UBEs. Finally, we evaluate the obtained optimal feature subspace using eight state-of-the-art machine learning algorithms: Naïve Bayes (NB), Support Vector Machines (SVM), Bagged Decision Trees (BDT), Random Forest (RF), Extra Trees (ET), AdaBoost (AB), Stochastic Gradient Boosting (SGB), and Voting Ensemble (VE). The key contributions of this paper are four-fold:

  • We discuss the extraction of critical and potential email features with discriminative capabilities concerning UBEs, through the analysis of both the email body-content and structure.

  • We leverage several prolific feature selection (extraction) approaches to engender an optimal informative feature subspace that enables effective and accurate UBE detection and classification.

  • We present an extensive comparative study to elucidate the applicability, learnability, and generalizability of several state-of-the-art machine learning models in facilitating UBE filtering and classification.

  • To enhance the understanding of the readers, we expose them to several feature selection and machine learning algorithms through snippets of Python code, enabling them to draw insights from the existing email data.

The rest of the paper is organized as follows: Section 2 presents an overview of the existing works and reviews their advantages and limitations, while Sect. 3 presents the background discussion. Section 4 elucidates the steps employed in the process of feature extraction from emails, feature selection from the extracted email data, and understanding the importance of a feature with respect to UBEs. The machine learning algorithms employed in the UBE classification are presented in Sect. 5. In Sect. 6, we evaluate the obtained feature subspaces using several machine learning algorithms. Finally, Sect. 7 summarizes this paper with future enhancements.

2 Related work

Table 1 Summary of some key past works that employed machine learning to facilitate UBE classification

Utilizing AI-assisted approaches for UBE detection and classification has become a prominent area of global research interest. This section reviews existing techniques utilized in the development and evaluation of potential feature sets for the classification of spam and phishing emails, and provides an overview of the existing modeling strategies.

Lueg (2005) presented a brief survey exploring the way of applying information retrieval and information filtering mechanisms to postulate spam filtering in a theoretically grounded and logical way. Although the author aimed at introducing an operationally efficient spam detector, the presented survey did not detail the simulation tools, machine learning approaches, or the datasets utilized. Wang et al. (2005) reviewed several approaches to detecting spam emails, categorized unsolicited spam emails into hierarchical folders, and facilitated automatic regulation of the tasks concerning the response to an email; however, the authors did not cover any machine learning approaches. Chandrasekaran et al. (2006) published a seminal work in UBE detection and classification; their work introduced and employed structural email features such as the content richness and the number of functional words (e.g., bank and credit) to discriminate phishing emails from legitimate ones. They used an SVM classifier to detect phishing emails and prevent them from reaching the user’s inbox, thus reducing any possible human exposure. The work by Zhong et al. (2006) chronicled an innovative spam filtering approach that ensembled several filters. Abu-Nimeh et al. (2007) compared the accuracies of classifying 2,889 emails using supervised machine learning (SML) models including SVM and RF, built on 43 potential features, and showed that the RF classifier outperformed several other classifiers (low error rate). Despite the novelty and inventiveness in these works (Chandrasekaran et al. 2006; Zhong et al. 2006; Abu-Nimeh et al. 2007), they did not benchmark their approaches against recent works.

Cormack (2008) explored the relationship between email spam detectors and spam detectors in storage media and communication, with emphasis on the efficiency of the proposed methods; the author also scrutinized the characterization of email spam (e.g., users’ information requirements). However, the work lacked detailing of certain vital components of spam filters. Sanz et al. (2008) detailed the issues concerning UBE research, the effects of such issues on the users, and the ways of reducing such effects. Their research work elucidated several machine learning algorithms utilized in UBE detection; however, it lacked a comparative analysis of various content filters. Ma et al. (2009) used a set of orthographic features to achieve an automatic clustering of phishing emails, which resulted in greater efficiency and better performance via Information Gain (IG) with the C4.5 Decision Tree (DT). They used the modified global K-means approach to generate the objective function values (over a range of tolerance values) for selected feature subsets, which assisted in the recognition of clusters. Toolan and Carthy (2009) used a recall-boosting ensemble approach based on C5.0 DT and instance-based learning to reclassify emails that the C5.0 DT classified as non-phishing; they obtained good precision through the use of C5.0 DT and \(100\%\) recall from the ensemble. Gansterer and Pölz (2009) proposed a system of filtering the incoming emails into ham, spam, and phishing, based on Feature Selection by Category (FSC), which provided better (\(97\%\)) classification accuracy (ternary classification) than that resulting from the use of two binary classifiers.

Basnet and Sung (2010) proposed a method of detecting phishing emails through the use of confidence-weighted linear classifiers; the authors only utilized the email contents as features and neglected the use of any heuristic-based phishing-specific features. A prominent work in the field of phishing email filtering was presented by Bergholz et al. (2010), where the authors described several novel features including statistical models for email topic descriptions, email text and external link analysis, and the analysis of embedded logos concerning hidden salting. Dhanaraj and Karthikeyani (2013) studied and developed approaches premeditated to mitigate email image spam; despite the creativeness in designing image-based methods, their work did not elucidate the machine learning models or the utilized corpus. Zhang et al. (2014) developed an automatic detection approach specific to Chinese e-business websites by using URL and website-content specific features. The authors employed four machine learning classifiers including RF, Sequential Minimum Optimization (SMO), logistic regression, and Naïve Bayes (NB), and evaluated their results using Chi-squared statistics (\(\chi ^2\)). Laorden et al. (2014) explained the importance of anomaly discovery in UBE filtering in reducing the requirement of classifying UBEs; their work reviewed an anomaly-based UBE filtering approach that utilized data minimization to reduce preprocessing while retaining the information relevant to the nature of an email message. Table 1 reviews other related and significant past works in the detection of spam and phishing emails.

Table 2 Summary of some key existing works in the field of UBE detection and filtering

More recently, many works have studied the applicability of different machine learning approaches, including K-Nearest Neighbors (KNN), SVM, NB, neural networks, and others, to spam and phishing email filtering, owing to the ability of such approaches to learn, adapt, and generalize. In 2016, a broad overview of some of the state-of-the-art content-based UBE filtering approaches was presented by Bhowmick and Hazarika (2016). Their work surveyed several vital concepts in UBE filtering, the effectiveness of the current efforts, and recent trends in UBE classification, while focusing on popular machine learning approaches for the detection of the nature of an email. Moreover, they discussed the changing nature of UBE attacks and examined several machine learning algorithms to combat the menace of such emails. Sah and Parmar (2017) proposed a model to effectively detect the malicious spam in emails through effective feature selection, followed by classification using three machine learning approaches: NB, SVM, and Multi-Layer Perceptron (MLP). With the promising success of deep neural architectures in various applications (Gangavarapu et al. 2019c; Jayasimha et al. 2020), some of the recent works have employed deep learning models to classify UBEs. Apruzzese et al. (2018) evaluated the applicability, effectiveness, and current maturity of deep and machine learning models in the detection of malware, intrusion, and spam. The authors concluded that utilizing different machine learning classifiers for specific tasks can increase the UBE detection performance; however, they drew no significant conclusions concerning deep neural models. Hassanpour et al. (2018) modeled the email content as Word2Vec-style features and classified them using several deep learning classification approaches, achieving an overall accuracy of \(96\%\). Vorobeychik and Kantarcioglu (2018) used adversarial machine learning to generate email samples and trained the classifier to distinguish those generated samples, making the learning model robust to adversarial manipulation and decision-time attacks. The authors concluded with a note on several issues concerning adversarial modeling that warrant further research. More prominent and impactful research works in the domain of UBE detection and filtering are tabulated in Table 2.

Some of the works presented in Table 2 employed feature-free approaches to facilitate spam and phishing detection; however, such approaches suffer from high computational complexity and cost of training. Some research works considered the email header, subject line, and body as the most prominent features in classifying UBEs. However, it is worth noting that a suspicious email header, subject line, or body could be misleading, and behavior-based email features could be essential to facilitate accurate classification of UBEs. Most of the researchers focused on the classification performance in terms of classification accuracy. This work differs from previous efforts by revisiting various state-of-the-art machine learning approaches for UBE classification. We employ feature selection to engender an optimal feature subspace that lowers the computational complexity and enhances the classification performance. Additionally, we present several key performance indicators other than classification accuracy to assess the performance of the underlying models accurately. Furthermore, we present an accessible tutorial to security specialists through snippets of Python code, intended to expose them to the presented feature selection and machine learning algorithms.

3 Background

Certain email features (e.g., keywords such as debit, verify, and account) are more prominent in UBEs than in ham emails, and by measuring the rate of occurrence of such features, we can ascertain the probabilities for those email characteristics, which in turn aids in the determination of the email type. The existing literature presents a wide variety of techniques to determine and utilize such discriminative features, and in this section, we describe the different categories of UBE filtering approaches widely used to overcome the menace of such emails. We also elucidate the UBE filters widely used by popular internet service providers to curtail the dangers posed by email-borne spam, phishing, and malware.

3.1 Categorization of the existing UBE filtering techniques

Over the years, academicians and researchers have proposed various UBE detection and filtering approaches which have been utilized successfully to classify email data into groups. These approaches can be broadly categorized into: content-based and behavior-based filters, sample base or case base filters, rule-based or heuristic filters, previous likeness based filters, and adaptive filters.

3.1.1 Content-based and behavior-based filters

Content-based and behavior-based UBE filtering approaches aim at analyzing the email content and structure to create automatic classification rules using machine and deep learning approaches such as KNN, NB, MLP, and neural networks. Content-based and behavior-based filters analyze the tokens (words), their distribution, occurrences, and co-occurrences, in addition to scripts and URLs, in the context of emails, and then utilize the learned knowledge to generate rules that facilitate automatic filtering of incoming UBEs (Christina et al. 2010).

3.1.2 Sample base or case base filters

Sample base or case base filtering techniques are popular in spam and phishing email filtering. Through an email collection model, all the emails, including ham, spam, and phishing, are extracted from every user’s email. Then, preprocessing of the raw email data into a machine-processable form is facilitated through feature selection (extraction) and grouping of the email data. Finally, the preprocessed data is mapped into distinct UBE categories, and a machine learning algorithm is trained on the existing email data. The trained models are then tested on incoming emails to categorize them into ham, spam, or phishing (Christina et al. 2010).

3.1.3 Rule-based or heuristic filters

Rule-based or heuristic UBE filtering approaches [e.g., SpamAssassin (Mendez et al. 2006)] utilize existing heuristics or rules to assess several patterns (specifically, regular expressions) against an incoming email message—the score of an incoming email depends on the number of patterns in the email message that match the preset regular expressions (when the patterns in the email message do not correspond to the preset regular expressions, the score is reduced). The UBEs are then filtered using a specific predetermined threshold. While certain heuristics do not change over time, other heuristics require constant updating to cope with the changing and dynamic nature of UBEs (Christina et al. 2010).
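As an illustration, the following minimal sketch scores an email against a handful of weighted regular-expression rules and flags it once a preset threshold is reached; the rules, weights, threshold, and function name are illustrative assumptions, not any actual filter’s rule set.

```python
import re

# Hypothetical heuristic rules: each regular expression carries a weight,
# loosely mimicking how rule-based filters score messages.
RULES = [
    (re.compile(r"verify\s+your\s+account", re.I), 2.5),
    (re.compile(r"click\s+here", re.I), 1.5),
    (re.compile(r"https?://\d{1,3}(\.\d{1,3}){3}"), 3.0),  # raw IP-based URL
]

def score_email(text, threshold=5.0):
    """Sum the weights of all matching rules; flag the email as a UBE
    if the total score reaches the preset threshold."""
    score = sum(weight for pattern, weight in RULES if pattern.search(text))
    return score, score >= threshold

print(score_email("Click here to verify your account: http://192.168.0.1/x"))
```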

3.1.4 Previous likeness based filters

Previous likeness based UBE filtering approaches utilize instance-based or memory-based machine learning approaches to classify incoming email messages based on their likeness and resemblance to the stored training sample emails. A multi-dimensional vector is created using the attributes of the sample emails, which is then used to plot new instances. The KNN classifier is employed to classify incoming email messages: a new instance is mapped to a target class using the most common class among the K-nearest neighbors of the point (Sakkis et al. 2001).
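A minimal sketch of such previous-likeness classification with scikit-learn’s KNeighborsClassifier (the toy vectors and labels are placeholders for attributes extracted from stored sample emails):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors standing in for attributes of stored sample emails
# (rows) and their labels (0 = ham, 1 = spam); purely illustrative.
X_train = np.array([[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K nearest stored emails
knn.fit(X_train, y_train)

# A new email is mapped to the most common class among its K neighbors.
print(knn.predict([[0.85, 0.75]]))  # -> [1]
```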

3.1.5 Adaptive filters

Adaptive UBE filtering approaches facilitate the detection and classification of UBEs by categorizing emails into distinct groups. In this approach, the email corpus is segregated into several groups, and each group has an emblematic (representative) text. The similarity between an incoming email and a particular group determines the email message’s score with respect to that group, and the scores computed across all the groups are utilized in deciding the most probable group for the incoming email message (Pelletier et al. 2004).
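A minimal sketch of this group-similarity scoring, assuming TF-IDF vectors and cosine similarity as the similarity measure (the emblematic group texts are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each group is represented by an emblematic text; an incoming email is
# scored against every group, and the highest-scoring group wins.
groups = {
    "ham": "meeting agenda attached see you tomorrow",
    "spam": "buy cheap meds discount offer limited time",
    "phishing": "verify your account password urgent bank",
}

vectorizer = TfidfVectorizer()
group_vecs = vectorizer.fit_transform(groups.values())

def most_probable_group(email_text):
    email_vec = vectorizer.transform([email_text])
    scores = cosine_similarity(email_vec, group_vecs).ravel()
    return list(groups)[int(np.argmax(scores))]

print(most_probable_group("urgent: verify your bank account password"))
```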

3.2 UBE filters: how yahoo mail and gmail filter UBEs

Leading internet service providers, including Yahoo Mail and Gmail, have employed several machine learning approaches, such as neural networks, to handle the threat posed by UBEs effectively. Recent research revealed that the machine learning model employed by Google facilitates the detection of UBEs with \(99.9\%\) classification accuracy—only one in a thousand email messages succeeds in evading the UBE filter in Gmail. To account for the considerable UBE volume (\(\approx\) 50–70% of the emails), the UBE detection models developed by Google incorporate Google Safe Browsing tools to identify websites with malicious URLs. The performance of UBE filtering is enhanced further through additional, comprehensive scrutiny of phishing emails. Such in-depth examination causes additional delay; however, only \(0.05\%\) of the emails are subject to such delay. Further details on the UBE filters employed by popular internet service providers are presented in the following subsections.

3.2.1 Yahoo mail UBE filtering

Yahoo Mail is one of the first free webmail service providers, with more than 320 million users. Yahoo Mail utilizes several algorithms and a combination of methods rooted in basic techniques, including the analysis of spam and email content, users’ complaints, and URL filtering. The email provider employs email filtering by domains rather than by IP addresses. Furthermore, Yahoo Mail provides ways of preventing a valid internet user from being mistaken for a cybercriminal (e.g., the ability to troubleshoot SMTP errors using SMTP logs). The complaint feedback loop service helps users maintain trust in the services and UBE filtering approaches employed by Yahoo Mail. Moreover, the email service provider also facilitates Yahoo whitelisting (return path certification and internal whitelisting)—unlike blacklisting, whitelisting lets the user specify a list of trusted senders to receive email messages from. The service user can employ a combination of Yahoo’s spam-fighting techniques along with whitelisting to reduce the volume of legitimate emails being erroneously classified as unsolicited emails. Whitelisting alone can be overly strict towards unapproved senders, in which case Yahoo Mail utilizes an automatic whitelisting procedure, where the anonymous sender’s address is checked against a database for any history of spamming or phishing—if the unapproved user has no record of cyber attacks, the email message is sent to the recipient, and the sender’s address is added to the whitelist.

3.2.2 Gmail UBE filtering

Google Mail employs hundreds of rules to determine the nature of an incoming email—each rule depicts a specific feature or aspect of a UBE with a statistical value reflecting the likelihood that the feature corresponds to UBEs. The weighted importance of the features is utilized to determine the final score for an incoming email message. The score is measured against a sensitivity threshold determined using each user’s UBE filter, and consequently, an incoming email is classified as ham or unsolicited. Unlike Yahoo Mail, Gmail filters email messages by IP addresses rather than by domains. To facilitate accurate classification of UBEs, Gmail utilizes state-of-the-art machine learning algorithms, including neural networks and logistic regression. Additionally, to shield Gmail users from any possible image UBEs, Google utilizes optical character recognition. Furthermore, UBE filtering in Gmail is greatly enhanced by linking several features through the machine learning algorithms utilized in combining and ranking large sets of Google search results. Factors like links in the email message headers and domain reputation depict the evolving and dynamic nature of UBEs over time—due to these factors, legitimate emails could be misclassified as UBEs. With the emergence of state-of-the-art algorithms, tools, users’ feedback, and new UBE discovery, the filtering settings are updated continuously.

Fig. 1: An overview of the procedure employed to draw inferences from the collected data

4 Methods: feature extraction and selection

In this section, we focus on describing the way of processing the raw email dataFootnote 3 based on the forty discriminative features devised by Toolan and Carthy (2010), to facilitate the detection of spam and phishing emails. Moreover, we elucidate how to determine the importance of a feature with respect to UBEs. The following subsections give detailed insights into the entire procedure employed as a part of feature engineering, which deals with the process of transforming raw email data into informative and discriminative features that better represent the underlying email corpus. Such representations aid the classification models to learn, adapt, and generalize, which is essential in the accurate classification of unseen email instances. The entire workflow of the procedure employed to draw informative inferences from the raw email data is depicted in Fig. 1. The text is accompanied by snippets of Python code to familiarize the readers with the methods utilized in this study. The code is aimed at readers with Python familiarity; more resources can be found at https://www.python.org/about/gettingstarted/.

4.1 Materials: raw email corpus

Table 3 Summary of the email corpora utilized in this study

Most of the existing publicly available datasets, including the spam archive (Almeida and Yamakami 2010; Biggio et al. 2011), phishing corpus (Abu-Nimeh et al. 2007), and Princeton spam image benchmark (Wang et al. 2007), are lopsided towards UBE detection—the volume of UBEs utilized in evaluating the filter is much greater than that of ham emails, resulting in the machine learner recording a higher accuracy by concentrating solely on detecting UBEs, which might not scale well to real-world data. Hence, a more suitable dataset is one with near-equal volumes of ham and non-ham emails, thus facilitating the underlying machine learner to learn and discriminate between ham emails and UBEs. The raw email data used in this paper consists of 3,844 emails in total, comprising 2,551 ham emails (\(\approx 66.4\%\)), 793 phishing emails (303 from 2015 and 490 from 2016, contributing to \(\approx 20.6\%\)), and 500 spam emails (\(\approx 13\%\)). These emails were collected from a variety of sourcesFootnote 4—the spam and ham emails were collected from the SpamAssassin project (2002) (Mendez et al. 2006), while Nazario (2018) provided the phishing emails (see Table 3). We mine these emails to extract the information needed to facilitate the accurate classification of those emails into ham, spam, and phishing emails. To clarify the methods and techniques presented in this study and present all the intermediate results, we use the test email presented in Block 1. Note that the test email is constructed in a way that includes most characteristics of a UBE—such a choice can help mitigate the sampling problem while presenting intermediate results.

[Block 1: the test email used throughout this section to illustrate the intermediate results]

From the test email in Block 1, it can be observed that an email contains additional ‘metadata,’ including the reply-to address, from address, to address, and others (lines 1 to 26), that can be explored to aid in the classification of the email into ham, spam, or phishing. The following subsection presents a detailed discussion on the features of a given email [derived from Toolan and Carthy (2010)] that are prominent in the prediction of the nature of an email.

4.2 Preprocessing and feature extraction: obtaining informative feature space

In this section, we discuss the features employed in this study to transform raw email data into a machine-processable form. These features are internal to the emails and are not derived from external sources such as search engine information, SpamAssassin scores, or domain registry information. Such external features were neglected, owing to the fact that such information might not always be present, and hence cannot be a part of a truly automated UBE filtering system. Moreover, research has shown that features internal to emails form a comparatively more informative feature set, as most of the external data, including search engine results or domain name service information, changes regularly.

As stated earlier, we carried out several experiments on the obtained email corpus to determine a suitable feature space that best represents the underlying corpus. These experiments included the utilization of advanced content-based features and topics extracted using a paragraph vector network (vector size of 200) and the hierarchical Dirichlet process (150 topics); however, the addition of such sophisticated features did not enhance the classification performance and instead increased the computational complexity of training. Additionally, we employed the genetic algorithm (population size of 50, crossover rate of 0.6, and mutation rate of 0.1 for 25 iterations) to facilitate feature selection among the advanced content-based features and topics—this increased the training time with no significant improvement in the performance. The final feature space used in this study comprises forty informative features with spam and phishing email discrimination capabilities, which can be roughly divided into five distinct categories:

  • Body-based features: the features that are extracted from the email message content.

  • Subject line based features: the features that are extracted from the subject line of the email.

  • Sender address based features: the features that are extracted from the information about the email address of the sender.

  • URL-based features: the features that are extracted from the anchor tags of HTML emails.

  • Script-based features: the features that are extracted from the information concerning the presence or absence of scripts in the email and the impact of such scripts.

The feature space composed of forty features is tabulated in Table 4. These features include nine body-based, eight subject line based, four sender address based, thirteen URL-based, and six script-based features.

Table 4 The forty features utilized in the transformation of raw email data for the determination of the nature of an email

Note the presence of features like body_numFunctionWords, body_suspension, body_verifyYourAccount, subject_verify, subject_debit, and subject_bank—these features require an exact word-to-word match, and their values could easily be miscalculated due to deliberate spelling errors, unattended typographical errors (e.g., ‘bank’ and ‘bnak’), or the usage of verb forms (e.g., ‘bank’ and ‘banking’). To cope with these shortcomings and obtain a standard canonical form from the raw email textual entries, we used the implementations in the Python NLTK library. The canonical form was obtained through tokenization, stemming, and lemmatization. In tokenization, we aimed at transforming the given text in the raw email entry into smaller words (tokens). Then, we facilitated suffix stripping using stemming, followed by lemmatization to convert the suffix-stripped words to their base forms. Moreover, to handle spelling and typographical errors, we employed Jaro similarity scoring (Gangavarapu et al. 2019a, b) (through the implementations in the Python textdistance library) between the intended word spelling and the actual spelling. The Jaro similarity score is normalized (range of [0, 1]), and is given by,

$$\begin{aligned} \text {Jaro}(t_i, t_j) = {\left\{ \begin{array}{ll} 0, &{} m = 0 \\ \frac{1}{3}\left( \frac{m}{|t_i|} + \frac{m}{|t_j|} + \frac{2m - T}{2m}\right) , &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where \(t_i\) (of length \(|t_i|\)) and \(t_j\) (of length \(|t_j|\)) are the tokens under comparison, with m matching characters and T transpositions. The threshold that determines if the two tokens under comparison are the same was set to 0.9. The code in Block 2 details the entire preprocessing procedure utilized to obtain the canonical form. Thus, we mitigated the shortcomings arising due to spelling errors, typographical errors, and irregular verb forms.

[Block 2: preprocessing raw email text into a canonical form]
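A minimal sketch of this preprocessing pipeline—tokenization, stemming, and lemmatization via NLTK, with textdistance’s Jaro similarity at the 0.9 threshold for fuzzy keyword matching; the helper names are ours, not necessarily those of Block 2:

```python
import nltk
import textdistance
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt"); nltk.download("wordnet")

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def canonicalize(text):
    """Tokenize, stem, and lemmatize raw email text into a canonical form."""
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(stemmer.stem(token)) for token in tokens]

def fuzzy_contains(tokens, keyword, threshold=0.9):
    """Treat a token as a match for a keyword (e.g., 'bank' vs. 'bnak')
    if its Jaro similarity reaches the 0.9 threshold used in this study."""
    return any(textdistance.jaro(token, keyword) >= threshold
               for token in tokens)

tokens = canonicalize("Please verify your bnak account")
print(fuzzy_contains(tokens, "bank"))  # True, despite the spelling error
```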

4.2.1 Using python for feature extraction

Feature extraction aims at transforming raw email data into informative features that best represent the data without any loss of information. In our email corpus, we have 3,844 emails (see Sect. 4.1). As explained in Sect. 4.2, we need to extract forty features (refer to Table 4) from the collected raw email data. Before extracting the features, it is vital to parse the email to obtain the email body, subject line, sender address, reply-to address, modal URL, and all the links. We utilized the implementations in several Python libraries, including re, urlparse, BeautifulSoup, email, HTMLParser, and IPy. Before proceeding any further, ensure that the encoding is set to UTF-8. The code in Block 3 elucidates the way of extracting several parts (e.g., the email body) from a raw email.

[Block 3: extracting the parts of a raw email]
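A minimal sketch of this parsing step, using the standard email library; the inline message is an illustrative stand-in for the test email of Block 1:

```python
import email
from email import policy

# Parse a raw RFC 822 message and pull out the fields used downstream.
raw = ("From: \"Support\" <support@example.com>\n"
       "Reply-To: attacker@example.net\n"
       "Subject: Verify your account\n\n"
       "Click http://192.168.0.1/login to verify.\n")

msg = email.message_from_string(raw, policy=policy.default)
subject, sender, reply_to = msg["Subject"], msg["From"], msg["Reply-To"]

# For multipart messages, walk the parts and keep the text payloads.
body = ""
for part in msg.walk():
    if part.get_content_type() == "text/plain":
        body += part.get_content()

print(subject, sender, reply_to, body, sep="\n")
```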

The implementations in the Python email library provide extensive support to handle and parse email data and multipurpose internet mail extensions. First, we parsed the raw email data from the string format into the email format, which was then utilized to extract various parts of the email. To ensure consistency in the UTF-8 encoding, we first decoded the required field and then encoded it in Unicode. The modal domain is the most frequently used domain in the email (Fette et al. 2007). Finally, to find all the links in the email, we needed to extract all the URLs linked in the form of href attributes, as well as those present as plain text in the email, i.e., both anchor and non-anchor links comprising both internal and external email links. We used the implementations in the Python lxml library, which provides a simple and powerful API to parse both XML and HTML. Now that we have extracted various parts of the email, we need to obtain the features from each part, as shown in Table 4.

[Block 4: extracting and checking for HTML tags and forms in the email body]

Since most of the body-based features such as body_numWords, body_richness, and body_numCharacters are easy to extract, we have only shown the process of extracting and checking for HTML tags and forms in the email (see Block 4); a sketch follows below. All the subject line based features are easily implementable through elementary Python programming modules.
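A minimal sketch of such body-based checks using BeautifulSoup; the feature names mirror Table 4, and defining body_richness as the ratio of word count to character count is our assumption:

```python
from bs4 import BeautifulSoup

def body_features(body):
    """Extract a few body-based features from the email body text."""
    soup = BeautifulSoup(body, "html.parser")
    words = body.split()
    return {
        "body_numWords": len(words),
        "body_numCharacters": len(body),
        "body_richness": len(words) / max(len(body), 1),  # assumed definition
        "body_html": soup.find("html") is not None,
        "body_forms": soup.find("form") is not None,
    }

print(body_features("<html><form action='http://x.io'>Verify now</form></html>"))
```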

[Block 5: utility methods for sender address based features]

Utilizing the utility methods listed in Block 5, we can straightforwardly obtain the sender address based features. Note that the sender address in the email is not merely the address, but is usually of the form: “Tushaar Gangavarapu” \(<tushaar@nitk.edu.in>\) (Toolan and Carthy 2010). URL-based features are among the most important in the determination of the nature of the email, and most of them are related to IP addresses. We use the implementations in the Python IPy package to facilitate the extraction of URL-based features (see Block 6).

[Block 6: extracting URL-based features using the IPy package]
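A minimal sketch of the IP-related URL checks, assuming the urllib.parse and IPy APIs (the function name is_ip_address is ours):

```python
from urllib.parse import urlparse
from IPy import IP

def is_ip_address(url):
    """Check whether a URL's host is a dotted IP address; urlparse's
    hostname attribute already drops any :port suffix, which IP(.)
    cannot handle (see the note below)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        IP(host)
        return True
    except ValueError:
        return False

urls = ["http://192.168.0.1:8080/login", "https://example.com/login"]
print([is_ip_address(u) for u in urls])           # [True, False]
print(len({urlparse(u).hostname for u in urls}))  # count of distinct domains
```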

Note that the function IP(.) uses the dotted IP format without the port number; thus, if the port number is present in the IP address, it must be excluded before any further processing. Moreover, while obtaining the count of the domains in the email, we must include the domains of both the sender and the reply-to addresses. All the other URL-based features, such as url_ports, url_numPorts, and others, can be implemented effortlessly using the above-established methods. Finally, we show how to mine script-based features from the email body in Block 7.

[Block 7: mining script-based features from the email body]
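A minimal sketch of such mining; the feature names are hypothetical stand-ins for the presence of scripts and for external JavaScript loaded from outside the modal (most frequently used) domain:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def script_features(body, modal_domain):
    """Detect script tags and external JavaScript sources whose host
    differs from the modal domain of the email."""
    soup = BeautifulSoup(body, "html.parser")
    scripts = soup.find_all("script")
    external = [urlparse(s["src"]).hostname
                for s in scripts if s.has_attr("src")]
    return {
        "script_scripts": len(scripts) > 0,
        "script_otherDomain": any(h != modal_domain for h in external),
    }

body = '<script src="http://evil.example.net/x.js"></script>'
print(script_features(body, modal_domain="example.com"))
```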

Using the above utility methods, we can easily verify if the JavaScript comes from outside the modal domain. Table 5 shows the scores of all the forty features concerning the test email presented in Block 1. Now that we have obtained the feature space (forty informative features) from the given email, the subsequent step is to measure the importance of each feature, to understand its contribution towards the determination of the nature of a given email.

Table 5 The scores of all the forty features concerning the test email

4.3 Feature selection: engendering optimal feature space

In this study, we employ three combinations of the available ham (\(\mathcal{H}\)), spam (\(\mathcal{S}\)), and phishing (2015: \(\mathcal{P}_{2015}\), 2016: \(\mathcal{P}_{2016}\)) email data, to obtain three datasets, as shown in Table 6. The first dataset comprises ham and spam components and is aimed at investigating the efficacy of the proposed approaches in spam detection, while the second dataset comprises ham and phishing components and investigates the efficacy of the proposed techniques in phishing detection. Such individual analysis is useful in understanding and analyzing the relative importance of features in spam and phishing email detection, respectively. The third dataset comprises all three components and reflects the fact that real-world email data is composed of ham, spam, and phishing email data. All the experiments performed in this study employ these three datasets.

Not all the features in the obtained feature space contribute towards the accurate classification of the email type, which makes it mandatory to eliminate features of negative or no importance.Footnote 5 We introduce a few of the many feature selection (extraction) techniques, including mRMR (Peng et al. 2005) and PCA (Pearson 1901).

Table 6 Statistics of the datasets utilized in this study

One of the prominent considerations of feature selection (extraction) techniques is the determination of the number of features (dimensions, denoted by \(k\)) to extract. There exists no single method to determine \(k\); it is application dependent—a smaller number of dimensions suffices when the goal is to obtain insights about the data, but not when developing predictive models (Kosinski et al. 2016).

4.3.1 Obtaining the optimal threshold for threshold-based approaches

Several feature selection approaches, including the missing values filter and the low variance filter, require a threshold to be preset—the threshold is primarily dependent on the input data. The preset threshold determines if a given feature is important enough to affect the classification or not. Lower values of the threshold include most of the features from the given feature space, thus under-fitting the data, while higher values of the threshold exclude most of the features, causing the loss of critical information. Hence, finding an optimal threshold that facilitates optimal feature selection is vital. Algorithm 1 elucidates the process of obtaining the optimal threshold; it relies on two utility functions:

  • scoreFn(featureColumn): returns the score that is specific to a feature selection technique (e.g., variance in case of low variance filter) for a given feature column.

  • compareFn(score, threshold): returns a Boolean value that is subject to a technique-specific comparison of the score and the threshold (e.g., score < threshold, returns true for variance filter and feature importance filter, and false for missing values filter).

This procedure (Algorithm 1) is dependent on the underlying machine learning algorithm used to compute the performance (accuracy); this work conducts an extensive study involving eight state-of-the-art machine learning algorithms (see Sect. 6). Thus, to accommodate all the utilized machine learning algorithms, we chose the smallest, most frequently occurring threshold. Note that the thresholds were computed using the training datasets and then utilized on the testing datasets.
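A minimal sketch of this search, assuming scoreFn and compareFn behave as described above (the Python names optimal_threshold, score_fn, and compare_fn are ours); it returns the best threshold for a single model, whereas this study repeats the search across all eight classifiers and keeps the smallest, most frequently occurring threshold:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def optimal_threshold(X, y, model, score_fn, compare_fn, step=0.01):
    """Sweep candidate thresholds; at each threshold, drop the features
    for which compare_fn(score, threshold) holds, and score the surviving
    subspace with cross-validated accuracy."""
    scores = np.apply_along_axis(score_fn, 0, X)  # one score per feature
    best_threshold, best_accuracy = 0.0, -np.inf
    for threshold in np.arange(0.0, 1.0 + step, step):
        keep = [i for i, s in enumerate(scores)
                if not compare_fn(s, threshold)]
        if not keep:
            continue
        accuracy = cross_val_score(model, X[:, keep], y, cv=5).mean()
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold

# Example (low variance filter):
#   optimal_threshold(X, y, RandomForestClassifier(), np.var,
#                     lambda s, t: s < t, step=0.01)
```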

4.3.2 Handling the missing attribute values

Usually, handling missing values is accomplished through either deletion techniques, such as pair-wise deletion and list-wise deletion, or imputation techniques, such as hot-deck imputation, cold-deck imputation, and Monte Carlo simulation based multiple data imputation. In most cases, if a data column (feature) has only \(5\%\) to \(10\%\) of the required data, then it is less likely to be useful in the classification of most samples (Silipo et al. 2014). The missing values ratio captures a relative value indicating the number of missing rows, and this value is compared with the preset threshold to infer if the data is to be subject to deletion or imputation. The missing values ratio is computed as:

$$\begin{aligned} \text {Missing}\,\text{values}\,\text{ratio} = \frac{\text {Number of missing rows}}{\text {Total number of rows}} \end{aligned}$$
(2)
[Algorithm 1: obtaining the optimal threshold for threshold-based feature selection]
[Algorithm 2: handling the missing attribute values]

The procedure followed in handling missing attribute values is explained in Algorithm 2. In this procedure, we utilize the utility function missing(featureColumn), which returns the list of missing rows in the given feature column. The preset threshold value used in Algorithm 2 can be computed using the procedure in Algorithm 1, with a step value of 0.1 (Silipo et al. 2014). Since the datasets utilized in this study were programmatically mined, we have considered all possible cases to avoid any missing values.
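A compact pandas rendering of this logic, computing the missing values ratio of Eq. 2 per feature; mean imputation for the retained columns and the function name handle_missing are our assumptions:

```python
import pandas as pd

def handle_missing(df, threshold=0.5):
    """Drop features whose missing values ratio exceeds the preset
    threshold; mean-impute the remaining gaps (an assumed choice)."""
    ratio = df.isna().mean()              # missing rows / total rows
    dropped = ratio[ratio > threshold].index
    kept = df.drop(columns=dropped)
    return kept.fillna(kept.mean(numeric_only=True))

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, None, 1.0]})
print(handle_missing(df, threshold=0.5))  # 'b' is dropped, 'a' is imputed
```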

4.3.3 Feature selection using low variance filter (LowVar)

[Algorithm 3: feature selection using the low variance filter]

One of the many ways of measuring the contribution of a feature (data column) towards the classification performance is by measuring the variance (sample varianceFootnote 6) of the values of that feature. Variance measures the amount of dispersion provided by the values in the given data; evidently, zero variance is the limiting case, where the values of a feature are constant—such a feature offers no inference. Variance (Var(.)) is computed as:

$$\begin{aligned} \texttt {Var}(X) = \frac{1}{N-1}{\sum _{x_i \in X}(x_i - \bar{x})^2} \end{aligned}$$
(3)

where \(\bar{x}\) is the arithmetic mean of X. The computed variance is compared with the preset threshold [the threshold obtained using the procedure in Algorithm 1, with a step value of 0.01 (Silipo et al. 2014)] to infer about the contribution of a feature in the classification performance—this study employs a preset threshold of 0.01 for the LowVar approach.

The procedure to remove the features with low variance is described in Algorithm 3. Note that the feature values are normalized prior to low variance filtering, to avoid any unnecessary bias arising due to the data irregularities. It is interesting to note that, by using the correlation between a feature and the target variable as the scoring scheme instead of variance, we obtain a low correlation filter.

4.3.4 Removing redundancy by measuring correlation (HighCorr)

[Algorithm 4: removing redundant features using the correlation filter]

Sometimes, the features in a dataset are correlated, i.e., they depend on one another and thus carry nearly the same information (data redundancy). A group of redundant features can be replaced with just one of them without any loss of information; such replacement reduces the computational time and can enhance the prediction accuracy. In this paper, we utilize the Pearson correlation coefficient, denoted by Corr(\(X_1,X_2\)) (Pearson 1920; Nagelkerke 1991) [other correlation measures include the Kendall Tau correlation and the Spearman rank correlation (Bolboaca and Jäntschi 2006)], given by:

$$\begin{aligned} \texttt {Corr}(X_1,X_2) = \frac{\texttt {E}[(X_1 - \bar{x_1})(X_2 - \bar{x_2})]}{\sqrt{\texttt {Var}(X_1)} \cdot \sqrt{\texttt {Var}(X_2)}} \end{aligned}$$
(4)

where \(\bar{x_1}\) and \(\bar{x_2}\) denote the arithmetic means of \(X_1\) and \(X_2\) respectively, and E[x] denotes the expected value of x.

Algorithm 4 details the procedure to eliminate redundancy using a correlation-based filter. Correlation computed using Eq. 4 is compared with a preset threshold [the threshold obtained using the procedure in Algorithm 1, with a step value of 0.1 (Silipo et al. 2014)] to infer if a feature is to be included or excluded in the classification—this study employs a preset threshold of 0.5 for the HighCorr approach.

4.3.5 Measuring feature importance using the random forest classifier (FI)

RFs, often referred to as DT ensembles, can be utilized for feature selection (Díaz-Uriarte and De Andres 2006). We can obtain the importance of a feature by using a broad set of carefully constructed trees against the prediction attribute and analyzing the usage statistics of each feature. The process of obtaining the feature importance involves the creation of shallow trees and checking if an attribute appears as a splitting attribute in most of the constructed trees, in which case that particular feature is regarded as informative. Upon the generation of the ensemble of trees, each feature in the feature space can be scored against the number of times that specific feature has been selected as the splitting attribute and the level of the tree at which it has been selected. This method is usually robust to noise and is usually faster than boosting and bagging (Breiman 2001).

Usually, feature importance is computed as the Gini impurity or the mean decrease in impurity (Breiman 2017, 2002; Louppe et al. 2013), which measures the total decrease in the node impurity—a measure of how much the classification performance decreases upon dropping a particular feature. The value of FI for a feature (\(X_m\)) can be computed as:

$$\begin{aligned} \texttt {Imp}(X_m) = \frac{1}{N_T}{\sum _{t=1}^{N_T}\sum _{n \in \phi _t}\mathbb {1}(v_{n} = m)\left[ p(n) \cdot \Delta i(n)\right] } \end{aligned}$$
(5)

where \(N_T\) is the number of trees, \(\phi _t\) denotes the tth tree structure, n is a node in the tree \(\phi _t\), \(v_n\) denotes the variable at node n, p(n) is the measure \(N_n/N\) of the samples reaching node n, and \(\Delta i(n)\) denotes the impurity reduction (e.g., Shannon entropy, Gini index, and variance of the target variable) at node n. The impurity reduction at a node n is given by (R: right, L: left):Footnote 7

$$\begin{aligned} \Delta i(n) = i(n) - \frac{N_{n_L}}{N_n}i(n_L) - \frac{N_{n_R}}{N_n}i(n_R) \end{aligned}$$
(6)

Upon the computation of the importance of all the features in the feature space using Eq. 5, the FI scores are compared with a preset threshold [the threshold obtained using the procedure in Algorithm 1, with a step value of 0.01 (Silipo et al. 2014)] to infer if a feature is to be included or excluded in the classification—this study employs a preset threshold of 0.06 for the FI approach.

4.3.6 Feature selection using minimum redundancy maximum relevance (mRMR)

The mRMR approach (Peng et al. 2005; Chanduka et al. 2018) is an information-based incremental feature selection technique (filter approach) that aims at integrating the relevance [defined as the distributional similarity between the feature vector and the target vector (Auffarth et al. 2010)] and redundancy (\(\propto\) 1/robustness) information into a single scoring function. Relevance can be measured through the mutual information (MI) between two given random variables. MI quantitatively measures the amount of information (bits or Shannons) that two random variables share, and is given by (the following holds for discrete variables; for continuous variables, we integrate over all values of \(X_1\) and \(X_2\)):

$$\begin{aligned} \texttt {MI}(X_1;X_2) = \sum _{x_2 \in X_2}\sum _{x_1 \in X_1} Pr (x_1, x_2) \cdot \log \left( \frac{ Pr (x_1, x_2)}{ Pr (x_1)\, Pr (x_2)}\right) \end{aligned}$$
(7)
$$\begin{aligned} \texttt {MI}(X_1;X_2) = H(X_1) + H(X_2) - H(X_1,X_2) \end{aligned}$$
(8)

where \(Pr (x_1, x_2)\) denotes the joint probability, which measures the likelihood of both \(x_1\) and \(x_2\) occurring together, and is estimated by a histogram or a kernel-based Probability Density Function (PDF) estimator of one or two variables; \(Pr (x_i)\) denotes the marginal probability of \(X_i\). MI can be expressed in terms of entropy (see Eq. 8), where the entropy measures the uncertainty of a random variable (Vergara and Estévez 2014) and can be computed as:

$$\begin{aligned} H(X) = -\sum _{x_i} p(x_i)\cdot \log _2(p(x_i)) \end{aligned}$$
(9)

Ultimately, we aim at maximizing MI(\(X^{'};Y\)), where \(X \in \mathbb {R}^d\) and \(X^{'} \in \mathbb {R}^k = \{x^{(1)}, x^{(2)},\cdots ,x^{(k)}\}\), \(k < d\). It is hard to estimate the joint probability of high-dimensional variables using a histogram or a kernel-based PDF estimator, as the number of samples needed to estimate the PDF increases exponentially with the number of dimensions (Rossi et al. 2006). To cope with this issue, the objective function is modified so that it can be estimated from the available samples.

It is essential to understand that the features contributing to a high MI index need not necessarily be non-redundant, and hence it is crucial to consider redundancy along with MI to obtain an optimal representative set of k features. The objective function \(\Phi\) (mRMRFootnote 8) is employed to balance the trade-off between redundancy and relevance, and is computed as:

$$\begin{aligned} \Phi = R - R^{-} = \frac{1}{|X^{'}|}\sum _{x^{(i)}} \texttt {MI}(x^{(i)};Y) - \frac{1}{|X^{'}|^2}\sum _{x^{(i)},x^{(j)}} \texttt {MI}(x^{(i)};x^{(j)}) \end{aligned}$$
(10)

where R measures the average relevance of the feature vectors with the target vector, while \(R^{-}\) captures the average pair-wise redundancy among the selected features; thus, by maximizing the objective function, we can obtain an optimal feature subspace. The incremental approach is facilitated by adding one feature at a time to the set \(X^{'}\), starting from the feature that maximizes the objective function. For every feature addition, the cross-validation classification error is computed—the reduced feature space is the subspace with the least classification error. In this study, we utilize the mRMR feature selection approach as a wrapper approach, with a C4.5 DT and 10-fold cross-validation. Moreover, binning was employed to discretize the continuous data before subjecting it to mRMR feature selection.

Sometimes, the mRMR approach generates high error rates (as high as \(34\%\)). Moreover, mRMR only considers pair-wise interactions (see Eq. 10); by considering higher-order interactions, we can obtain more informative subspaces. Maximum Joint Relevance (MJR) (Yang and Moody 2000) and adaptive MJR (Jiao et al. 2015) are a few of the modified mRMR algorithms aimed at tackling these shortcomings.

4.3.7 Feature extraction using principal component analysis (PCA)

PCA is an unsupervised approach that converts a set of observations of (possibly) correlated variables into a set of values of uncorrelated variables (principal components) using orthogonal transformations (Pearson 1901; Wold et al. 1987). PCA aims at maximizing the variance of the data in the new dimensional space. PCA produces the same number of orthogonal dimensions as the initial data, but what makes PCA interesting is that the eigenvalues corresponding to the eigenvectors (principal components) monotonically decrease as we move away from the first principal component. A dimension with an eigenvalue of approximately zero (zero variance) does not provide any information in the original space and can be considered irrelevant.Footnote 9

PCA usually provides the best reconstruction, i.e., the loss of information from the transformation is minimal, which can be attributed to the fact that PCA only performs linear transformations. PCA makes the strong assumption of a linear relationship between the observed variables, and also that all the data points are Independent and Identically Distributed (IID). Consider PCA for a single-dimension subspace, where \(X \in \mathbb {R}^d\) and {\(x_1\), \(x_2\), \(\dots\), \(x_n\)} are IID samples of X (\(d \ll n\)). We aim at maximizing \(u^T \Sigma u\) subject to \(u^Tu = 1\), where \(\Sigma\) is the covariance matrix (\(\in \mathbb {R}^{d \times d}\)) and u is a principal component (\(\in \mathbb {R}^{d}\)). Using Lagrange multipliers (Klein 2004), we obtain \(\Sigma u = \lambda u\), for some \(\lambda\); thus, u is an eigenvector of \(\Sigma\), with the eigenvalue \(\lambda\).

The preprocessing steps in PCA include zeroing out the mean of the data and normalizing the variance of each coordinate, to ensure that all coordinates are measured on the same scale. Then, we compute \(\Sigma\), followed by the computation of its eigenvalues and eigenvectors. If we intend to project the data into a \(k\)-dimensional space (\(k < d\)), we should choose the top-\(k\) eigenvectors of \(\Sigma\), i.e., {\(u_1\), \(u_2\), \(\dots\), \(u_k\)}, which then form the basis of the new orthogonal space. Any given data point \(X \in \mathbb {R}^d\) can be represented in the new basis as:

$$\begin{aligned} X^{'} = U^T X = \left[ u_1^TX,\; u_2^TX,\; \cdots ,\; u_k^TX\right] ^T, \quad \text {where } U = \left[ u_1, u_2, \cdots , u_k\right] \text { and } X = \left[ x^{(1)}, x^{(2)}, \cdots , x^{(d)}\right] ^T \end{aligned}$$
(11)

Since all the dimensions in the projected space are orthogonal, we can ensure that the resulting variables are uncorrelated. PCA is comparatively fast, owing to the ease of computing eigenvectors (Hand 2007). Furthermore, PCA aids interpretability and visualization. In this study, we only retained those principal components that accounted for \(90\%\) of the variance.

4.3.8 Using python for feature selection (extraction)

In this section, we explain how to obtain an optimal feature subspace from the given feature space through the LowVar, HighCorr, FI, mRMR, and PCA approaches, using Python. The low variance filter and high correlation filter can be implemented by following the procedures in Algorithm 3 and Algorithm 4, respectively. Alternatively, the implementations in the Python pandas.corr (for the high correlation filter) and sklearn.feature_selection.VarianceThreshold (for the low variance filter) can be utilized to achieve the same (see Block 8).

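Block 8 below is a minimal sketch of the two filters; the pandas DataFrame df of numeric email features is assumed, and the thresholds shown (0.01 and 0.9) are illustrative rather than prescriptive.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def low_variance_filter(df, threshold=0.01):
    """Drop features whose variance falls below `threshold` (LowVar)."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    return df.loc[:, selector.get_support()]

def high_correlation_filter(df, threshold=0.9):
    """Drop one feature of every pair whose absolute pair-wise
    correlation exceeds `threshold` (HighCorr)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# df is assumed to be a pandas DataFrame of the engineered email features.
reduced_df = high_correlation_filter(low_variance_filter(df))
```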

To obtain the importance of the features in the obtained feature space using the RF classifier, we utilized the implementation available in the Python sklearn.ensemble.RandomForestClassifier library. The code in Block 9 elucidates the implementation details concerning the computation of the FI. Note that the code presented here utilizes 100 classification and regression trees with a maximum depth of 2.

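Block 9—a minimal sketch, assuming X (a DataFrame of email features) and y (the class labels):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 100 classification and regression trees with a maximum depth of 2.
forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
forest.fit(X, y)

# FI as the mean decrease in Gini impurity, sorted in descending order.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```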

To implement the mRMR approach in Python, we utilize the implementations in the pymrmr library. The code in Block 10 details the process of feature selection using mRMR. The code presented here takes as input a discretized dataframe, a method of internal selection (MID or MIQ), and the value of k (the number of dimensions). To discretize a continuous attribute (\(X^{(i)}\)) based on two thresholds, we use \(\texttt {Mean}(X^{(i)}) \pm (\psi \times \texttt {Var}(X^{(i)}))\), where \(\psi\) can be 0, 0.5, or 1 (Peng et al. 2005).

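Block 10—a minimal sketch, assuming the pymrmr package and a DataFrame df whose first column holds the class label (the layout pymrmr expects); the discretization helper follows the \(\texttt {Mean} \pm (\psi \times \texttt {Var})\) rule described above.

```python
import pymrmr

def discretize(col, psi=0.5):
    """Three-level discretization with thresholds Mean(col) +/- psi * Var(col),
    where psi is 0, 0.5, or 1 (Peng et al. 2005)."""
    lo = col.mean() - psi * col.var()
    hi = col.mean() + psi * col.var()
    return col.apply(lambda v: -1 if v < lo else (1 if v > hi else 0))

# df is assumed to hold the class label in its first column and the
# continuous features in the remaining columns.
discretized_df = df.copy()
discretized_df.iloc[:, 1:] = discretized_df.iloc[:, 1:].apply(discretize)

# 'MIQ' (mutual-information quotient) or 'MID' (difference); 10 is k,
# the number of features to retain.
selected_features = pymrmr.mRMR(discretized_df, 'MIQ', 10)
print(selected_features)
```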

Finally, to perform PCA and find the directions of maximum variance using Python, we employ the implementations in the sklearn.decomposition.PCA library (see Block 11). Upon fitting the PCA model, the principal components and eigenvalues can be accessed via the components_ and explained_variance_ attributes.

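Block 11—a minimal sketch, assuming X is the mean-centered, variance-normalized feature matrix:

```python
from sklearn.decomposition import PCA

# Passing a float in (0, 1) retains just enough components to explain
# the requested fraction of the variance (90% here).
pca = PCA(n_components=0.90)
X_transformed = pca.fit_transform(X)

print(pca.components_)          # principal directions (eigenvectors)
print(pca.explained_variance_)  # corresponding eigenvalues
```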

5 Methods: email classification

In recent years, most researchers have resorted to machine learning approaches to detect and differentiate between ham, spam, and phishing emails. Machine learning algorithms facilitate experience-based learning, resulting in the automatic generation of adaptive classification rules, in turn enhancing the performance. Such adaptive and automated approaches outperform blacklisting or rule-based filtering approaches, which rely on hand-coded rules susceptible to the changing nature of spam and phishing email attacks. In this section, we review eight state-of-the-art machine learning algorithms employed in UBE classification. The Python code is presented in-line with the text, to aid readers in implementing the classifiers.

5.1 Classification using naïve Bayes (NB)

The NB classifier exemplifies both supervised learning and statistical learning. NB serves as a straightforward probabilistic approach that classifies the input email data by inferring the probabilities of the outcomes. Bayesian classification merges the experimental data with prior knowledge, and can solve both predictive and analytical problems. Furthermore, the NB algorithm is robust to noise and computes likelihoods for its hypotheses. Note that the NB classifier is based on the Bayes theorem with a strong assumption of independence among the features (Wu and Deng 2008; Issac and Jap 2009). The posterior probability of a class is given as:

$$\begin{aligned} Pr (\text {class}|(x_1, x_2, \dots , x_n))&= \frac{ Pr ((x_1, x_2, \dots , x_n)\text { and class})}{ Pr ((x_1, x_2, \dots , x_n))} \\&= \frac{ Pr (\text {class})}{ Pr ((x_1, x_2, \dots , x_n))} \prod _{i = 1}^{n} Pr (x_i|\text {class}) \end{aligned}$$
(12)

where n denotes the number of features in the feature space. Since the value \(Pr ((x_1, x_2, \dots , x_n))\) is a constant, the classification rule can be rewritten as:

$$\begin{aligned} Pr (\text {class}|(x_1, x_2, \dots , x_n)) \propto Pr (\text {class}) \prod _{i = 1}^{n} Pr (x_i|\text {class}) \end{aligned}$$
(13)
$$\begin{aligned} \hat{y } = \mathop {\mathrm{arg\,max}}\limits _{\text {class}} Pr (\text {class}) \prod _{i = 1}^{n} Pr (x_i|\text {class}) \end{aligned}$$
(14)

The notion of class-conditional independence was utilized to ensure the ease of computation, thus tagging the Bayesian classification as naïve—nevertheless, the classifier is robust, effective, and computationally efficient. Owing to its promising performance, the NB classifier has been adopted to solve several real-world tasks, including spam detection, recommender systems, and sentiment analysis (social media analytics). Additionally, due to its superior performance in multi-class problems, it has been extensively adopted for text classification tasks. It is interesting to note that Bayesian spam filters have been widely implemented by many email clients—the server-side email filters that underpin the effective performance of email clients often employ Bayesian filtering. Generally, a Gaussian NB classifier is utilized to accommodate numerical features, where the likelihood of the features is assumed to be Gaussian (normally distributed):

$$\begin{aligned} Pr (x_i|\text {class}) = \frac{1}{\sqrt{2\pi \,\texttt {Var}(\text {class})}} \cdot \exp \left( -\frac{(x_i - \mu _{\text {class}})^2}{2\,\texttt {Var}(\text {class})}\right) \end{aligned}$$
(15)

However, in this study, we employ the supervised discretization approach to discretize the continuous attributes as it overcomes the assumption of the normality of continuous features.

To facilitate the classification of UBEs using the NB classifier, we utilize the implementations in the Python sklearn.naive_bayes.GaussianNB library, as shown in Block 12.

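Block 12—a minimal sketch, assuming X and y hold the reduced feature matrix and the class labels; the 10-fold cross-validation mirrors the evaluation protocol used in this study.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
scores = cross_val_score(nb, X, y, cv=10, scoring='accuracy')
print('Mean 10-fold accuracy: %.4f' % scores.mean())
```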

5.2 Classification using support vector machines (SVM)

The SVM classifier is a supervised learning algorithm that solves both regression and classification problems, and is proven to be superior in performance to several attendant learning algorithms (Sculley and Wachman 2007). The applications of SVM include solving quadratic programming problems with linear equality and inequality constraints, differentiating groups using hyperplanes. Despite the higher training time in comparison to several other classifiers, the SVM classifier facilitates promising results, owing to its capacity to model multi-dimensional borderlines that are neither straightforward nor sequential. Furthermore, the classifier model is not disproportionately complex, in the sense that the number of trainable parameters is lower than the number of observations, thus making SVM an ideal fit for real-world tasks like speech and handwriting recognition.

To understand the SVM classifier, let us consider the simple case of a binary classification problem, with features x and target classes \(y \in \{-1, +1\}\), where the data points are linearly separable. Let us consider two support vectors (forming a street) passing through the data points lying closest to the decision surface (hyperplane), and a vector \(\bar{w}\) that is perpendicular to the median line of the street. Ultimately, we need to find the support vectors that maximize the street width, thus finding the optimal decision surface. For an unknown sample \(\bar{u}\), by measuring the projection of the unknown sample onto the perpendicular vector, we can determine whether the sample is positive (\(y = +1\)) or negative (\(y = -1\)), i.e., \(\bar{w} \cdot \bar{u} \ge c\), or equivalently (setting \(b = -c\)) \(\bar{w} \cdot \bar{u} + b \ge 0\), for a positive sample. Now, for a positive training sample (\(x_+\)), we have \(\bar{w} \cdot \bar{x_+} + b \ge 1\), and likewise, for a negative training sample (\(x_-\)), we have \(\bar{w} \cdot \bar{x_-} + b \le -1\). So,

$$\begin{aligned} y^{(i)}(\bar{w} \cdot \bar{x^{(i)}} + b) - 1 \ge 0 \end{aligned}$$
(16)

where \(y^{(i)} = 1\) for positive samples (\(y = +1\)) and \(y^{(i)} = -1\) otherwise. Let \(x_+^{(c)}\) and \(x_-^{(c)}\) be the points on the support vectors, note that, \(y^{(i)}(\bar{w} \cdot x^{(c)} + b) - 1 = 0\) for \(x^{(c)} \in \{x_+^{(c)}, x_-^{(c)}\}\). Now, we can compute the street width as:

$$\begin{aligned} \text {width} = (x_+^{(c)} - x_-^{(c)}) \cdot \frac{\bar{w}}{||\bar{w}||_2} = \frac{2}{||\bar{w}||_2} \end{aligned}$$
(17)

Now, we transform the optimization problem from maximizing the street width, to:

$$\begin{aligned} \text {max}\text { }\frac{2}{||\bar{w}||_2} \text { (or) }\text {min}\text { }||\bar{w}||_2\text { (or) }\text {min}\text { }\frac{1}{2}||\bar{w}||_2^2 \end{aligned}$$
(18)

Now, using Lagrange multiplier \(\alpha _i\) (constrained to be \(\ge\) 0), we have the Lagrangian as:

$$\begin{aligned} {\mathcal{L}}(\bar{w}, b, \alpha ) = \frac{1}{2}\bar{w}^\texttt {T}\bar{w} - \sum \alpha _i [y^{(i)}(\bar{w} \cdot \bar{x^{(i)}} + b) - 1] \end{aligned}$$
(19)

Now, by differentiating with respect to \(\bar{w}\) and b, we get:

$$\begin{aligned} \frac{\partial {\mathcal{L}}}{\partial \bar{w}} = \bar{w} - \sum \alpha _i y^{(i)} \bar{x^{(i)}} = 0 \implies \bar{w} = \sum \alpha _i y^{(i)} \bar{x^{(i)}} \end{aligned}$$
(20)
$$\begin{aligned} \frac{\partial {\mathcal{L}}}{\partial b} = - \sum \alpha _i y^{(i)} = 0 \end{aligned}$$
(21)

Using Eqs. 20 and 21 in Eq. 19, we can simplify the Lagrangian as:

$$\begin{aligned} {\mathcal{L}}(\bar{w}, b, \alpha ) = \sum \alpha _i - \frac{1}{2} \sum \limits _i \sum \limits _j (\alpha _i \alpha _j) (y^{(i)} y^{(j)}) (\bar{x^{(i)}} \cdot \bar{x^{(j)}}) \end{aligned}$$
(22)

Now, using Eq. 20 in the decision rule of the unknown sample (\(\bar{u}\)) to be a positive sample, we get:

$$\begin{aligned} \sum \alpha _i y^{(i)} \bar{x^{(i)}} \cdot \bar{u} + b \ge 0 \end{aligned}$$
(23)

From Eqs. 22 and 23, we observe that the decision rule depends on the dot product of the sample vectors and the unknown vector. Now, when the data points are not linearly separable, we transform (using function \(\phi\)) the data points to a space where they are separable, i.e.,

$$\begin{aligned} {\mathcal{K}}(\bar{x^{(i)}}, \bar{x^{(j)}}) = \phi (\bar{x^{(i)}}) \cdot \phi (\bar{x^{(j)}}) \end{aligned}$$
(24)

Note that all we need to know is the kernel function \(\mathcal{K}\) (e.g., linear, Radial Basis Function (RBF), or sigmoid) that facilitates the transformation into the new space, rather than the transformation itself. In this study, we employ the SVM classifier with an RBF kernel and a cost factor of 32 (obtained empirically using grid search). The cost factor regulates the penalty on misclassified training samples, thus controlling how closely the decision function is fit to the data points.

To facilitate the classification of UBEs using the SVM classifier, we utilize the implementations in the Python sklearn.svm.SVC library, as shown in Block 13.

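Block 13—a minimal sketch under the same assumptions on X and y; the cost factor of 32 follows the grid-search result reported above.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=32)
scores = cross_val_score(svm, X, y, cv=10, scoring='accuracy')
print('Mean 10-fold accuracy: %.4f' % scores.mean())
```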

5.3 Ensemble classifiers

Ensemble learning is an approach of grouping several classifiers for training on the input data, intended to improve the classification performance. Researchers have advocated the assembling of various classifiers to handle UBE attacks effectively (Guerra et al. 2010). In this study, we employ six widely used ensembling approaches to facilitate UBE classification.

5.3.1 Classification using bagged decision trees (BDT)

A DT is a supervised learning approach that decomposes complex problems into a hierarchy of simpler ones. The internal nodes of a DT pave the way to the final decision rule, at each level adding to the previous decision rule, while the leaf nodes associate an output (class label) with the input features. Sometimes, a DT tends to overfit the data, owing to the stringent decision rules at various levels of the tree. To cope with this issue, bootstrap-aggregated (bagged) DTs combine the results of several DT classifiers. This approach enhances generalizability and is hence adopted in a variety of tasks including spam detection and power system fault detection. The BDT classifier is effective in mapping more than one parameter to a target variable (Netsanet et al. 2018) and hence is extremely useful in UBE classification.

To understand the process of bagging, let us consider the training set T to be \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots , (x^{(n)}, y^{(n)})\}\), where \(x^{(i)} \in X\) and \(y^{(i)} \in \Omega = \{l_1, l_2, \dots , l_k\}\). A classifier \(\mathcal{C}\) aims at mapping from T to a decision rule (\(\hat{f}\)), which then maps X to \(\Omega\), i.e., \(\mathcal{C}(T) = \hat{f}\) and \(\hat{f}(x) \in \Omega\). Now, a bootstrap sample \(T_b = \{x_b^{(i)}, y_b^{(i)}\}_{i=1}^{n}\) is obtained through independent draws from T, with replacement. The obtained \(T_b\) produces the decision rule \(\hat{f_b} = {\mathcal{C}}(T_b)\), and the final bootstrap-aggregated estimate \(\hat{F_{b}}\) is computed as the majority vote of all the B bootstrap predictors:

$$\begin{aligned} \hat{F_{b}} = \mathop {\mathrm{arg\,max}}\limits _{y \in \Omega } \sum \limits _{b = 1}^{B} I_{\{y = \hat{f_b}(x)\}} \end{aligned}$$
(25)

where \(I_{\{M\}}\) is the indicator of M. Intuitively, bagging serves as a variance reduction process that mimics the procedure of averaging over various training sets. In this study, we employ BDT classifier with 100 C4.5 DT estimators. Moreover, we employ the Gini impurity in the measurement of the quality of the split.

To facilitate the classification of UBEs using the BDT classifier, we utilize the implementations in the Python sklearn.ensemble.BaggingClassifier library (we used the Python sklearn.tree.DecisionTreeClassifier library to implement the DT classifier), as shown in Block 14.

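Block 14—a minimal sketch; note that scikit-learn's CART implementation stands in for the C4.5 trees described above.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 bagged trees, each a Gini-based DT, trained on bootstrap samples.
bdt = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(criterion='gini'),
    n_estimators=100,
    random_state=0)
bdt.fit(X, y)
```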

5.3.2 Classification using random forest (RF)

While the BDT classifier is effective in classification, the trees produced by a BDT classifier can be very similar, thus slowing down the learning process. The RF classifier overcomes this shortcoming by employing two sources of randomness: bagging and random input vectors. RF uses DT classifiers to facilitate the prediction of the target variable. The RF classifier has been shown to have better performance (lower error rate) than several learners such as SVM and DT in several classification tasks, including speech analysis and UBE detection. Furthermore, RF performs well even in cases of disproportionate data characterized by missing variables, by approximating the missing data and preserving the precision even when a significant amount of data is lost.

To understand the process of classification using RF, let us consider the training set T to be \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots , (x^{(n)}, y^{(n)})\}\), where \(x^{(i)} \in X\) (\(X \in \mathbb {R}^p\)) and \(y^{(i)} \in \Omega = \{l_1, l_2, \dots , l_k\}\). Now, a bootstrap sample \(T_b = \{x_b^{(i)}, y_b^{(i)}\}_{i=1}^{n}\) is obtained through independent draws from T, with replacement. The obtained \(T_b\) is used to generate an RF tree \(Tr _b\). At every node of \(Tr _b\), we choose m out of p features (the optimal value is \(\sqrt{p}\)) and select the splitting attribute among the m selected features using IG or the Gini impurity. Then, we split the current node based on the chosen splitting attribute. This procedure is recursively repeated until the minimum node size \(n_{\text {min}}\) is reached. Ultimately, the classification is facilitated as:

$$\begin{aligned} \hat{y}(x) = \text {majority vote} \left\{ y_b(x)\right\} _{b=1}^B \end{aligned}$$
(26)

In this study, we employ the RF classifier with 100 C4.5 DT classifiers, and the nodes of the trees are expanded until all the leaf nodes contain fewer than two samples or until all the leaf nodes are pure. Moreover, we employ the Gini impurity in the measurement of the quality of the split. The RF classifier is implemented using the procedure in Algorithm 5.

Algorithm 5: Procedure for constructing the RF classifier (pseudocode)

To facilitate the classification of UBEs using the RF classifier, we utilize the implementations in the Python sklearn.ensemble.RandomForestClassifier library, as shown in Block 15.

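Block 15—a minimal sketch with the parameters described above (100 trees, Gini impurity, nodes expanded until pure or holding fewer than two samples):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    criterion='gini',
    max_depth=None,        # grow until leaves are pure
    min_samples_split=2,   # or hold fewer than two samples
    random_state=0)
rf.fit(X, y)
```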

5.3.3 Classification using extra trees (ET)

The extremely randomized trees classifier was aimed at randomizing the tree building further, in the context of numerical input attributes, where the choice of the optimal cut-point (discretization threshold) is responsible for a large proportion of the variance induced in the tree. Experiments (Geurts et al. 2006) have shown that the ET classifier is competitive with the RF classifier in terms of accuracy, and sometimes superior (especially when the data is noisy). Moreover, since the need for the optimization of discretization thresholds is removed in ET classifiers, they are computationally fast and easy to implement. The ET classifier has yielded state-of-the-art results in various high-dimensional complex problems.

The ET classifier is similar to an RF classifier in the sense that both algorithms choose m (out of p, optimally \(\sqrt{p}\)) features at each node to determine the split. However, unlike an RF classifier, an ET classifier learns from the entire learning sample T (no bootstrap copying) or a sample drawn from T without replacement. More importantly, instead of choosing the best cut-point based on the local sample as in BDT or RF, an ET classifier randomly selects the cut-point to determine the split. It is interesting to note that the algorithm is primarily reliant on the value of m; when \(m = 1\), the resulting extra tree structure is built independently of the target class labels in the training set. From a statistical perspective, dropping the randomization through bagging leads to an advantage concerning the bias, while cut-point randomization often leads to an excellent reduction in the variance. From a functional perspective, the ET approach facilitates piece-wise multi-linear approximations, as opposed to the piece-wise constant approximations of RF classifiers. In this study, we employ the ET classifier with 100 C4.5 DT classifiers, and the nodes of the trees are expanded until all the leaf nodes contain fewer than two samples or until all the leaf nodes are pure. Moreover, we employ the Gini impurity in the measurement of the quality of the split.

To facilitate the classification of UBEs using the ET classifier, we utilize the implementations in the Python sklearn.ensemble.ExtraTreesClassifier library, as shown in Block 16.

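Block 16—a minimal sketch, mirroring the RF configuration above:

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(
    n_estimators=100,
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    random_state=0)
et.fit(X, y)
```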

5.3.4 Classification using AdaBoost (AB)

The adaptive boosting algorithm is a meta-estimator that combines several weak decision rules into one strong decision rule, and is shown to provide good performance even with the unsatisfactory performance of the individual weak learners. By convention, a strong learner is the one with an error rate close to zero, while a weak learner is the one with an error rate just below 0.5. AB is widely adopted, owing to the astounding performance of the algorithm in a wide variety of classification tasks, including UBE classification and text categorization. Furthermore, AB is straightforward, adaptive, fast, easy to program, and less cumbersome (due to minimal parameter tuning).

To understand the AB classifier, let us consider the simple case of a two-class problem, with training samples \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots , (x^{(n)}, y^{(n)})\}\), where \(x^{(i)} \in X\) and \(y^{(i)} \in \{-1, +1\}\). In each round \(t = 1, 2, \dots , T\), we compute a distribution \(D_t\) over the n training samples. A weak learner is utilized to compute a weak hypothesis \(h^t\), where the weak learner aims at generating \(h^t\) with low weighted error \(\mathcal{E}_t\) relative to \(D_t\). At every step, the distribution is normalized using a factor \(Z_t\), to ensure that \(D_{t+1}\) is a distribution. The final hypothesis H(x) computes the overall majority vote (sign) of all the weak learners through a weighted combination of weak hypotheses, where each hypothesis \(h^t\) is weighted by \(\alpha ^t\). The entire procedure for the AB algorithm is shown in Algorithm 6. Alternatively, for multi-class (more than two classes) problems, we have Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME) (Hastie et al. 2009), which implements the multi-class Bayes rule by modeling a forward stagewise additive model. A widely used variant of SAMME is the SAMME.R algorithm (R for Real), which converges faster than SAMME and achieves a lower test error with fewer rounds. In this study, we employ the AB classifier with a C4.5 DT classifier for 100 rounds, and the SAMME.R algorithm in the case of three-class classification. The procedure shown in Algorithm 6 is employed in the implementation of the AB classifier.

Algorithm 6: The AB training procedure (pseudocode)

To facilitate the classification of UBEs using the AB classifier, we utilize the implementations in the Python sklearn.ensemble.AdaBoostClassifier library, as shown in Block 17.

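Block 17—a minimal sketch with 100 boosting rounds of DT weak learners and the SAMME.R variant:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ab = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=100,
    algorithm='SAMME.R',
    random_state=0)
ab.fit(X, y)
```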

5.3.5 Classification using stochastic gradient boosting (SGB)

The AB and related (step-wise) classifiers are categorized under the adaptive re-weighting and combining statistical framework, where the objective is to minimize the weighted error, followed by a re-computation of the weak hypotheses. Gradient boosting machines enhance this framework further by casting the process of boosting as a numerical optimization, with the objective of loss minimization through the addition of weak learners using the steepest gradient algorithm. In the SGB approach, we add one new weak learner at a time, while the existing weak learners are left unchanged, thus facilitating a stage-wise additive approach. The SGB algorithm is related to both bagging and boosting: many small trees are built sequentially from the gradient of the loss function of the previous tree (the pseudo-residuals). At each round, a tree is built from a random bootstrap sample (drawn without replacement), resulting in an incremental improvement in the model. Thus, the SGB algorithm is computationally fast, resistant to outliers, and avoids over-fitting of the data, and is hence adopted in a variety of applications including microscopy image analysis and slate deposit estimation.

To understand the working of the SGB classifier, let us first understand a naïve formalization of gradient boosting. Let us consider the training set T to be \(\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots , (x^{(n)}, y^{(n)})\}\), where \(x^{(i)} \in X\) and \(y^{(i)} \in \Omega = \{l_1, l_2, \dots , l_k\}\). A classifier \(\mathcal{C}\) aims at mapping from T to a decision rule (\(\hat{f}\)), which then maps X to \(\Omega\), i.e., \(\mathcal{C}(T) = \hat{f}\) and \(\hat{f}(x) = y \in \Omega\). First, let us fit a model to T, i.e., \(\hat{f_0}(x) = y\). Now, let us fit another model \(\hat{h_0}\) to the residuals obtained, i.e., \(\hat{h_0}(x) = y - \hat{f_0}(x)\). Now, in the subsequent round, create a stage-wise additive model to correct the errors of the previous model as \(\hat{f_1}(x) = \hat{f_0}(x) + \hat{h_0}(x)\). Now, let us generalize this idea for R rounds as:

$$\begin{aligned} \hat{f_0}(x) \mapsto \hat{f_1}(x) = \hat{f_0}(x) + \hat{h_0}(x) \mapsto \cdots \mapsto \hat{f_R}(x) \end{aligned}$$
(27)
$$\begin{aligned} \hat{f_R}(x) = \hat{f_{R-1}}(x) + \hat{h_{R-1}}(x) \end{aligned}$$
(28)

At each step r, we aim at finding \(\hat{h_r}(x) = y - \hat{f_r}(x)\). In practice, \(\hat{h_r}\) is almost always a tree-based classifier. Now, let us tweak the model to conform to the actual SGB classifier; since we aim at minimizing the loss function (L), let us initialize \(\hat{f}\) with the mean of the target classes in T, i.e.,

$$\begin{aligned} \hat{f_0}(x) = \mathop {\mathrm{arg\,min}}\limits _\gamma \sum \limits _{i = 1}^{n} L(y^{(i)}, \gamma ) \end{aligned}$$
(29)

Now, we recursively define each subsequent \(\hat{f_r}\) (\(r \ge 0\)) as \(\hat{f_r}(x) = \hat{f_{r-1}}(x) + \hat{h_{r-1}}(x)\), where \(\hat{h_{r-1}}(x)\) is a classifier that aims at fitting the residuals (\(\sigma _{r-1}\)) (computed as the gradient of the loss function), i.e.,

$$\begin{aligned} \sigma _{r-1} = -\frac{\partial L(y, \hat{f_{r-1}}(x))}{\partial \hat{f_{r-1}}(x)} \end{aligned}$$
(30)

The final learner obtained after R rounds (\(\hat{f_R}\)) is the trained SGB classifier. In this study, we employ an SGB learner with a C4.5 DT classifier of maximum depth two (\(\hat{h}(x)\)), trained for 100 rounds. Moreover, we employ deviance as the loss function, which measures the goodness of fit.

To facilitate the classification of UBEs using the SGB classifier, we utilize the implementations in the Python sklearn.ensemble.GradientBoostingClassifier library, as shown in Block 18.

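Block 18—a minimal sketch with 100 rounds of depth-2 trees and the deviance loss:

```python
from sklearn.ensemble import GradientBoostingClassifier

sgb = GradientBoostingClassifier(
    loss='deviance',
    n_estimators=100,
    max_depth=2,
    random_state=0)
sgb.fit(X, y)
```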

5.3.6 Classification using voting ensemble (VE)

A voting ensemble classifier is a naïve approach to aggregating the predictions of a variety of diverse classifiers using a majority rule. For a set of classifiers \(C^r\)s (total R classifiers) trained on the same training data (\(T = \{x^{(i)}, y^{(i)}\}_{i=1}^n\), \(y^{(i)} \in \Omega\)), we have predictions (\(y^r\)s) such that \(C^r(x) = y^r\), where \(y^r \in \Omega\). Now, the final classification is facilitated as:

$$\begin{aligned} \hat{y}(x) = \text {majority vote} \left\{ y^r\right\} _{r=1}^R \end{aligned}$$
(31)

Such voting is often referred to as the hard voting scheme. In this study, we employ a VE classifier with seven diverse classifiers, namely Gaussian NB, logistic regression, ID3 DT, RF, ET, AB, and SGB (with the parameters described in the above sections). Additionally, we tested the plurality voting scheme; however, the majority voting scheme outperformed it.

To facilitate the classification of UBEs using the VE classifier, we utilize the implementations in the Python sklearn.ensemble.VotingClassifier library, as shown in Block 19.

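Block 19—a minimal sketch combining the seven constituent classifiers under hard (majority) voting; the hyperparameters of the constituents are simplified here for brevity.

```python
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

ve = VotingClassifier(
    estimators=[('nb', GaussianNB()),
                ('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(criterion='entropy')),
                ('rf', RandomForestClassifier(n_estimators=100)),
                ('et', ExtraTreesClassifier(n_estimators=100)),
                ('ab', AdaBoostClassifier(n_estimators=100)),
                ('sgb', GradientBoostingClassifier(n_estimators=100))],
    voting='hard')
ve.fit(X, y)
```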

5.4 WEKA workbench for machine learning

Apart from Python programming, the Waikato Environment for Knowledge Analysis (WEKA) workbench (Hall et al. 2009) is recognized as a landmark system in machine learning and data mining. It provides a toolbox of learning algorithms, along with a framework for the development of novel algorithms, without the burden of the supporting infrastructure for scheme evaluation and data manipulation.

The WEKA project aims to provide a comprehensive collection of data preprocessing and machine learning algorithms for practitioners and researchers. It facilitates easy and quick comparison of several machine learning algorithms on datasets. Furthermore, the WEKA graphical user interface enables beginners to seamlessly perform data preprocessing, regression, classification, clustering, feature selection, association rule mining, and visualization. The WEKA tool has achieved widespread acceptance in business and academia alike, and has become a widely adopted tool for research in data mining. Table 7 tabulates the capabilities of the machine learning and feature selection approaches employed in this study, with respect to the WEKA workbench.

Table 7 Capabilities of the algorithms concerning WEKA workbench

6 Performance evaluation and discussion

To evaluate the efficacy of the utilized feature selection (extraction) and machine learning algorithms in spam and phishing email detection, we performed extensive experimentation on the datasets described in Table 6. All the experiments in this study were performed on a PC with an Intel Core i7 (2.5 GHz) processor and 16 GB RAM, running macOS 10.14. Furthermore, all the experiments were carried out through 10-fold cross-validation, and the overall performance was computed as the average across all the folds. In this section, we first discuss the evaluation metrics employed in this study and their relevance concerning UBE detection. Then, we present the results of our experimentation, followed by a discussion on the implications of the presented results.

6.1 Performance evaluation metrics

Most of the works in the existing literature employ classification accuracy as the key performance indicator (see Table 2). However, only measuring the number of correctly classified email messages is not sufficient, owing to the costs attached to the misclassification of UBEs; other metrics derived from information retrieval and decision theory (e.g., precision and recall) can help gain better insights into the obtained results. When a spam email message is misclassified as a ham email, it causes a rather insignificant problem (the user only needs to delete such an email). However, when ham emails are misclassified as spam or phishing emails, there is a possibility of losing vital information (specifically in scenarios where spam emails are deleted automatically), while phishing emails that are misclassified as ham emails result in a breach of privacy (a much more serious concern). Moreover, in scenarios with imbalanced data (such as in our case), accuracy does not capture all the relevant aspects of the classification inference. In this study, we employ seven standard evaluation metrics, namely accuracy, precision, recall, F1-measure (F1 score), Matthews correlation coefficient (MCC) score, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC), to accurately assess the performance in our extensive evaluation.

Accuracy: This metric aims at evaluating the average number of correctly classified email messages over the given email corpus. The classification accuracy can be computed using:

$$\begin{aligned} \text {Accuracy} = \frac{|{\mathcal{H}} \rightarrow {\mathcal{H}}| + |{\mathcal{S}} \rightarrow {\mathcal{S}}| + |{\mathcal{P}} \rightarrow {\mathcal{P}}|}{N_{\mathcal{H}} + N_{\mathcal{S}} + N_{\mathcal{P}}} \end{aligned}$$
(32)

where \(\mathcal{M}\) denotes the email type (\(\mathcal{M} = \mathcal{H}\) for ham, \(\mathcal{M} = \mathcal{S}\) for spam, and \(\mathcal{M} = \mathcal{P}\) for phishing), and \(N_{\mathcal{M}}\) denotes the number of email messages of type \(\mathcal{M}\). Also, \(|{\mathcal{M}} \rightarrow {\mathcal{M}}'|\) denotes the number of email messages of type \(\mathcal{M}\) that are classified as \(\mathcal{M}'\). It is necessary to note that in Dataset\(_1\), \(|{\mathcal{S}} \rightarrow {\mathcal{H}}|\) occurrences (false-negative events (misses)) are inexpensive mistakes, while \(|{\mathcal{H}} \rightarrow {\mathcal{S}}|\) occurrences (false-positive events (false alarms)) are a more serious concern. However, in Dataset\(_2\), both \(|{\mathcal{H}} \rightarrow {\mathcal{P}}|\) and \(|{\mathcal{P}} \rightarrow {\mathcal{H}}|\) incur the same cost. Hence, in Dataset\(_1\), metrics that account for false positives, such as the precision of UBEs, the recall of ham emails, the F1-measure, the MCC score, AUROC, or AUPRC, are more appropriate.

Precision: This metric computes the positive predictive value (reliability or worth of the UBE filter) by measuring the true positives and false positives. Precision aims at measuring the number of relevant results, i.e., what proportion of ham email identifications were actually ham in nature. For a given email type \(\mathcal{M}\), it can be computed as:

$$\begin{aligned} \text {Precision}({\mathcal{M}}) = \frac{|{\mathcal{M}} \rightarrow {\mathcal{M}}|}{|{\mathcal{M}} \rightarrow {\mathcal{M}}| + |\lnot {\mathcal{M}} \rightarrow {\mathcal{M}}|} \end{aligned}$$
(33)

The precision is computed for individual email types, and the overall precision is computed as the weighted average of the individual components as:

$$\begin{aligned} \text {Precision} = \frac{\text {Precision}({\mathcal{M}}) \cdot N_{\mathcal{M}} + \text {Precision}(\lnot {\mathcal{M}}) \cdot N_{\lnot {\mathcal{M}}}}{N_{\mathcal{M}} + N_{\lnot {\mathcal{M}}}} \end{aligned}$$
(34)

Precision (of UBEs) is more appropriate in measuring the performance of Dataset\(_1\), where false-positive events cost more than false-negative events. However, it is not very appropriate in measuring the performance of Dataset\(_2\), where both false positives and negatives incur the same cost. Hence, we need metrics that incorporate both false positives and negatives, to obtain a generalized performance metric.

Recall: This metric evaluates the sensitivity (effectiveness of the UBE filter) by measuring the number of UBE messages that the filter succeeded in preventing from reaching the email inbox of the user. For a given email type \({\mathcal{M}}\), it can be computed as:

$$\begin{aligned} \text {Recall}({\mathcal{M}}) = \frac{|{\mathcal{M}} \rightarrow {\mathcal{M}}|}{|{\mathcal{M}} \rightarrow {\mathcal{M}}| + |{\mathcal{M}} \rightarrow \lnot{\mathcal{M}}|} \end{aligned}$$
(35)

The recall is computed for individual email types, and is aggregated using Eq. 34. As discussed earlier, recall (of ham emails) is appropriate in measuring the performance of Dataset\(_1\), while in Dataset\(_2\), where false negatives are equally as important as false positives, recall is inappropriate.

F1 score: This metric seeks a balance between the precision and recall, and is interpreted as the weighted harmonic mean of the precision and recall. It differs from accuracy in the sense that accuracy accounts only for true positives and negatives, while neglecting false positives and negatives. The F1 (\(F_{(\beta =1)}\)) score can be computed as:

$$\begin{aligned} F_{(\beta =1)} = (1 + \beta ^2) \frac{\text {Precision} \cdot \text {Recall}}{(\beta ^2 \cdot \text {Precision}) + \text {Recall}} \end{aligned}$$
(36)

Since F1-measure uses both false positives and negatives by capturing precision and recall, it serves as a generalized metric for both Dataset\(_1\) and Dataset\(_2\). However, F1-measure does not account for the true negative occurrences (e.g., \(|{\mathcal{S}} \rightarrow {\mathcal{S}}|\)).

MCC score: This metric serves as a balanced measure even in scenarios of class-imbalanced data (such as in our case) by measuring the True and False Positives and Negatives (TP, TN, FP, and FN). The MCC score captures the correlation between the predicted and the observed classifications. The MCC score can be computed as:

$$\begin{aligned} \text {MCC} = \frac{\text {TP} \cdot \text {TN} - \text {FP} \cdot \text {FN}}{\sqrt{(\text {TP} + \text {FN}) \cdot (\text {TP} + \text {FP}) \cdot (\text {TN} + \text {FN}) \cdot (\text {TN} + \text {FP})}} \end{aligned}$$
(37)

Since MCC score accounts for true and false positives and negatives, it serves as a more generalized metric than F1-measure, in evaluating the performance of the underlying machine learning approaches.

Area under the ROC curve (AUROC): The ROC probability curve is a graphical plot of sensitivity (Eq. 35) against fall-out (1 − specificity; see Eq. 38). The AUROC metric measures the capability of a model to distinguish between classes. A greater value of AUROC indicates that the underlying UBE filter is better able to distinguish between ham, spam, and phishing emails.

$${\text{Specificity}}({\mathcal{M}}) = \frac{{|\neg {\mathcal{M}} \to \neg {\mathcal{M}}|}}{{|\neg {\mathcal{M}} \to \neg {\mathcal{M}}| + |\neg {\mathcal{M}} \to {\mathcal{M}}|}}$$
(38)

Although AUROC effectively captures the hit and miss rates, it does not vary with the change in the ratio of the target classes, and hence is not very inferential in scenarios with imbalanced data.

Area under the precision-recall curve (AUPRC): The precision-recall curve is a graphical plot of precision (Eq. 33) against recall (Eq. 35). A higher value of AUPRC signifies that the underlying model minimizes misclassifications and false alarms. When dealing with skewed datasets (such as in our case), the AUPRC reveals more informative insights concerning the performance of the underlying model, in comparison to AUROC (Saito and Rehmsmeier 2015).
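As an illustration, the metrics above can be computed with scikit-learn as sketched below; y_true and y_pred are assumed to be the observed and predicted labels aggregated over the folds, y_score the positive-class probabilities (binary case), and average_precision_score is used as a summary of the precision-recall curve.

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred, average='weighted'))
print('Recall   :', recall_score(y_true, y_pred, average='weighted'))
print('F1 score :', f1_score(y_true, y_pred, average='weighted'))
print('MCC      :', matthews_corrcoef(y_true, y_pred))
print('AUROC    :', roc_auc_score(y_true, y_score))
print('AUPRC    :', average_precision_score(y_true, y_score))
```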

6.2 Results and discussion

In this section, we report the results of our exhaustive experimentation on the spam and phishing datasets in Table 6. Note that Dataset\(_3\) has the maximum number of samples and classes among the obtained datasets, and is hence utilized as the representative sample subject to feature selection (extraction). The feature subspace obtained using Dataset\(_3\) was then employed in Dataset\(_1\) and Dataset\(_2\), to facilitate accurate filtering of spam and phishing emails. Table 8 tabulates the performance of various machine learning algorithms (see Sect. 5) in the classification of spam emails of Dataset\(_1\) using the email features obtained through feature selection (see Sect. 4.2) on the feature space of Dataset\(_3\). Similarly, the performance of the machine learners on Dataset\(_2\), using the features selected from Dataset\(_3\), is summarized in Table 9. It is important to point out that PCA facilitates feature extraction rather than feature selection, through a linear transformation of the input data. Table 10 shows the performance of the machine learning classifiers using PCA-transformed Dataset\(_3\). From Tables 8, 9, and 10, it is interesting to note that the RF classifier consistently outperforms all other machine learners. Such superior performance can be attributed to the ability of RF to perform well and generalize even in cases of disproportionate data, through bagging and random input vectors. Additionally, we remark that the features selected using FI-based feature selection (using RF) on Dataset\(_3\), when classified using an RF classifier, outperform those obtained using other feature selection approaches (\(98.4\%\) accuracy and \(99.8\%\) AUPRC on Dataset\(_1\), and \(99.4\%\) accuracy and \(99.9\%\) AUPRC on Dataset\(_2\); see Tables 8 and 9)—FI (using RF) measures the usefulness of the features in the construction of the RF tree, and since the RF classifier is able to learn and generalize the underlying UBE data, it is only natural that FI (using RF) accounts for the highest performance.

Fig. 2: A dotted heatmap mapping the occurrence frequency of the features (feature space in Table 4) in the utilized feature selection methods

Table 8 Performance evaluation of various machine learning classifiers in the classification of spam emails (Dataset\(_1\)) using the email features obtained from the feature selection on Dataset\(_3\)
Table 9 Performance evaluation of various machine learning classifiers in the classification of phishing emails (Dataset\(_2\)) using the email features obtained from the feature selection on Dataset\(_3\)
Table 10 Performance evaluation of various machine learning classifiers in the classification of spam and phishing emails (Dataset\(_3\)) with and without PCA transformation

From the analysis of the features selected by the utilized feature selection techniques, it can be noted that features such as body_html, body_forms, subject_bank, sender_numWords, url_numLinks, url_numImgLinks, url_linkText, url_maxNumPeriods, and url_nonModalHereLinks are selected by all the feature selection techniques (LowVar, HighCorr, FI, and mRMR), whereas certain features such as subject_numWords, subject_numCharacters, and subject_richness are never selected. Figure 2 depicts a dotted heatmap that captures the occurrence frequency of the features (feature space in Table 4) in the utilized feature selection techniques. It is worth noting that the occurrence frequency employed in Fig. 2 uses a naïve counting scheme; a more advanced and informed decision concerning the information of a feature can be drawn using a weighted occurrence frequency scheme that accounts for the position of a feature in the ranked feature subspace (Gangavarapu and Patil 2019). Intuitively, the weighted occurrence frequency captures the importance of a feature \(f_i\) over \(\{f_{i+1}, f_{i+2}, \dots , f_{k-i+1}\}\) in the selected k-dimensional feature subspace.

Fig. 3: The effect of increasing dimensions on the average training time
Fig. 4: The effect of increasing dimensions on the accuracy of the RF classification
Fig. 5: The effect of increasing dimensions on the precision of the RF classification
Fig. 6: The effect of increasing dimensions on the recall of the RF classification
Fig. 7: The effect of increasing dimensions on the F1-measure of the RF classification
Fig. 8: The effect of increasing dimensions on the MCC score of the RF classification
Fig. 9: The effect of increasing dimensions on the AUROC of the RF classification
Fig. 10: The effect of increasing dimensions on the AUPRC of the RF classification

Note the superior performance of various classifiers utilizing all the features in all three datasets—this can be explained by the informative and discriminative capabilities of the chosen feature space with respect to the underlying email corpus. The effect of increasing dimensions on the classification time is shown in Fig. 3. From Fig. 3, it can be remarked that, with the increase in the dimensionality of the data, we observe an increase in the time taken to classify the email messages. It must be noted that the average build (training) time utilized in this paper (in Tables 8, 9, and 10, and in Fig. 3) is computed as the average of the runtimes of all eight utilized machine learning algorithms. It is worth mentioning that the RF classifier is scalable to high-dimensional data, and several variants of the RF classifier that utilize the MapReduce algorithm further improve the scalability and efficiency of classification (Han et al. 2013). Since the RF classifier outperforms other machine learning approaches, the subsequent analysis is only presented with respect to RF classification. The effect of increasing dimensions on the classification performance with respect to various feature selection approaches is depicted in Figs. 4, 5, 6, 7, 8, 9, and 10. It can be remarked that the features selected using Dataset\(_3\) model the data from Dataset\(_2\) better than that from Dataset\(_1\). From Tables 8, 9, and 10, and from Figs. 4, 5, 6, 7, 8, 9, and 10, we observe that PCA yields the lowest performance in the case of all the datasets (Dataset\(_1\) with 19 dimensions, Dataset\(_2\) with 22 dimensions, and Dataset\(_3\) with 24 dimensions). Such low performance can be attributed to the fact that PCA is an unsupervised feature extraction approach whose main objective is to maximize the variance. As explained earlier, the ‘usefulness’ and ‘relevance’ of a feature are not interchangeable, i.e., a relevant feature does not warrant usefulness and vice versa. Thus, filters that only aim at maximizing the variance often ignore the usefulness of the chosen features, which in turn impacts the classification performance. This fact is clearly corroborated by the lower performance of the LowVar filter in Dataset\(_1\) with 27 dimensions.

7 Summary

Feature engineering and machine learning are indispensable in building any intelligent system. In this study, we surveyed various aspects of feature engineering in spam and phishing email detection. Moreover, we detailed various attempts by researchers to mitigate the menace of UBEs through the use of machine learning classifiers. In general, the volume of existing literature evaluated in this study corroborates the significant progress that has been and will be made in the field of spam and phishing email detection. In this research, we employed forty informative and discriminative content-based and body-based features selected in accordance with the underlying email corpus. First, we elucidated the process of extracting the discriminative feature space from the raw email corpus. Then, we leveraged five widely used feature selection (extraction) approaches to engender an optimal feature subspace, to improve the classification performance and eliminate the noise in the data. We presented an exhaustive comparative study of several state-of-the-art machine learning classifiers that facilitate UBE filtering and classification. Furthermore, we explained the key performance indicators vital in the accurate assessment of the performance of the underlying UBE filters. We observed that the feature subspace determined by the FI-based feature selection approach (using RF), when classified using an RF classifier, resulted in an overall accuracy of \(98.4\%\) on the ham−spam dataset (AUPRC of \(99.8\%\)) and \(99.4\%\) on the ham−phishing dataset (AUPRC of \(99.9\%\)). Additionally, to enhance the understanding of the readers, we presented snippets of Python code, in-line with the text, enabling them to benefit from the existing email data.

Despite the extensive research in the field of UBE detection and filtering, certain issues remain to be addressed. These include the lack of an effective strategy to handle security attacks on UBE filters, the inability of the current UBE filters to tackle the concept drift phenomenon, and the lack of effective UBE filters that utilize graphical features. In the future, we aim at improving the effectiveness of the proposed approaches by addressing the aforementioned open issues. Additionally, we aim at exploring adversarial learning approaches to learn and adapt to concept drifts effectively.