1 Introduction

In recent decades, the use of technology and the Internet has grown enormously. Because email is fast, cheap, and accessible, its use has increased tremendously, and this has brought a dramatic increase in spam emails [1]. Spam emails are junk emails that are almost identical and sent to multiple recipients randomly [2]. The large-scale shift of communication to the Internet has led to the expansion of new communication services, such as email [3]. According to a recent study, over 4 billion people use email, and owing to its simplicity and accessibility, the number of users is increasing day by day. With the spread of email, there is also a rise in spam: unnecessary and undesirable bulk mail sent to many users indiscriminately. Spam not only consumes electronic storage space but also carries malware and hogs network bandwidth and computational power [4]. One study estimates the share of spam emails at approximately 85%.

As the number of spam emails increases, so does the likelihood that a user overlooks a non-spam email. Because of the lost network bandwidth and the time users spend separating normal mail from spam [5], various spam filtering techniques have been introduced. These techniques can be categorized by whether or not they use machine learning (ML) algorithms. ML algorithms provide an automated approach in which the model trains itself on features extracted from the dataset. Being easy to implement and quick to train, Naïve Bayes is a popular spam filter [6]. The main objective of this work is to compare the accuracy of four major classification algorithms, namely support vector machine (SVM), random forest classifier, Naïve Bayes, and logistic regression, and to select the best model for spam detection.

2 Literature Survey

Spam: Spam mails are unnecessary emails sent randomly and in bulk by unknown senders. They pose a major threat to user security and also consume electronic storage space. The major spam categories are listed in Table 1.

Table 1 Frequency of major spam categories and danger level caused by them

Spam classification: Email systems without spam classification are highly exposed to risks such as spyware, phishing, and ransomware [7]. Classifying such messages can therefore be seen as another defense mechanism against these dangers. In previous years, various spam identification techniques have been developed. Domain name server blacklists (DNSBL) and white lists, classification of high-volume spammers (HVSs) and low-volume spammers (LVSs), machine learning-based Web spam classification, the support vector machine classifier model, TruSMS systems, a cloud-based approach [3], and ML algorithms like Naïve Bayes, random forest classifier, and neural networks [8] are some of the classification techniques developed by earlier researchers (Table 2).

Table 2 Some previously used techniques in spam filtering and their accuracy [9–12]

2.1 Existing Approaches

The number of email users worldwide is increasing day by day and is projected to reach 4.48 billion in 2024 [13]. As the use of email increases, spam increases too. This reduces productivity, since filtering spam manually is time-consuming, and it also reduces the available electronic storage space. Spam also increases the cyber threat to users through phishing and malware attacks. Moreover, spam has been found to account for over 77% of global email traffic on a yearly basis [13].

Today, two common approaches are used for spam filtering: knowledge engineering and machine learning. In knowledge engineering, a collection of rules is used to identify mails as ham or spam. This method can waste considerable time and does not guarantee results, because the rule set needs continual updating. Thus, it is mainly used by naïve users [14].

Machine learning relies entirely on datasets. It needs only training data, and the algorithm itself learns the classification rules from the training samples. Machine learning has thus proved more effective than knowledge engineering [14]. Examples of machine learning algorithms used for spam classification include Naïve Bayes, support vector machine, artificial immune systems, neural networks, logistic regression, deep learning, and many more.

How well an algorithm performs can be checked using various evaluation techniques in machine learning. These techniques also help in recognizing overfitting and underfitting of the model. Cross-validation score, F1-score, confusion matrix, precision, recall, accuracy, regression metrics, and mean squared error can be used for evaluating the model. The three major metrics for assessing a classification model are accuracy, precision, and recall [15].

Three major methods that present spam detection systems rely on are linguistic-based (used in places like search engines), behavior-based (user-dependent, since the rules need to change from time to time), and graph-based (detecting abnormal patterns in data that reveal the behavior of spammers) [16].

3 Proposed Method

3.1 Proposed Algorithm and Workflow

Two datasets are used in this experiment to select the algorithm with the highest accuracy. Dataset 1 is the SMS Spam Collection taken from Kaggle [17]; it contains 5574 messages tagged ham or spam. Dataset 2 is a collection of emails from Apache SpamAssassin’s public datasets, available on Kaggle as the spam or not spam dataset [18]; it contains 2500 non-spam and 500 spam emails. The experiment is performed using four simple machine learning classification algorithms, namely logistic regression, support vector machine (SVM), random forest classifier, and Naïve Bayes, on a feature set prepared from the two datasets.

Through evaluation using the confusion matrix, evaluation metrics, k-fold cross-validation score, and accuracy, the model with the highest accuracy and the least underfitting or overfitting is selected. The choice of the parameter k in k-fold cross-validation and the splitting ratio of the datasets contribute significantly to assessing the accuracy of the model. The accuracy and overfitting/underfitting results are visualized using a heat map (Fig. 1).

Fig. 1 Flowchart of model

Data Preprocessing

Both datasets are taken from Kaggle [17, 18]. Dataset 1 comprises 5574 messages tagged as ham or spam. Here, the spam messages are labeled 0 and the ham messages 1 for simplicity. Dataset 2 comprises 3000 emails taken from Apache SpamAssassin’s public datasets: 2500 non-spam emails and 500 spam emails. This dataset is already labeled, with spam mails as 1 and ham mails as 0. All null values in both datasets are converted to empty strings to normalize the plain text (Fig. 2).
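The labeling and null-handling steps can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the experiment: the sample rows and the helper name `prepare` are hypothetical, and the label codes follow the convention stated above for dataset 1 (spam as 0, ham as 1).

```python
# Hypothetical rows in the shape of dataset 1: (label, message).
# None marks a missing (null) message.
raw_rows = [
    ("ham", "Are we still meeting at noon?"),
    ("spam", "WINNER! Claim your free prize now"),
    ("ham", None),
]

# Label convention stated in the text for dataset 1: spam -> 0, ham -> 1.
LABEL_CODES = {"spam": 0, "ham": 1}


def prepare(rows):
    """Return (texts, labels) with missing messages replaced by empty strings."""
    texts, labels = [], []
    for label, message in rows:
        texts.append("" if message is None else message)
        labels.append(LABEL_CODES[label])
    return texts, labels


texts, labels = prepare(raw_rows)
print(labels)          # [1, 0, 1]
print(texts[2] == "")  # True
```

For dataset 2, which is already labeled (spam as 1, ham as 0), only the null-handling step would be needed.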

Fig. 2 Visualization of labeled dataset 1 (left) and dataset 2 (right)

Feature Extraction

The feature set is prepared using a term frequency-inverse document frequency (TF-IDF) vectorizer, which converts the text to lowercase and transforms it into feature vectors. The parameter min_df is set to 1, which means that terms appearing in fewer than one document are ignored; with this value, every term that occurs in the corpus is kept. In general, min_df is used to remove terms that appear too infrequently [19]. The next parameter, stopwords, is set to English so that the corresponding stop list is applied. The parameter lowercase is set to true to convert all characters to lowercase.
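To make the effect of these parameters concrete, the following plain-Python sketch computes TF-IDF vectors by hand. It is an illustration under stated assumptions, not the vectorizer used in the experiment: the stop list is a tiny hypothetical one, the function names `tokenize` and `tfidf` are ours, and the weighting is one common smoothed variant, idf = ln((1 + n)/(1 + df)) + 1, followed by L2 normalization.

```python
import math
import re

# Tiny illustrative stop list (an assumption); a real English stop list is much longer.
STOP_WORDS = {"the", "a", "is", "to", "you", "your", "at"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def tfidf(documents, min_df=1):
    """Return (vocabulary, vectors) using smoothed TF-IDF with L2 normalization."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    # Keep only terms that occur in at least min_df documents.
    vocabulary = sorted(t for t, count in df.items() if count >= min_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        weights = [tokens.count(t) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


docs = ["Claim your FREE prize", "Free lunch at the office", "Meeting moved to Monday"]
vocab, vectors = tfidf(docs, min_df=1)
vocab_min2, _ = tfidf(docs, min_df=2)
print(vocab)       # all eight non-stop-word terms are kept
print(vocab_min2)  # ['free']
```

With min_df=1 every remaining term is kept, whereas min_df=2 retains only "free", the single term shared by two documents.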

Pipeline

To automate the workflow of producing a machine learning model and evaluation of spam detection using different algorithms, a pipeline is created. The different algorithms used in this experiment are as follows:

Logistic Regression

Logistic regression is a supervised machine learning algorithm used for solving classification problems. It is a simple yet very effective algorithm for binary classification. The basis of this algorithm is the logistic function (sigmoid function), which takes any real-valued number and maps it to a value between 0 and 1 [20].

$$\begin{aligned} {\text{Logistic function:}}\quad y & = 1/(1 + e^{-x}) \\ {\text{i.e.,}}\quad 1 + e^{-x} & = 1/y \\ e^{-x} & = (1 - y)/y \\ e^{x} & = y/(1 - y) \\ x & = \log (y/(1 - y)) \\ \end{aligned}$$
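The derivation above can be checked numerically. The short sketch below (function names `logistic` and `logit` are ours) implements the logistic function and its inverse, the log-odds, and confirms that one undoes the other.

```python
import math


def logistic(x):
    """Map any real number x to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(y):
    """Inverse of the logistic function: the log-odds of a probability y in (0, 1)."""
    return math.log(y / (1.0 - y))


print(logistic(0.0))                   # 0.5
print(round(logit(logistic(2.0)), 6))  # 2.0
```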

Naïve Bayes

Naïve Bayes is a simple probabilistic classifier based on Bayes' theorem; it calculates a set of probabilities by counting the frequency and combinations of values in the dataset [21].

$$P(A \mid B) = P(B \mid A)\,P(A)/P(B)$$

Using Bayesian probability terminology, the above equation can be written as [22]

$${\text{Posterior}} = {\text{Prior}}*{\text{Likelihood}}/{\text{Evidence}}$$
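As a worked example of the relation above, consider the probability that a message is spam given that it contains the word "free". All probabilities below are hypothetical values chosen for illustration and are not taken from either dataset; the evidence term is obtained by the law of total probability.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical probabilities for the word "free"; not taken from either dataset.
p_spam = 0.2             # prior P(spam)
p_word_given_spam = 0.5  # likelihood P("free" | spam)
p_word_given_ham = 0.05  # P("free" | ham)

# Evidence P("free") by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 4))  # 0.7143
```

Under these assumed values, observing the word raises the probability of spam from the prior of 0.2 to about 0.71.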

Random Forest Classifier

The random forest classifier uses ensemble learning and regression techniques to solve data classification problems [23]. It is a supervised machine learning algorithm that obtains a prediction from each decision tree it creates and combines these predictions into the final class.

Support Vector Machine

SVM is a supervised machine learning algorithm that classifies data points by finding an optimal separating hyperplane. The support vectors, that is, the data points closest to the hyperplane, determine the classifier margin, which the algorithm maximizes.
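The pipeline described in this section, with one pipeline per classifier, might be sketched as follows. This is a sketch under assumptions rather than the code used in the experiment: it assumes scikit-learn, which the parameter names min_df and lowercase suggest but the text does not name; the toy messages are hypothetical; and the choice of `MultinomialNB` as the Naïve Bayes variant and of default hyperparameters for every classifier is ours.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy messages; 1 = spam, 0 = ham (the convention of dataset 2).
texts = [
    "win a free prize now", "free cash offer claim today",
    "urgent winner claim your free reward", "cheap loans free money",
    "meeting moved to monday morning", "lunch with the project team",
    "please review the attached report", "see you at the office tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Assumed estimator choices with default hyperparameters.
classifiers = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}

predictions = {}
for name, classifier in classifiers.items():
    # Each pipeline chains the TF-IDF step with one classifier.
    model = Pipeline([
        ("tfidf", TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)),
        ("classifier", classifier),
    ])
    model.fit(texts, labels)
    predictions[name] = model.predict(["claim your free prize", "team meeting on monday"])
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one object means that the same preprocessing is applied at training and prediction time, and that the four algorithms can be swapped without changing the rest of the workflow.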

3.2 Performance Evaluation Criteria for Algorithm

This section measures and analyzes the accuracy of the different algorithms used in the model, to estimate how well each model fits the testing dataset. In this experiment, we compute the confusion matrix, evaluation metrics, and k-fold cross-validation to assess our model and the different algorithms.

Confusion Matrix

The table of the four outcomes produced by a binary classifier is called the confusion matrix. Measures such as error rate, accuracy, specificity, sensitivity, and precision are derived from the confusion matrix [24]. The four outcomes are true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Accuracy, recall, precision, and F-score are calculated using these four outcomes. In this experiment, we consider accuracy, recall, precision, F-score, and error rate to evaluate the models.

$${\text{Accuracy}} = ({\text{TP}} + {\text{TN}})/({\text{Total}}\;{\text{no}}{.}\;{\text{of}}\;{\text{dataset}}\;{\text{samples}})$$

Sensitivity is also known as recall or true positive rate. It is used to measure the ability of a test to be positive when the condition is present [25].

$$\begin{aligned} {\text{Sensitivity}}\;{\text{or}}\;{\text{Recall}} & = {\text{TP}}/({\text{TP}} + {\text{FN}}) \\ & = {\text{TP}}/({\text{Total}}\;{\text{positive}}) \\ \end{aligned}$$

Precision is also known as positive predictive value [25]. The value ranges from 0 to 1.

$${\text{Precision}} = {\text{TP}}/({\text{TP}} + {\text{FP}})$$

F-score is calculated with precision and recall, as follows:

$${\text{F-score}} = ({2}*{\text{precision}}*{\text{recall}})/({\text{precision}} + {\text{recall}})$$
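The formulas above translate directly into code. The sketch below computes the measures from the four confusion-matrix counts; the function name `classification_metrics` is ours, the counts are hypothetical rather than results of the experiment, and the error rate is taken as 1 minus accuracy. Zero denominators are not handled in this minimal version.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the measures defined above from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts, not results from the experiment.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts, accuracy is 0.97 while recall is only about 0.82, which illustrates why accuracy alone can be misleading on imbalanced data such as spam collections.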

K Cross-validation Score

Cross-validation is a data resampling method used to assess the generalization ability of predictive models and to guard against overfitting [25]. The parameter k is the number of sets into which the sample is split, such that no two sets have an element in common. In this experiment, the value of k is 4. cv_score_mean and cv_score_std are calculated to verify the accuracy results and to find the deviation in cv_score, respectively.
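The splitting into k disjoint sets and the two summary statistics can be illustrated in plain Python. The helper `k_fold_indices` and the per-fold scores are hypothetical, and the use of the population standard deviation for cv_score_std is an assumption, since the text does not state which estimator was used.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k disjoint, nearly equal test folds."""
    folds = []
    start = 0
    for i in range(k):
        # The first n_samples % k folds receive one extra sample.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=4)
print(folds)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]

# Hypothetical per-fold accuracies, one per fold; not results from the experiment.
fold_scores = [0.97, 0.98, 0.96, 0.97]
cv_score_mean = statistics.mean(fold_scores)
cv_score_std = statistics.pstdev(fold_scores)  # population standard deviation (assumed)
print(round(cv_score_mean, 4), round(cv_score_std, 4))
```

Each fold serves once as the test set while the remaining k - 1 folds are used for training, so every sample is tested exactly once.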

4 Results

Accuracy and precision are the important measures used in the above experiment to evaluate the different algorithms. Other measures, such as F1-score, error rate, and recall, are also calculated to compare the performance of the four algorithms on the two datasets. As shown in Fig. 3, the accuracy of the random forest classifier is the highest, followed by that of logistic regression. The random forest classifier also outperforms all the other algorithms in F1-score and precision (Fig. 4). Figures 5 and 6 depict the evaluation results of all four algorithms on datasets 1 and 2, respectively.

Fig. 3 Accuracy of dataset 1 (left) and dataset 2 (right)

Fig. 4 F1-score of dataset 1 (left) and dataset 2 (right)

Fig. 5 Evaluation of dataset 1

Fig. 6 Evaluation of dataset 2

The error rate in identifying whether a mail is spam or ham is lowest for the random forest classifier and highest for Naïve Bayes. The cv_score_std of Naïve Bayes on dataset 1 (0.0027) and of logistic regression on dataset 2 (0.0066) is the lowest of the four algorithms. The lower the cv_score_std, the lower the overfitting. However, cv_score_mean, which verifies the accuracy results, is highest for random forest: 0.9757 and 0.9673 on datasets 1 and 2, respectively.

Among all the models used, random forest has the highest accuracy: 0.9767 and 0.9717 on datasets 1 and 2, respectively.

5 Conclusion and Future Work

In this comparative analysis of machine learning algorithms for classifying emails as spam or ham on two different datasets, the random forest classifier is the best binary classifier of the four supervised algorithms. Feature extraction is done using the TF-IDF vectorizer, and a pipeline automates the workflow of training and evaluating the model with four different classification algorithms and different evaluation methods. Two different datasets are used so that the results can be analyzed on different data and the model with high accuracy and low error rate can be selected.

Future work includes assessing the model with other effective algorithms to automate the task of filtering spam and non-spam emails using different features. We propose to test the model using different feature sets on different types of datasets, in order to analyze and increase the efficiency of the prototype in identifying an email as spam or non-spam.