1 Introduction

Internet users are exposed to several threats, including theft of personal and financial information, damage to sensitive data stored on a computer, ransom demands, and unauthorized online purchases. Attackers rely on tools such as computer viruses, ransomware, and spam messages to obtain users’ private information. Spam delivered through e-mail has become a frequent vehicle for stealing user information: aiming at financial data, such e-mails carry malware attachments, fraudulent invitations, and uniform resource locator (URL) links that lead to malware-hosting and phishing websites. Spam e-mail volume has increased substantially over the past few years; the report in [2] counts 118,260, 132,553, and 139,685 phishing e-mails for the first, second, and third quarters of 2020, respectively.

Spamming refers to fraudulent activity that targets the financial and personal information of internet users, often relying on social engineering. Spam e-mails are crafted to appear as if they come from genuine, registered companies whose services the user may be using, luring the user to click the provided link for further information or verification. Once the user clicks the link, the attacker can harvest the information. Such attacks can be detected to a certain extent using available programs and models, yet changes in their design and strategy make detection more difficult and complex for whitelist- or blacklist-based techniques. The classification accuracy of such techniques degrades over time if they are not updated [13], as the strategy and structure of the attacks evolve. Similarly, the large volume of auto-generated e-mails makes spam detection time-consuming and further increases its complexity. Research indicates that of the 205 billion e-mails sent every day, approximately 22.8% are unnecessary and 18.5% are irrelevant [19].

Researchers have developed automated systems for many purposes, such as spam email detection [17, 30], health care systems [6,7,8], and anomaly detection [1, 14]. This study contributes to spam email detection using machine learning techniques. Electronic mail (e-mail) has become the most common channel for spammers to steal sensitive information [10], so developing an automatic system to detect spam email is important to safeguard individuals and companies alike. Despite the availability of several spam detection techniques, their accuracy remains unsatisfactory; moreover, most require long training times and suffer from high false positive rates. Devising an approach that can detect spam emails before they are opened is critical, and the available solutions, however sophisticated and adaptive, lack this capability. This research addresses the problem by employing machine learning algorithms and several feature extraction techniques. This study introduces an approach for spam e-mail detection and makes the following contributions

  • This study proposes an approach for spam e-mail detection using features from textual data. Two important feature extraction techniques are investigated in this regard.

  • Besides using term frequency-inverse document frequency (TF-IDF) and bag of words (BoW), an intuitive feature extraction approach, feature union, is introduced that combines TF-IDF and BoW to make an effective feature set.

  • To resolve data imbalance, experiments are performed with under-sampling, and the results are analyzed to investigate the impact of under-sampling on the performance of the machine learning and deep learning models.

  • Several machine learning models are employed for this purpose, including random forest (RF), gradient boosting machine (GBM), support vector machines (SVM), Gaussian naïve Bayes (GNB), and logistic regression (LR). The performance of the machine learning models is enhanced by optimizing several hyperparameters. Deep learning models such as long short-term memory (LSTM) and gated recurrent unit (GRU) are also adopted for spam email detection.

  • Extensive experiments are carried out and performance is evaluated using accuracy, precision, recall, F1 score, and micro average. In addition, the performance is compared with several state-of-the-art models.

The rest of the paper is organized as follows. Section 2 discusses important research works related to the current study. The proposed approach, dataset used for experiments, machine learning models, and sampling approaches are given in Section 3. Section 4 provides results and discussions while the conclusion is given in Section 5.

2 Related work

Spam emails contain advertising messages as well as URLs and file attachments designed to steal users’ personal and financial information. Advertising emails are considered legal as long as their content is not fraudulent; they are regarded as spam only if they contain unsolicited content [15]. Spammers continually devise techniques to make a spam email look legitimate and dodge email filters, and a major problem is that spam takes many forms that can pass as a legitimate message [10]. Given the importance of spam detection, a large body of research exists, and both machine learning and deep learning approaches have been adopted for spam classification. Representative studies are discussed in this section.

Researchers have applied various machine learning techniques to spam detection, and most existing statistical models rely on specific keyword patterns in emails. For example, [9] explored the major characteristics of spam by reviewing content-based spam detection techniques. Both statistical and non-statistical methods are used for spam detection; however, the statistical approaches appear to be more effective. The SMS spam collection dataset is first gathered for training, and classification is then performed using decision tree (DT), LR, and k-nearest neighbor (KNN), with LR achieving the highest accuracy of 99%. Francisco et al. [17] proposed hierarchical clustering combined with supervised learning for spam detection. The clustering algorithm is used to generate SPEMC-11K (Spam Email Classification), and emails are categorized into three distinct classes: health and technology, sexual content, and personal scams. Various combinations of TF-IDF and bag of words (BoW) feature embeddings are applied, and spam emails are classified with SVM, LR, and naïve Bayes (NB). Results indicate that NB with TF-IDF has the best classification speed, while SVM combined with TF-IDF outperforms all other combinations with the highest accuracy of 95.39%.

The study [5] used the UCI Spambase dataset for spam classification with ten state-of-the-art classifiers. Infinite latent feature selection (ILFS) is employed to select the most relevant features from the dataset, and 10-fold cross-validation is used for SVM, radial basis function (RBF), decision table (DT), Bayes net (BN), KNN, NB, random tree (RT), LR, ANN, and RF, with RF showing superior performance at 95.45% accuracy. The authors of [24] propose a framework that uses S-Cuckoo and a hybrid kernel-based SVM (HKSVM) for email spam classification. Both text and image features are extracted from emails: TF features for text data, and correlograms and wavelet moments for image data. The HKSVM model combines three different kernel functions into a hybrid kernel and achieves an accuracy of 95%. A comparative study based on data mining techniques used Fisher filtering (FF), Relief-F, stepwise discriminant analysis (StepDisc), and runs filtering for feature selection [23]. Classifiers including random tree, LDA, MLP, NB, KNN, SVM, and LR-Trials are applied for spam classification; the best-performing combination is the RF tree applied with the FF technique, which achieved 99% accuracy.

An NB approach for spam classification is presented in [30], where NB is applied to two different datasets: the UCI Spambase dataset and Spam Data. The UCI Spambase dataset is used to train the model while performance is tested on Spam Data. Results show that the number of instances and the type of email affect the performance of NB, and the classifier achieves an accuracy of 91.13%. The study [36] combined various machine learning methods into a hybrid model to enhance spam classification accuracy. Feature selection is performed with information gain, Chi-square, and gain ratio methods. The hybrid classifier uses stacking and builds a meta learner to make the predictions. It involves various combinations of sequential minimal optimization (SMO), SVM, NB, and J48 decision trees; the best accuracy of 93.22% is achieved using J48 and NB with J48 as the meta classifier.

The use of artificial neural networks (ANN) is reported to outperform traditional machine learning models in [12]. ANN is used with backpropagation (BP) and backpropagation with momentum (BP+M) on the UCI Spambase dataset [33]; the BP+M-optimized ANN performs better, with an accuracy of 95.38% and less training time. The authors of [35] use a feature-centric spam email detection model (FSEDM) with novel and existing features, including user-based, content, semantic, sentiment, and spam-lexicon features. Sentiment features are used along with the proposed features to perform the classification, and feature selection is carried out with information gain, Relief-F, and gain ratio methods. For classification, SVM, bagging, RF, AdaBoost, DNN, J48, and MLP are used; DNN shows the best performance with an accuracy of 97.2% when applied with sentiment features.

Similarly, the study [32] uses a convolutional neural network (CNN) approach for spam classification. For emails containing both text and image data, a hybrid multi-modal architecture is proposed with one CNN each for text and images. GloVe word embeddings are used for multi-modal feature fusion, and a multi-modal learned rule is proposed for spam detection, achieving 98.11% accuracy on the Enron spam dataset. An ANN with radial basis function neural networks (RBFNN) is used to classify spam e-mails in [4]. The approach combines the particle swarm optimization (PSO) algorithm with RBFNN: PSO optimizes the appropriate positions c, the singular value decomposition algorithm optimizes the weights w, and KNN optimizes the radii r. Experiments on the UCI Spambase dataset show 91.4% accuracy.

Despite the availability of sophisticated spam detection approaches, their accuracy remains unsatisfactory, and many existing approaches are neither adaptable nor robust. A comprehensive summary of the discussed research is presented in Table 1. This research aims to fill this gap by introducing an effective, high-accuracy approach for spam classification.

Table 1 Review of the discussed research works

3 Materials and methods

The proposed methodology and its working mechanism are discussed here, comprising the dataset description, the preprocessing steps followed for noise removal, the feature extraction approaches, and a brief description of the machine learning models used in this study.

3.1 Proposed methodology

Figure 1 shows the flow of the adopted methodology for ham and spam email classification. Machine learning techniques are used to solve the e-mail classification problem. The proposed approach involves data collection, data preprocessing, feature extraction, model training, and model evaluation. Data is first collected and cleaned using a sequence of preprocessing steps: numbers and punctuation are removed, followed by case conversion and stemming, and finally stop words are removed. This process reduces the feature space and improves the learning of the machine learning models. Feature extraction techniques are then applied to extract features from the cleaned data. Finally, the machine learning models are trained on the extracted features and evaluated on the test data. New data is fed to the trained models to be classified as spam or ham e-mail.

Fig. 1 Flow of the proposed methodology
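The flow in Fig. 1 can be summarised as a minimal scikit-learn pipeline. The sketch below is illustrative only: `load_clean_corpus` is a hypothetical helper returning preprocessed texts and their labels, and the vectoriser and classifier chosen here merely stand in for the configurations detailed in the following subsections.

```python
# Minimal end-to-end sketch of the flow in Fig. 1 (illustrative only).
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

texts, labels = load_clean_corpus()   # hypothetical helper: list[str], list[str]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

pipeline = Pipeline([
    ("features", TfidfVectorizer()),             # feature extraction step
    ("clf", LogisticRegression(max_iter=1000)),  # model training step
])
pipeline.fit(X_train, y_train)                               # train on extracted features
print(classification_report(y_test, pipeline.predict(X_test)))  # evaluate on test data
```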

3.1.1 Dataset description

This study uses two datasets to conduct the spam email detection experiments. Owing to the need for a large dataset, both are combined into a single dataset. Both datasets are obtained from Kaggle, although their sources differ: Dataset 1, ‘Spam or Ham - EMP Week 2 ML HW’, is acquired from Kaggle [22], and Dataset 2, ‘Spam filter’, is also acquired from Kaggle [34]. Both datasets contain two classes, one for ‘Spam’ and the other for ‘Ham’ emails. The number of records in each dataset is given in Table 2, and a few samples from both datasets are shown in Table 3.

Table 2 Number of records in datasets
Table 3 Sample from both datasets
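The two datasets can be merged along the lines of the sketch below. The file names and column names are assumptions for illustration only; the actual CSV layouts of the two Kaggle datasets may differ.

```python
# Sketch of combining the two Kaggle datasets into one corpus
# (file and column names below are hypothetical).
import pandas as pd

df1 = pd.read_csv("spam_or_ham_emp_week2.csv")   # hypothetical file name (Dataset 1 [22])
df2 = pd.read_csv("spam_filter.csv")             # hypothetical file name (Dataset 2 [34])

# Harmonise column names to 'text' and 'label', then stack the two frames.
df1 = df1.rename(columns={"email": "text", "class": "label"})       # assumed original columns
df2 = df2.rename(columns={"message": "text", "category": "label"})  # assumed original columns
combined = pd.concat([df1[["text", "label"]], df2[["text", "label"]]], ignore_index=True)

print(combined["label"].value_counts())          # expected classes: 'spam' and 'ham'
```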

3.1.2 Preprocessing

This study uses several preprocessing steps: removal of numbers, punctuation, and stop words; conversion to lower case; stemming; and lemmatization. Preprocessing is important because emails contain a lot of unnecessary raw text that can influence the models’ performance, and removing it reduces the complexity of the feature set. The individual steps are described below, followed by a short code sketch.

  • Punctuation & number removal: Emails contain punctuation and numbers that are not useful features for model training, so removing them reduces the complexity of the feature set. Regular expressions are used for this step.

  • Convert to lowercase: This step reduces redundancy in the feature set, as emails contain words in both upper and lower case, such as ‘hello’ and ‘Hello’. Such variants follow language rules and are identical for a human reader; however, feature extraction methods treat them as two different words, which enlarges the feature space. Converting all text to lowercase therefore reduces the feature-space complexity and helps improve the performance of the machine learning models. Case conversion is performed using the natural language toolkit (NLTK) library of Python.

  • Stop words removal: Text contains several stop words, such as ‘a’, ‘the’, ‘an’, ‘is’, and ‘are’, which help clarify the meaning of a sentence but are not important for training the machine learning models. Instead, they increase the complexity of the feature vector and affect the models’ performance, so they are removed to improve performance.

  • Stemming and lemmatization: Both techniques reduce words to their basic/root form, since several variations of a word may appear in sentences, such as ‘gone’, ‘going’, and ‘goes’. Although these are forms of the word ‘go’, feature extraction treats them as unique words and extracts their features separately, which increases the feature vector’s complexity. Stemming and lemmatization transform the extended forms of words into their root form. Stemming simply strips suffixes such as ‘s’ or ‘es’ from word endings and can produce misspelled or non-existent words; lemmatization is more appropriate, as it considers the context in which a word is used and maps it to a proper base form. This study uses the NLTK Porter stemmer and WordNet lemmatizer for these steps.
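The sketch below illustrates the preprocessing steps listed above using the NLTK components named in the text (Porter stemmer, WordNet lemmatizer). It is an illustration of the described steps, not the authors’ exact script; the step order follows the description in Section 3.1.

```python
# Minimal sketch of the described preprocessing steps using NLTK.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # drop numbers and punctuation
    tokens = text.lower().split()                          # case conversion and tokenisation
    tokens = [stemmer.stem(t) for t in tokens]             # stemming to a crude root form
    tokens = [lemmatizer.lemmatize(t) for t in tokens]     # lemmatisation to a valid base form
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal at the end
    return " ".join(tokens)

print(preprocess("Hello!! You have WON 1000 dollars, going to claim it?"))
```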

3.1.3 Feature union

The dataset used to train the machine learning algorithms is small, so the resulting feature set is also small, which makes model training less effective. To resolve this problem, we propose a feature union approach that combines two kinds of features into a larger feature set, which helps improve model performance. The feature union is the combination of TF-IDF and BoW features. BoW is a simple term-count technique that often produces good results when the dataset is large and complex, while TF-IDF is a weighted feature extraction technique that computes the weight of each term in the corpus and can be a good choice for models that require a large feature set. Feature union combines the TF-IDF and BoW features into a large feature set, which benefits the machine learning models, especially when only a small dataset is available. TF-IDF is computed as

$$ TF\text{-}IDF = tf_{t,d} \times \log \left( \frac{N}{D_{t}}\right) $$
(1)

where $tf_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents, and $D_{t}$ is the number of documents that contain term $t$.
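As a quick numerical illustration of (1), with values invented for the example and the logarithm taken base 10 for concreteness (the base is not fixed above): if a term occurs 3 times in a document and appears in 10 of 1,000 documents, then

$$ tf_{t,d} \times \log_{10}\left(\frac{N}{D_{t}}\right) = 3 \times \log_{10}\left(\frac{1000}{10}\right) = 3 \times 2 = 6 $$

A far more common term appearing in, say, 900 of the 1,000 documents would instead receive a weight of $3 \times \log_{10}(1000/900) \approx 0.14$, which shows how IDF down-weights terms that occur in most documents.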

TF counts the number of occurrences of each unique term in a given document, giving higher values to more common terms. IDF, on the other hand, treats rare terms as more important and assigns higher weights to terms that appear in fewer documents. For the feature union, the TF-IDF and BoW feature matrices are formed as follows

$$ TF\text{-}IDF_{features}= \begin{pmatrix} TFIDF_{11} & TFIDF_{12} & \cdots & TFIDF_{1q} \\ TFIDF_{21} & TFIDF_{22} & \cdots & TFIDF_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ TFIDF_{p1} & TFIDF_{p2} & \cdots & TFIDF_{pq} \end{pmatrix} $$
(2)
$$ BoW_{features}= \begin{pmatrix} BoW_{11} & BoW_{12} & \cdots & BoW_{1n} \\ BoW_{21} & BoW_{22} & \cdots & BoW_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ BoW_{m1} & BoW_{m2} & \cdots & BoW_{mn} \end{pmatrix} $$
(3)

The combination of weighted and simple term count features can improve the performance of learning models. The mathematical representation of feature union is shown in (4).

$$ Feature~Union= \begin{pmatrix} BoW_{11} & BoW_{12} & \cdots & BoW_{1n} & TFIDF_{11} & TFIDF_{12} & \cdots & TFIDF_{1q} \\ BoW_{21} & BoW_{22} & \cdots & BoW_{2n} & TFIDF_{21} & TFIDF_{22} & \cdots & TFIDF_{2q} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ BoW_{m1} & BoW_{m2} & \cdots & BoW_{mn} & TFIDF_{p1} & TFIDF_{p2} & \cdots & TFIDF_{pq} \end{pmatrix}_{i \times j} $$
(4)

where (2) and (3) show the TF-IDF and BoW matrices, respectively, and (4) shows the feature union, i.e., the column-wise combination of BoW and TF-IDF. In (4), $m = p = i$ and $n + q = j$. The feature union is illustrated in Fig. 2.
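In code, the union in (4) amounts to horizontally stacking the BoW and TF-IDF matrices. The sketch below shows this with scikit-learn vectorisers and scipy’s sparse `hstack`; the toy documents and default vectoriser settings are for illustration and are not necessarily those used in the experiments.

```python
# Sketch of the feature union in (4): column-wise concatenation of the
# BoW matrix (3) and the TF-IDF matrix (2).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse import hstack

corpus = ["free prize click the link now",          # toy documents for illustration
          "meeting rescheduled to monday morning"]

bow = CountVectorizer().fit_transform(corpus)       # m x n simple term counts
tfidf = TfidfVectorizer().fit_transform(corpus)     # p x q weighted features (m = p)

features = hstack([bow, tfidf])                     # i x j union, with j = n + q
print(bow.shape, tfidf.shape, features.shape)       # shapes: (2, v), (2, v), (2, 2v) for vocabulary size v
```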

Fig. 2 Schematic diagram for feature fusion

3.1.4 Under-sampling approach

This study performs under-sampling to mitigate model over-fitting. Random under-sampling is used, in which the data retained from the majority class is made almost equal in size to the minority class. The majority and minority classes in this study are ‘ham’ and ‘spam’, respectively. Data re-sampling is a widely adopted strategy for obtaining a more balanced class distribution. The random under-sampling approach randomly discards training samples from the majority class, ‘ham’, until a balanced distribution of the majority and minority classes is reached. The data ratio after re-sampling is shown in Table 4.
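A minimal sketch of random under-sampling on the training split is given below, assuming the combined data lives in a DataFrame with the ‘text’/‘label’ columns used in the earlier sketch.

```python
# Sketch of random under-sampling of the majority class ('ham') in the training data.
import pandas as pd

def random_undersample(train_df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    ham = train_df[train_df["label"] == "ham"]               # majority class
    spam = train_df[train_df["label"] == "spam"]             # minority class
    ham_down = ham.sample(n=len(spam), random_state=seed)    # randomly discard majority samples
    return pd.concat([ham_down, spam]).sample(frac=1, random_state=seed)  # shuffle the result

# balanced = random_undersample(train_df)
# print(balanced["label"].value_counts())   # roughly equal 'ham' and 'spam' counts
```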

Table 4 Number of records in the datasets after under-sampling

3.1.5 Supervised machine learning models

Several machine learning models are used in this study for the classification of ‘spam’ and ‘ham’ emails: GBM, SVM, GNB, RF, and LR. Each model is implemented with BoW, TF-IDF, and the proposed fused feature approach. The models are optimized to obtain better performance using the hyperparameters given in Table 5; a brief code sketch of this setup follows the tables. For clarity and completeness, a short description of each machine learning model is provided in Table 6.

Table 5 List of hyperparameters and their used values for experiments
Table 6 Brief description of machine learning models used in this study
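The sketch below shows how this set of models could be instantiated with scikit-learn. The hyperparameter values shown are placeholders, not the tuned values listed in Table 5.

```python
# Sketch of the supervised models used for 'spam' vs 'ham' classification
# (hyperparameter values are placeholders, not the tuned values of Table 5).
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBM": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),
    "SVM": SVC(kernel="linear", C=1.0),
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000, C=1.0),
}

# GaussianNB needs a dense matrix, the other models accept the sparse features directly:
# for name, model in models.items():
#     X = X_train_features.toarray() if name == "GNB" else X_train_features
#     model.fit(X, y_train)
```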

For a broader performance appraisal, this study also adopts deep learning models to classify emails as ham or spam. For this purpose, LSTM and GRU models are used with the best-performing architectures. Both are recurrent neural network variants and work well on text data [27].

The architecture of both models is shown in Table 7. Both models take input through an embedding layer with three parameters: the vocabulary size, which is 5000 in our case; the output dimension, which is 100; and the input length. The vocabulary size defines the largest token index the learning models can accept as input [26]. Both models include dropout layers, which reduce model complexity by randomly dropping neurons [25]. LSTM and GRU are both used with 100 units. Finally, the models are compiled with the ‘binary_crossentropy’ loss function, because this is a binary classification problem, and the ‘Adam’ optimizer [29]. The batch size is set to 32, and the models are trained for 100 epochs.

Table 7 Architectures of LSTM and GRU models
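A minimal Keras sketch of the LSTM variant described above is shown below (the GRU model simply swaps the LSTM layer for a GRU layer with the same 100 units). The dropout rate and the padded sequence length are assumptions; the vocabulary size, embedding dimension, unit count, loss, optimizer, batch size, and epoch count follow the description above.

```python
# Sketch of the described LSTM model (swap LSTM(100) for GRU(100) for the GRU model).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB_SIZE = 5000   # vocabulary size, as stated
EMBED_DIM = 100     # embedding output dimension, as stated
# the third embedding parameter, the input length, corresponds to the padded
# sequence length produced by the tokenizer (its value is an assumption here)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    Dropout(0.2),                       # dropout rate assumed; randomly drops neurons
    LSTM(100),                          # 100 recurrent units
    Dense(1, activation="sigmoid"),     # binary output: spam vs ham
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(X_train_seq, y_train, batch_size=32, epochs=100)
```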

4 Results and discussions

This section contains the results of machine learning and deep learning models for spam email classification. The performance is evaluated in terms of accuracy, precision, recall, and F1 score.
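The sketch below shows how these metrics, together with the micro average mentioned in the contributions, can be computed with scikit-learn from a model’s test-set predictions; `y_test`, `y_pred`, and the string labels ‘spam’/‘ham’ are assumptions carried over from the earlier sketches.

```python
# Sketch of computing the reported evaluation metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, pos_label="spam")
rec = recall_score(y_test, y_pred, pos_label="spam")
f1 = f1_score(y_test, y_pred, pos_label="spam")
micro_f1 = f1_score(y_test, y_pred, average="micro")   # micro average over both classes
print(f"accuracy={acc:.3f} precision={prec:.2f} recall={rec:.2f} F1={f1:.2f} micro={micro_f1:.2f}")
```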

4.1 Results of machine learning models without re-sampling

Machine learning models are implemented using BoW, TF-IDF, and feature union separately on the original dataset. Since the original dataset is imbalanced, the models may overfit the majority class; in that case accuracy is not a preferable evaluation metric, so the F1 score is used instead.

The results of machine learning models using TF-IDF features are shown in Table 8. SVM and LR perform well in terms of F1 score, achieving 0.95 and 0.94, respectively, and SVM is best in terms of accuracy with a score of 0.983. The strong performance of SVM and LR is attributable to the large, sparse feature set generated from the text data, which suits these models. The accuracy and F1 scores diverge noticeably: RF achieves 0.973 accuracy but only a 0.92 F1 score, which reflects the impact of the data imbalance. RF and GBM perform poorly compared with SVM and LR when the F1 score is considered, while GNB has the worst performance with a 0.61 F1 score.

Table 8 Results of machine learning models using TF-IDF features

Experimental results using BoW features are provided in Table 9. The models show broadly similar performance with BoW features. SVM remains the best performer with a 0.94 F1 score, which is 1% lower than its score with TF-IDF features. The primary reason is the difference in the feature vectors: TF-IDF provides weighted features, whereas BoW gives only term counts. BoW can be more suitable for tree-based and probability-based models because of their rule-based predictions, whereas linear models perform better with weighted features. GNB is again the worst performer with a 0.62 F1 score.

Table 9 Results of machine learning models using BoW features

Table 10 shows the results of machine learning models trained on the feature union. The performance of all models improves with the feature union compared to BoW and TF-IDF alone. SVM, LR, and GBM achieve a 0.95 F1 score, while RF improves its F1 score to 0.94. GNB achieves its highest F1 score on the original dataset, 0.63, with the feature union. This improvement stems from the larger feature set: feature union combines weighted and simple term-count features, which suits both linear and tree-based models and thus leads to improved performance.

Table 10 Results of machine learning models using Feature Union

4.2 Performance of machine learning models with data under-sampling

Data under-sampling is performed to obtain a more balanced distribution of the training data for both classes so that model over-fitting is alleviated. Under-sampling is carried out until the number of samples of the majority class ‘ham’ becomes almost equal to the number of samples of the minority class ‘spam’.

Table 11 contains the results of machine learning models using TF-IDF features extracted from the under-sampled data. Results suggest that the performance of the models improves significantly. SVM achieves the highest accuracy of 0.989 and the highest F1 score of 0.99. LR and RF follow, each with an accuracy of 0.983 and an F1 score of 0.98. The accuracy and F1 scores now agree closely, with no large deviations between them. GNB also improves and achieves its highest accuracy and F1 scores when used with the under-sampled data.

Table 11 Results of machine learning models using TF-IDF features and under-sampling approach

Results of the models using BoW features after under-sampling are given in Table 12. LR performs best with 0.989 accuracy and a 0.99 F1 score, followed closely by RF with 0.98 accuracy and a 0.98 F1 score. As before, tree-based and probability-based models fare better with simple term-count features than linear models do: the accuracy of SVM drops with BoW features compared to TF-IDF features.

Table 12 Results of machine learning models using BoW features and under-sampling approach

Models’ results using the feature union are shown in Table 13. The performance of the machine learning models improves further with under-sampling and feature union combined. LR and RF achieve the highest accuracy and F1 scores of 0.99 each. Here, both linear and tree-based models perform well because the feature set contains both weighted and simple term counts. These results demonstrate the combined impact of data balancing and feature union on model performance.

Table 13 Results of machine learning models using features union and under-sampling approach

4.3 Classification using deep learning models LSTM and GRU

Deep learning models are also implemented using both the original data and the data balanced by random under-sampling. Their results are shown in Table 14 and indicate that deep learning models also perform well for spam email classification. GRU and LSTM achieve the highest F1 scores on the imbalanced dataset in this study, 0.97 and 0.96, respectively, whereas the best F1 score of the machine learning models on the imbalanced dataset is 0.95, obtained by LR, RF, and SVM. Overall, however, the machine learning models perform better than the deep learning models. Deep learning models are data-intensive and require large datasets to perform well; given the size of the dataset used in this study, the machine learning models have the advantage. The highest accuracy of 0.991 is achieved by the machine learning model RF using the feature union on the under-sampled data, while LSTM and GRU both achieve an accuracy of 0.98.

Table 14 LSTM and GRU results using each re-sampling technique

4.4 Computational complexity of models

Table 15 shows the computational cost of each model using the hybrid feature set and the individual feature sets. The models require less execution time with BoW or TF-IDF alone; however, they do not reach the highest classification accuracy. Execution time increases when the models are trained on the combined BoW and TF-IDF features, since the size of the feature set grows. The computation time of the best performers, RF and LR, increases from 21.95 to 96.28 seconds and from 0.423 to 2.157 seconds, respectively. This increase in computation time is a limitation of this study: the proposed system is more accurate but incurs a higher computational cost.

Table 15 Computational time (seconds) of machine learning models
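For completeness, per-model training time of the kind reported in Table 15 can be measured as in the brief sketch below, which assumes the `models` dictionary and feature matrices from the earlier sketches.

```python
# Sketch of measuring per-model training time (as reported in Table 15).
import time

timings = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train_features, y_train)          # training on the chosen feature set
    timings[name] = time.perf_counter() - start   # elapsed seconds

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```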

4.5 Performance comparison with state-of-the-art studies

To analyze the efficiency of the proposed feature union approach, a performance comparison with recent studies on spam email classification is also carried out. For example, [11] used RF for spam email classification, and [20] used SVM for header-based spam email classification with significant results. Another study on the same task, [21], performed experiments for spam email classification using natural language processing techniques, applying several models and achieving the best results with LR and naïve Bayes (NB). Similarly, [16] used a machine learning approach based on an artificial neural network (ANN) and achieved significant accuracy. For a fair analysis, accuracy and F1 score are used to compare the current study with these studies; results are given in Table 16.

Table 16 Performance analysis with respect to state-of-the-art studies on spam email classification

5 Conclusion

Internet users are exposed to several threats, and spam emails remain a potent tool for spammers to steal users’ financial and personal information. This study proposes a machine learning-based approach for spam email detection with high accuracy. For the experiments, a hybrid dataset is created by combining two spam email datasets. To reduce the impact of data imbalance on model over-fitting, random under-sampling is applied to the majority class. Similarly, a feature fusion of BoW and TF-IDF features is proposed to raise the models’ performance. Results indicate that RF achieves the highest accuracy of 0.991 and outperforms all other models; its strong performance stems from its ensemble architecture and the proposed feature union approach. The small size of the dataset is compensated by the feature union, which enlarges the feature vector and helps improve performance. Besides RF, LR and SVM also perform well, each obtaining an accuracy of 0.99 when used with the feature union. Experiments with the LSTM and GRU deep learning models show comparatively lower performance than the machine learning models, although data under-sampling tends to improve the deep learning models’ results. This study has several limitations: the high computational time of the machine learning models with the feature union approach, and the small size of the dataset with an imbalanced target class ratio. We will address these limitations in future work and also intend to perform further experiments with over-sampling techniques to analyze their influence on deep learning models.