1 Introduction

Internet users are exposed to several threats, including theft of personal and financial information, damage to sensitive data stored on a computer, ransom demands, and unauthorized online purchases. Attackers rely on tools such as computer viruses, ransomware, and spam messages to obtain users’ private information. Spam delivered through e-mail has become a frequent vehicle for stealing user information: aiming at financial data, such e-mails carry malware attachments, fraudulent invitations, and uniform resource locator (URL) links that lead to malware-hosting and phishing websites. Spam e-mail volume has increased substantially over the past few years; the report in [2] counts 118,260, 132,553, and 139,685 phishing e-mails for the first, second, and third quarters of 2020, respectively.

Spamming refers to fraudulent activity that targets the financial and personal information of internet users, often relying on social engineering. Spam e-mails are crafted to appear as if they come from genuine, registered companies whose services the user may be using, luring the user to click the provided link for further information or verification. Once the user clicks the link, the attacker can harvest the information. Such attacks can be detected to a certain extent using available programs and models, yet changes in their design and strategy make detection more difficult and complex for whitelist- or blacklist-based techniques. The classification accuracy of such techniques degrades over time if they are not updated [13], as the strategy and structure of the attacks evolve. Similarly, the large volume of auto-generated e-mails makes spam detection time-consuming and further increases its complexity. Research indicates that of the 205 billion e-mails sent every day, approximately 22.8% are unnecessary and 18.5% are irrelevant [19].

Researchers have developed automated systems for many purposes, such as spam email detection [17, 30], health care systems [6,7,8], and anomaly detection [1, 14]. This study contributes to spam email detection using machine learning techniques. Electronic mail (e-mail) has become the most common channel for spammers to steal sensitive information [10], so developing an automatic system to detect spam email is important to safeguard individuals and companies alike. Despite the availability of several spam detection techniques, their accuracy remains unsatisfactory; moreover, most require long training times and suffer from high false positive rates. Devising an approach that can detect spam emails before they are opened is critical, and the available solutions, however sophisticated and adaptive, lack this capability. This research addresses the problem by employing machine learning algorithms and several feature extraction techniques. This study introduces an approach for spam e-mail detection and makes the following contributions

  • This study proposes an approach for spam e-mail detection using features from textual data. Two important feature extraction techniques are investigated in this regard.

  • Besides using term frequency-inverse document frequency (TF-IDF) and bag of words (BoW), an intuitive feature extraction approach, feature union, is introduced that combines TF-IDF and BoW to make an effective feature set.

  • To resolve data imbalance, experiments are performed with under-sampling, and the results are analyzed to investigate the impact of under-sampling on the performance of the machine learning and deep learning models.

  • Several machine learning models are employed for this purpose, including random forest (RF), gradient boosting machine (GBM), support vector machines (SVM), Gaussian naïve Bayes (GNB), and logistic regression (LR). The performance of the machine learning models is enhanced by optimizing several hyperparameters. Deep learning models such as long short-term memory (LSTM) and gated recurrent unit (GRU) are also adopted for spam email detection.

  • Extensive experiments are carried out and performance is evaluated using accuracy, precision, recall, F1 score, and micro average. In addition, the performance is compared with several state-of-the-art models.

The rest of the paper is organized as follows. Section 2 discusses important research works related to the current study. The proposed approach, dataset used for experiments, machine learning models, and sampling approaches are given in Section 3. Section 4 provides results and discussions while the conclusion is given in Section 5.

2 Related work

Spam emails contain advertising messages as well as URLs and file attachments designed to steal users’ personal and financial information. Advertising emails are considered legal as long as their content is not fraudulent; they are regarded as spam only if they contain unsolicited content [15]. Spammers continually devise techniques to make a spam email look legitimate and dodge email filters, and a major problem is that spam takes many forms that can pass as a legitimate message [10]. Given the importance of spam detection, a large body of research exists, and both machine learning and deep learning approaches have been adopted for spam classification. Representative studies are discussed in this section.

Researchers have applied various machine learning techniques to spam detection, and most existing statistical models rely on specific keyword patterns in emails. For example, [9] explored the major characteristics of spam by reviewing content-based spam detection techniques. Both statistical and non-statistical methods are used for spam detection; however, the statistical approaches appear to be more effective. The SMS spam collection dataset is first gathered for training, and classification is then performed using decision tree (DT), LR, and k-nearest neighbor (KNN), with LR achieving the highest accuracy of 99%. Francisco et al. [17] proposed hierarchical clustering combined with supervised learning for spam detection. The clustering algorithm is used to generate SPEMC-11K (Spam Email Classification), and emails are categorized into three distinct classes: health and technology, sexual content, and personal scams. Various combinations of TF-IDF and bag of words (BoW) feature embeddings are applied, and spam emails are classified with SVM, LR, and naïve Bayes (NB). Results indicate that NB with TF-IDF has the best classification speed, while SVM combined with TF-IDF outperforms all other combinations with the highest accuracy of 95.39%.

The study [5] used the UCI Spambase dataset for spam classification with ten state-of-the-art classifiers. Infinite latent feature selection (ILFS) is employed to select the most relevant features from the dataset, and 10-fold cross-validation is used for SVM, radial basis function (RBF), decision table (DT), Bayes net (BN), KNN, NB, random tree (RT), LR, ANN, and RF, with RF showing superior performance at 95.45% accuracy. The authors of [24] propose a framework that uses S-Cuckoo and a hybrid kernel-based SVM (HKSVM) for email spam classification. Both text and image features are extracted from emails: TF features for text data, and correlograms and wavelet moments for image data. The HKSVM model combines three different kernel functions into a hybrid kernel and achieves an accuracy of 95%. A comparative study based on data mining techniques used Fisher filtering (FF), Relief-F, stepwise discriminant analysis (StepDisc), and runs filtering for feature selection [23]. Classifiers including random tree, LDA, MLP, NB, KNN, SVM, and LR-Trials are applied for spam classification; the best-performing combination is the RF tree applied with the FF technique, which achieved 99% accuracy.

An NB approach for spam classification is presented in [30], where NB is applied to two different datasets: the UCI Spambase dataset and Spam Data. The UCI Spambase dataset is used to train the model while performance is tested on Spam Data. Results show that the number of instances and the type of email affect the performance of NB, and the classifier achieves an accuracy of 91.13%. The study [36] combined various machine learning methods into a hybrid model to enhance spam classification accuracy. Feature selection is performed with information gain, Chi-square, and gain ratio methods. The hybrid classifier uses stacking and builds a meta learner to make the predictions. It involves various combinations of sequential minimal optimization (SMO), SVM, NB, and J48 decision trees; the best accuracy of 93.22% is achieved using J48 and NB with J48 as the meta classifier.

The use of artificial neural networks (ANN) is reported to outperform traditional machine learning models in [12]. ANN is used with backpropagation (BP) and backpropagation with momentum (BP+M) on the UCI Spambase dataset [33]; the BP+M-optimized ANN performs better, with an accuracy of 95.38% and less training time. The authors of [35] use a feature-centric spam email detection model (FSEDM) with novel and existing features, including user-based, content, semantic, sentiment, and spam-lexicon features. Sentiment features are used along with the proposed features to perform the classification, and feature selection is carried out with information gain, Relief-F, and gain ratio methods. For classification, SVM, bagging, RF, AdaBoost, DNN, J48, and MLP are used; DNN shows the best performance with an accuracy of 97.2% when applied with sentiment features.

Similarly, the study [32] uses a convolutional neural network (CNN) approach for spam classification. For emails containing both text and image data, a hybrid multi-modal architecture is proposed with one CNN each for text and images. GloVe word embeddings are used for multi-modal feature fusion, and a multi-modal learned rule is proposed for spam detection, achieving 98.11% accuracy on the Enron spam dataset. An ANN with radial basis function neural networks (RBFNN) is used to classify spam e-mails in [4]. The approach combines the particle swarm optimization (PSO) algorithm with RBFNN: PSO optimizes the appropriate positions c, the singular value decomposition algorithm optimizes the weights w, and KNN optimizes the radii r. Experiments on the UCI Spambase dataset show 91.4% accuracy.

Despite the availability of sophisticated spam detection approaches, their accuracy remains unsatisfactory, and many existing approaches are neither adaptable nor robust. A comprehensive summary of the discussed research is presented in Table 1. This research aims to fill this gap by introducing an effective, high-accuracy approach for spam classification.

Table 1 Review of the discussed research works

3 Materials and methods

The proposed methodology and its working mechanism are discussed here, comprising the dataset description, the preprocessing steps followed for noise removal, the feature extraction approaches, and a brief description of the machine learning models used in this study.

3.1 Proposed methodology

Figure 1 shows the flow of the adopted methodology for ham and spam email classification. Machine learning techniques are used to solve the e-mail classification problem. The proposed approach involves data collection, data preprocessing, feature extraction, model training, and model evaluation. Data is first collected and cleaned using a sequence of preprocessing steps: numbers and punctuation are removed, followed by case conversion and stemming, and finally stop words are removed. This process reduces the feature space and improves the learning of the machine learning models. Feature extraction techniques are then applied to extract features from the cleaned data. Finally, the machine learning models are trained on the extracted features and evaluated on the test data. New data is fed to the trained models to be classified as spam or ham e-mail.

Fig. 1 Flow of the proposed methodology
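The flow in Fig. 1 can be summarised as a minimal scikit-learn pipeline. The sketch below is illustrative only: `load_clean_corpus` is a hypothetical helper returning preprocessed texts and their labels, and the vectoriser and classifier chosen here merely stand in for the configurations detailed in the following subsections.

```python
# Minimal end-to-end sketch of the flow in Fig. 1 (illustrative only).
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

texts, labels = load_clean_corpus()   # hypothetical helper: list[str], list[str]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

pipeline = Pipeline([
    ("features", TfidfVectorizer()),             # feature extraction step
    ("clf", LogisticRegression(max_iter=1000)),  # model training step
])
pipeline.fit(X_train, y_train)                               # train on extracted features
print(classification_report(y_test, pipeline.predict(X_test)))  # evaluate on test data
```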

3.1.1 Dataset description

This study uses two datasets to conduct the spam email detection experiments. Owing to the need for a large dataset, both are combined into a single dataset. Both datasets are obtained from Kaggle, although their sources differ: Dataset 1, ‘Spam or Ham - EMP Week 2 ML HW’, is acquired from Kaggle [22], and Dataset 2, ‘Spam filter’, is also acquired from Kaggle [34]. Both datasets contain two classes, one for ‘Spam’ and the other for ‘Ham’ emails. The number of records in each dataset is given in Table 2, and a few samples from both datasets are shown in Table 3.

Table 2 Number of records in datasets
Table 3 Sample from both datasets
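The two datasets can be merged along the lines of the sketch below. The file names and column names are assumptions for illustration only; the actual CSV layouts of the two Kaggle datasets may differ.

```python
# Sketch of combining the two Kaggle datasets into one corpus
# (file and column names below are hypothetical).
import pandas as pd

df1 = pd.read_csv("spam_or_ham_emp_week2.csv")   # hypothetical file name (Dataset 1 [22])
df2 = pd.read_csv("spam_filter.csv")             # hypothetical file name (Dataset 2 [34])

# Harmonise column names to 'text' and 'label', then stack the two frames.
df1 = df1.rename(columns={"email": "text", "class": "label"})       # assumed original columns
df2 = df2.rename(columns={"message": "text", "category": "label"})  # assumed original columns
combined = pd.concat([df1[["text", "label"]], df2[["text", "label"]]], ignore_index=True)

print(combined["label"].value_counts())          # expected classes: 'spam' and 'ham'
```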

3.1.2 Preprocessing

This study uses several preprocessing steps: removal of numbers, punctuation, and stop words; conversion to lower case; stemming; and lemmatization. Preprocessing is important because emails contain a lot of unnecessary raw text that can influence the models’ performance, and removing it reduces the complexity of the feature set. The individual steps are described below, followed by a short code sketch.

  • Punctuation & number removal: Emails contain punctuation and numbers that are not useful features for model training, so removing them reduces the complexity of the feature set. Regular expressions are used for this step.

  • Convert to lowercase: This step reduces redundancy in the feature set, as emails contain words in both upper and lower case, such as ‘hello’ and ‘Hello’. Such variants follow language rules and are identical for a human reader; however, feature extraction methods treat them as two different words, which enlarges the feature space. Converting all text to lowercase therefore reduces the feature-space complexity and helps improve the performance of the machine learning models. Case conversion is performed using the natural language toolkit (NLTK) library of Python.

  • Stop words removal: Text contains several stop words, such as ‘a’, ‘the’, ‘an’, ‘is’, and ‘are’, which help clarify the meaning of a sentence but are not important for training the machine learning models. Instead, they increase the complexity of the feature vector and affect the models’ performance, so they are removed to improve performance.

  • Stemming and lemmatization: Both techniques reduce words to their basic/root form, since several variations of a word may appear in sentences, such as ‘gone’, ‘going’, and ‘goes’. Although these are forms of the word ‘go’, feature extraction treats them as unique words and extracts their features separately, which increases the feature vector’s complexity. Stemming and lemmatization transform the extended forms of words into their root form. Stemming simply strips suffixes such as ‘s’ or ‘es’ from word endings and can produce misspelled or non-existent words; lemmatization is more appropriate, as it considers the context in which a word is used and maps it to a proper base form. This study uses the NLTK Porter stemmer and WordNet lemmatizer for these steps.
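The sketch below illustrates the preprocessing steps listed above using the NLTK components named in the text (Porter stemmer, WordNet lemmatizer). It is an illustration of the described steps, not the authors’ exact script; the step order follows the description in Section 3.1.

```python
# Minimal sketch of the described preprocessing steps using NLTK.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # drop numbers and punctuation
    tokens = text.lower().split()                          # case conversion and tokenisation
    tokens = [stemmer.stem(t) for t in tokens]             # stemming to a crude root form
    tokens = [lemmatizer.lemmatize(t) for t in tokens]     # lemmatisation to a valid base form
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal at the end
    return " ".join(tokens)

print(preprocess("Hello!! You have WON 1000 dollars, going to claim it?"))
```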

3.1.3 Feature union

The dataset used to train the machine learning algorithms is small, so the resulting feature set is also small, which makes model training less effective. To resolve this problem, we propose a feature union approach that combines two kinds of features into a larger feature set, which helps improve model performance. The feature union is the combination of TF-IDF and BoW features. BoW is a simple term-count technique that often produces good results when the dataset is large and complex, while TF-IDF is a weighted feature extraction technique that computes the weight of each term in the corpus and can be a good choice for models that require a large feature set. Feature union combines the TF-IDF and BoW features into a large feature set, which benefits the machine learning models, especially when only a small dataset is available. TF-IDF is computed as

$$ TF\text{-}IDF = tf_{t,d} \times \log \left( \frac{N}{D_{t}}\right) $$
(1)

where $tf_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents, and $D_{t}$ is the number of documents that contain term $t$.
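As a quick numerical illustration of (1), with values invented for the example and the logarithm taken base 10 for concreteness (the base is not fixed above): if a term occurs 3 times in a document and appears in 10 of 1,000 documents, then

$$ tf_{t,d} \times \log_{10}\left(\frac{N}{D_{t}}\right) = 3 \times \log_{10}\left(\frac{1000}{10}\right) = 3 \times 2 = 6 $$

A far more common term appearing in, say, 900 of the 1,000 documents would instead receive a weight of $3 \times \log_{10}(1000/900) \approx 0.14$, which shows how IDF down-weights terms that occur in most documents.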

TF counts the number of occurrences of each unique term in a given document, giving higher values to more common terms. IDF, on the other hand, treats rare terms as more important and assigns higher weights to terms that appear in fewer documents. For the feature union, the TF-IDF and BoW feature matrices are formed as follows

$$ TF\text{-}IDF_{features}= \begin{pmatrix} TFIDF_{11} & TFIDF_{12} & \cdots & TFIDF_{1q} \\ TFIDF_{21} & TFIDF_{22} & \cdots & TFIDF_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ TFIDF_{p1} & TFIDF_{p2} & \cdots & TFIDF_{pq} \end{pmatrix} $$
(2)
$$ BoW_{features}= \begin{pmatrix} BoW_{11} & BoW_{12} & \cdots & BoW_{1n} \\ BoW_{21} & BoW_{22} & \cdots & BoW_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ BoW_{m1} & BoW_{m2} & \cdots & BoW_{mn} \end{pmatrix} $$
(3)

The combination of weighted and simple term count features can improve the performance of learning models. The mathematical representation of feature union is shown in (4).

$$ Feature~Union= \begin{pmatrix} BoW_{11} & BoW_{12} & \cdots & BoW_{1n} & TFIDF_{11} & TFIDF_{12} & \cdots & TFIDF_{1q} \\ BoW_{21} & BoW_{22} & \cdots & BoW_{2n} & TFIDF_{21} & TFIDF_{22} & \cdots & TFIDF_{2q} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ BoW_{m1} & BoW_{m2} & \cdots & BoW_{mn} & TFIDF_{p1} & TFIDF_{p2} & \cdots & TFIDF_{pq} \end{pmatrix}_{i \times j} $$
(4)

where (2) and (3) show the TF-IDF and BoW matrices, respectively, and (4) shows the feature union, i.e., the column-wise combination of BoW and TF-IDF. In (4), $m = p = i$ and $n + q = j$. The feature union is illustrated in Fig. 2.
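In code, the union in (4) amounts to horizontally stacking the BoW and TF-IDF matrices. The sketch below shows this with scikit-learn vectorisers and scipy’s sparse `hstack`; the toy documents and default vectoriser settings are for illustration and are not necessarily those used in the experiments.

```python
# Sketch of the feature union in (4): column-wise concatenation of the
# BoW matrix (3) and the TF-IDF matrix (2).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy.sparse import hstack

corpus = ["free prize click the link now",          # toy documents for illustration
          "meeting rescheduled to monday morning"]

bow = CountVectorizer().fit_transform(corpus)       # m x n simple term counts
tfidf = TfidfVectorizer().fit_transform(corpus)     # p x q weighted features (m = p)

features = hstack([bow, tfidf])                     # i x j union, with j = n + q
print(bow.shape, tfidf.shape, features.shape)       # shapes: (2, v), (2, v), (2, 2v) for vocabulary size v
```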

Fig. 2 Schematic diagram for feature fusion

3.1.4 Under-sampling approach

This study performs under-sampling to mitigate model over-fitting. Random under-sampling is used, in which the data retained from the majority class is made almost equal in size to the minority class. The majority and minority classes in this study are ‘ham’ and ‘spam’, respectively. Data re-sampling is a widely adopted strategy for obtaining a more balanced class distribution. The random under-sampling approach randomly discards training samples from the majority class, ‘ham’, until a balanced distribution of the majority and minority classes is reached. The data ratio after re-sampling is shown in Table 4.
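A minimal sketch of random under-sampling on the training split is given below, assuming the combined data lives in a DataFrame with the ‘text’/‘label’ columns used in the earlier sketch.

```python
# Sketch of random under-sampling of the majority class ('ham') in the training data.
import pandas as pd

def random_undersample(train_df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    ham = train_df[train_df["label"] == "ham"]               # majority class
    spam = train_df[train_df["label"] == "spam"]             # minority class
    ham_down = ham.sample(n=len(spam), random_state=seed)    # randomly discard majority samples
    return pd.concat([ham_down, spam]).sample(frac=1, random_state=seed)  # shuffle the result

# balanced = random_undersample(train_df)
# print(balanced["label"].value_counts())   # roughly equal 'ham' and 'spam' counts
```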

Table 4 Number of records in the datasets after under-sampling

3.1.5 Supervised machine learning models

Several machine learning models are used in this study for the classification of ‘spam’ and ‘ham’ emails: GBM, SVM, GNB, RF, and LR. Each model is implemented with BoW, TF-IDF, and the proposed fused feature approach. The models are optimized to obtain better performance using the hyperparameters given in Table 5; a brief code sketch of this setup follows the tables. For clarity and completeness, a short description of each machine learning model is provided in Table 6.

Table 5 List of hyperparameters and their used values for experiments
Table 6 Brief description of machine learning models used in this study
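The sketch below shows how this set of models could be instantiated with scikit-learn. The hyperparameter values shown are placeholders, not the tuned values listed in Table 5.

```python
# Sketch of the supervised models used for 'spam' vs 'ham' classification
# (hyperparameter values are placeholders, not the tuned values of Table 5).
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBM": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),
    "SVM": SVC(kernel="linear", C=1.0),
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000, C=1.0),
}

# GaussianNB needs a dense matrix, the other models accept the sparse features directly:
# for name, model in models.items():
#     X = X_train_features.toarray() if name == "GNB" else X_train_features
#     model.fit(X, y_train)
```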

For a broader performance appraisal, this study also adopts deep learning models to classify emails as ham or spam. For this purpose, LSTM and GRU models are used with the best-performing architectures. Both are recurrent neural network variants and work well on text data [27].

The architecture of both models is shown in Table 7. Both models take input through an embedding layer with three parameters: the vocabulary size, which is 5000 in our case; the output dimension, which is 100; and the input length. The vocabulary size defines the largest token index the learning models can accept as input [26]. Both models include dropout layers, which reduce model complexity by randomly dropping neurons [25]. LSTM and GRU are both used with 100 units. Finally, the models are compiled with the ‘binary_crossentropy’ loss function, because this is a binary classification problem, and the ‘Adam’ optimizer [29]. The batch size is set to 32, and the models are trained for 100 epochs.

Table 7 Architectures of LSTM and GRU models
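A minimal Keras sketch of the LSTM variant described above is shown below (the GRU model simply swaps the LSTM layer for a GRU layer with the same 100 units). The dropout rate and the padded sequence length are assumptions; the vocabulary size, embedding dimension, unit count, loss, optimizer, batch size, and epoch count follow the description above.

```python
# Sketch of the described LSTM model (swap LSTM(100) for GRU(100) for the GRU model).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB_SIZE = 5000   # vocabulary size, as stated
EMBED_DIM = 100     # embedding output dimension, as stated
# the third embedding parameter, the input length, corresponds to the padded
# sequence length produced by the tokenizer (its value is an assumption here)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    Dropout(0.2),                       # dropout rate assumed; randomly drops neurons
    LSTM(100),                          # 100 recurrent units
    Dense(1, activation="sigmoid"),     # binary output: spam vs ham
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(X_train_seq, y_train, batch_size=32, epochs=100)
```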

4 Results and discussions

This section contains the results of machine learning and deep learning models for spam email classification. The performance is evaluated in terms of accuracy, precision, recall, and F1 score.
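The sketch below shows how these metrics, together with the micro average mentioned in the contributions, can be computed with scikit-learn from a model’s test-set predictions; `y_test`, `y_pred`, and the string labels ‘spam’/‘ham’ are assumptions carried over from the earlier sketches.

```python
# Sketch of computing the reported evaluation metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, pos_label="spam")
rec = recall_score(y_test, y_pred, pos_label="spam")
f1 = f1_score(y_test, y_pred, pos_label="spam")
micro_f1 = f1_score(y_test, y_pred, average="micro")   # micro average over both classes
print(f"accuracy={acc:.3f} precision={prec:.2f} recall={rec:.2f} F1={f1:.2f} micro={micro_f1:.2f}")
```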

4.1 Results of machine learning models without re-sampling

Machine learning models are implemented using BoW, TF-IDF, and feature union separately on the original dataset. Since the original dataset is imbalanced, the models may overfit the majority class; in that case accuracy is not a preferable evaluation metric, so the F1 score is used instead.

The results of machine learning models using TF-IDF features are shown in Table 8. SVM and LR perform well in terms of F1 score, achieving 0.95 and 0.94, respectively, and SVM is best in terms of accuracy with a score of 0.983. The strong performance of SVM and LR is attributable to the large, sparse feature set generated from the text data, which suits these models. The accuracy and F1 scores diverge noticeably: RF achieves 0.973 accuracy but only a 0.92 F1 score, which reflects the impact of the data imbalance. RF and GBM perform poorly compared with SVM and LR when the F1 score is considered, while GNB has the worst performance with a 0.61 F1 score.

Table 8 Results of machine learning models using TF-IDF features

Experimental results using BoW features are provided in Table 9. The models show broadly similar performance with BoW features. SVM remains the best performer with a 0.94 F1 score, which is 1% lower than its score with TF-IDF features. The primary reason is the difference in the feature vectors: TF-IDF provides weighted features, whereas BoW gives only term counts. BoW can be more suitable for tree-based and probability-based models because of their rule-based predictions, whereas linear models perform better with weighted features. GNB is again the worst performer with a 0.62 F1 score.

Table 9 Results of machine learning models using BoW features

Table 10 shows the results of machine learning models trained on the feature union. The performance of all models improves with the feature union compared to BoW and TF-IDF alone. SVM, LR, and GBM achieve a 0.95 F1 score, while RF improves its F1 score to 0.94. GNB achieves its highest F1 score on the original dataset, 0.63, with the feature union. This improvement stems from the larger feature set: feature union combines weighted and simple term-count features, which suits both linear and tree-based models and thus leads to improved performance.

Table 10 Results of machine learning models using Feature Union

4.2 Performance of machine learning models with data under-sampling

Data under-sampling is performed to obtain a more balanced distribution of the training data for both classes so that model over-fitting is alleviated. Under-sampling is carried out until the number of samples of the majority class ‘ham’ becomes almost equal to the number of samples of the minority class ‘spam’.

Table 11 contains the results of machine learning models using TF-IDF features extracted from the under-sampled data. Results suggest that the performance of the models improves significantly. SVM achieves the highest accuracy of 0.989 and the highest F1 score of 0.99. LR and RF follow, each with an accuracy of 0.983 and an F1 score of 0.98. The accuracy and F1 scores now agree closely, with no large deviations between them. GNB also improves and achieves its highest accuracy and F1 scores when used with the under-sampled data.

Table 11 Results of machine learning models using TF-IDF features and under-sampling approach

Results of the models using BoW features after under-sampling are given in Table 12. LR performs best with 0.989 accuracy and a 0.99 F1 score, followed closely by RF with 0.98 accuracy and a 0.98 F1 score. As before, tree-based and probability-based models fare better with simple term-count features than linear models do: the accuracy of SVM drops with BoW features compared to TF-IDF features.

Table 12 Results of machine learning models using BoW features and under-sampling approach

Models’ results using the feature union are shown in Table 13. The performance of the machine learning models improves further with under-sampling and feature union combined. LR and RF achieve the highest accuracy and F1 scores of 0.99 each. Here, both linear and tree-based models perform well because the feature set contains both weighted and simple term counts. These results demonstrate the combined impact of data balancing and feature union on model performance.

Table 13 Results of machine learning models using features union and under-sampling approach

4.3 Classification using deep learning models LSTM and GRU

Deep learning models are also implemented using both the original data and the data balanced by random under-sampling. Their results are shown in Table 14 and indicate that deep learning models also perform well for spam email classification. GRU and LSTM achieve the highest F1 scores on the imbalanced dataset in this study, 0.97 and 0.96, respectively, whereas the best F1 score of the machine learning models on the imbalanced dataset is 0.95, obtained by LR, RF, and SVM. Overall, however, the machine learning models perform better than the deep learning models. Deep learning models are data-intensive and require large datasets to perform well; given the size of the dataset used in this study, the machine learning models have the advantage. The highest accuracy of 0.991 is achieved by the machine learning model RF using the feature union on the under-sampled data, while LSTM and GRU both achieve an accuracy of 0.98.

Table 14 LSTM and GRU results using each re-sampling technique

4.4 Computational complexity of models

Table 15 shows the computational cost of each model using the hybrid feature set and the individual feature sets. The models require less execution time with BoW or TF-IDF alone; however, they do not reach the highest classification accuracy. Execution time increases when the models are trained on the combined BoW and TF-IDF features, since the size of the feature set grows. The computation time of the best performers, RF and LR, increases from 21.95 to 96.28 seconds and from 0.423 to 2.157 seconds, respectively. This increase in computation time is a limitation of this study: the proposed system is more accurate but incurs a higher computational cost.

Table 15 Computational time (seconds) of machine learning models
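For completeness, per-model training time of the kind reported in Table 15 can be measured as in the brief sketch below, which assumes the `models` dictionary and feature matrices from the earlier sketches.

```python
# Sketch of measuring per-model training time (as reported in Table 15).
import time

timings = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train_features, y_train)          # training on the chosen feature set
    timings[name] = time.perf_counter() - start   # elapsed seconds

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```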

4.5 Performance comparison with state-of-the-art studies

To analyze the efficiency of the proposed feature union approach, a performance comparison with recent studies on spam email classification is also carried out. For example, [11] used RF for spam email classification, and [20] used SVM for header-based spam email classification with significant results. Another study on the same task, [21], performed experiments for spam email classification using natural language processing techniques, applying several models and achieving the best results with LR and naïve Bayes (NB). Similarly, [16] used a machine learning approach based on an artificial neural network (ANN) and achieved significant accuracy. For a fair analysis, accuracy and F1 score are used to compare the current study with these studies; results are given in Table 16.

Table 16 Performance analysis with respect to state-of-the-art studies on spam email classification

5 Conclusion

Internet users are exposed to several threats, and spam emails remain a potent tool for spammers to steal users’ financial and personal information. This study proposes a machine learning-based approach for spam email detection with high accuracy. For the experiments, a hybrid dataset is created by combining two spam email datasets. To reduce the impact of data imbalance on model over-fitting, random under-sampling is applied to the majority class. Similarly, a feature fusion of BoW and TF-IDF features is proposed to raise the models’ performance. Results indicate that RF achieves the highest accuracy of 0.991 and outperforms all other models; its strong performance stems from its ensemble architecture and the proposed feature union approach. The small size of the dataset is compensated by the feature union, which enlarges the feature vector and helps improve performance. Besides RF, LR and SVM also perform well, each obtaining an accuracy of 0.99 when used with the feature union. Experiments with the LSTM and GRU deep learning models show comparatively lower performance than the machine learning models, although data under-sampling tends to improve the deep learning models’ results. This study has several limitations: the high computational time of the machine learning models with the feature union approach, and the small size of the dataset with an imbalanced target class ratio. We will address these limitations in future work and also intend to perform further experiments with over-sampling techniques to analyze their influence on deep learning models.