86.1 Introduction

Feature selection [1] is a technique for finding a minimal subset of f features from the original F features of an email spam dataset. Some features are relevant to spam detection, while others are irrelevant or redundant; features may be continuous, discrete, or nominal. The goal is to identify the best attributes of the dataset, that is, the lowest-dimensional feature set that removes irrelevant data and thereby improves accuracy and performance. The selected subset should achieve the same or better results than the full feature set.

Features that bear on spam detection and influence its outcome are known as relevant features. Features that have no impact on spam detection are irrelevant. Features that repeat information already carried by other features are redundant, even when the features themselves differ. Real-world spam datasets may contain noisy, irrelevant, or ambiguous features, which is where feature selection plays a vital role. Many algorithms, approaches, and methods are available for feature selection.

Feature selection algorithms [2] fall mainly into three categories:

  • Filter method

  • Wrapper method

  • Embedded method.

86.1.1 Filter Method

The filter method selects significant features before classification takes place. It determines the best features from the input spam emails independently of any classification algorithm, filtering the data according to a chosen criterion. The input to the filter is the set of attributes of the email dataset. Features that are significant in determining the outcome variable are selected based on the scores they obtain in various statistical tests. Because it is computationally efficient, the filter method is well suited to datasets with a large number of features. Examples of filter methods include correlation-based methods, mutual information-based methods, information gain, and the chi-squared test.
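As an illustration of score-based filtering, the following minimal sketch ranks binary word-presence features by information gain and keeps the top two. The toy data, the feature names, and the helper names (entropy, information_gain, rank_features) are assumptions made for this example and are not taken from any of the cited works.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = [(name, information_gain([row[i] for row in rows], labels))
              for i, name in enumerate(names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 1 = word present in the email, 0 = absent.
names = ["free", "meeting", "the"]
rows = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

ranking = rank_features(rows, labels, names)
top_k = [name for name, _ in ranking[:2]]  # keep the two highest-scoring features
print(ranking)
print(top_k)
```

No classifier is trained at any point: the scores depend only on the data, which is what makes the filter approach cheap when the number of features is large.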

86.1.2 Wrapper Method

The wrapper method is suitable for datasets that contain a small number of attributes. For the given email dataset, it searches for suitable attributes by training a model on candidate subsets. Attributes are added to or removed from the subset based on the dependencies among features. It tends to provide better results than the filter method, but it requires more computational resources and is therefore more appropriate for small training datasets. Examples include Sequential Forward Selection, Genetic Algorithms, Stepwise Regression, and Backward Elimination.
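The wrapper idea can be sketched with Sequential Forward Selection: starting from an empty subset, the feature that most improves a model's score is added until no candidate helps. In this sketch the model is a nearest-centroid classifier scored by leave-one-out accuracy; both choices, as well as the toy data and function names, are assumptions made for illustration only.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given columns."""
    correct = 0
    for held_out in range(len(rows)):
        sums, counts = {}, {}
        for i, (row, label) in enumerate(zip(rows, labels)):
            if i == held_out:
                continue  # the held-out email is not used for training
            vec = sums.setdefault(label, [0.0] * len(subset))
            for j, col in enumerate(subset):
                vec[j] += row[col]
            counts[label] = counts.get(label, 0) + 1

        def distance(label):
            # Squared Euclidean distance from the held-out row to a class centroid.
            return sum((rows[held_out][col] - sums[label][j] / counts[label]) ** 2
                       for j, col in enumerate(subset))

        predicted = min(sorted(sums), key=distance)
        correct += predicted == labels[held_out]
    return correct / len(rows)

def sequential_forward_selection(rows, labels, max_features):
    """Greedily add the feature that most improves the wrapped model's score."""
    selected, best_score = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < max_features:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the model, so stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical toy data: column 0 separates the classes, columns 1 and 2 do not.
rows = [[3, 1, 5], [4, 0, 2], [5, 1, 4], [0, 1, 5], [1, 0, 2], [0, 1, 3]]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

selected, best_score = sequential_forward_selection(rows, labels, max_features=3)
print(selected, best_score)
```

Every candidate subset requires retraining and re-evaluating the model, which is why the cost of this approach grows quickly with the number of features.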

86.1.3 Embedded Method

The embedded method has been proposed to combine the benefits of the filter and wrapper methods. In this model, a set of good features is first selected from the email spam dataset using the filter method. The wrapper method is then applied to those selected features to obtain the best features. Feature selection can be carried out by either subset selection or feature ranking: subset selection chooses the set of candidate features that forms the optimal subset under a given criterion, whereas feature ranking orders the features according to the criterion. Weighted Naïve Bayes, Sequential Forward Selection, and Artificial Neural Networks are examples of embedded methods.

86.2 Overview of Email Spam Detection

In recent years [3], email has become a platform that is used extensively for communication over the Internet. It is an electronic messaging system used to transfer messages from one user to another. Spam, which is sent in bulk to a large number of recipients, is the major concern with email. Spam email is also called junk mail. Spammers usually collect email addresses from websites in order to spread malware and send phishing emails that steal confidential user data. Spam consumes email storage space and wastes users' time in opening and deleting the junk emails.

86.2.1 Dataset

Various datasets for detecting spam mails are available in the UCI Repository. These datasets contain both spam and ham (legitimate) mails. Some of them [4] are listed in Table 86.1.

Table 86.1 Detailed analysis of datasets used in spam email classification

86.2.2 Evolutionary Algorithms

An evolutionary algorithm (EA) uses a nature-inspired approach to optimization. It mimics the behavior of living organisms to solve a problem and is inspired by concepts from Darwinian evolution and modern genetics. EAs are intended for problems that would take too long to solve by exhaustive processing. An EA has four broad steps: Selection, Mutation, Crossover, and Accepting. Fit members survive and reproduce, while unfit members perish and do not contribute to later generations. Given a population of individuals, environmental pressure causes natural selection, which raises the fitness of the population. A fitness measure is applied to candidate solutions that are initially created at random. Based on this measure, the better candidates undergo recombination or mutation and are passed on to the next generation. The recombination operator is applied to parents and produces children. The mutation operator is applied to a single candidate and produces one new candidate. The new candidates then compete with the old ones, according to their fitness, for a place in the next generation. This process is iterated until a sufficiently good solution is found or a computational limit is reached.
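The four steps can be sketched as a small genetic algorithm over feature masks, where each individual is a list of bits and bit i set to 1 means that feature i is kept. This is a minimal sketch under stated assumptions: the fitness function, the set of "relevant" features, and all parameter values are hypothetical, and the use of tournament selection with elitism is one common design rather than the only one. In practice the fitness would typically be the validation accuracy of a classifier trained on the masked features.

```python
import random

# Hypothetical ground truth for this toy example: only features 0, 2 and 5 help.
RELEVANT = {0, 2, 5}

def toy_fitness(mask):
    """Reward relevant features that are switched on; penalize every extra one."""
    hits = sum(mask[i] for i in RELEVANT)
    extras = sum(mask) - hits
    return hits - 0.5 * extras

def evolve(fitness, n_bits, pop_size=20, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Start from a randomly created population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # Selection: binary tournament, the fitter of two random members wins.
            parents = [max(rng.sample(population, 2), key=fitness) for _ in range(2)]
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, n_bits)
            child = parents[0][:cut] + parents[1][cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        # Accepting: children replace the old generation, but the best
        # individual found so far is always carried over (elitism).
        champion = max(children, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion
        children[0] = best
        population = children
    return best

best_mask = evolve(toy_fitness, n_bits=8)
selected_features = [i for i, bit in enumerate(best_mask) if bit]
print(best_mask, selected_features, toy_fitness(best_mask))
```

Because the best individual is carried over unchanged, the best fitness never decreases from one generation to the next, although there is no guarantee that the global optimum is reached within the fixed number of generations.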

86.2.3 Classification Algorithms

After the best attributes have been selected from an email spam dataset, machine learning algorithms are used to perform classification. Various algorithms are available to separate spam from non-spam emails, including Decision Tree, AdaBoostJ48, Naïve Bayes (NB), Support Vector Machine (SVM), Random Forests (RF), Neural Networks, and Multi-Layer Perceptron (MLP). Performance is then evaluated using metrics such as accuracy, sensitivity, and specificity.
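The three metrics follow directly from the confusion matrix. The sketch below treats spam as the positive class, so that sensitivity is the fraction of spam caught and specificity is the fraction of ham correctly let through; the labels and predictions are made-up values for illustration.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def evaluate(actual, predicted, positive="spam"):
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Sensitivity (recall): share of actual spam that was flagged.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Specificity: share of actual ham that was left alone.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical results on ten emails: 4 spam followed by 6 ham.
actual = ["spam"] * 4 + ["ham"] * 6
predicted = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam", "ham"]

metrics = evaluate(actual, predicted)
print(metrics)
```

For spam filtering, specificity deserves particular attention, since a low value means that legitimate mail is being misclassified as spam.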

86.3 Related Works

Research works that predict spam email by employing various feature selection techniques to select the best features for classification are discussed in this section.

Email is an instantaneous method of exchanging information over the Internet. Spam mails contain malicious code that steals the user's personal information and infects the user's system by spreading viruses. To reduce the consequences of spam mail, Vrinda Sharma [5] proposed Term Frequency-Inverse Document Frequency (TF-IDF) and Information Gain for efficient feature selection. The output of these two feature selection techniques is fed to four classification algorithms, namely Support Vector Machine, Naïve Bayes, K-Nearest Neighbors (KNN), and Random Forest. The approach is tested on the DBWorld E-Mails, LingSpam, and Enron datasets. Random Forests and SVM provide better results, but SVM takes more time. Naïve Bayes and KNN, however, show improvement in terms of accuracy and time.
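For readers unfamiliar with TF-IDF, the weighting can be sketched in a few lines. This is not the implementation used in [5]; it assumes whitespace tokenization, relative term frequency, and an unsmoothed natural-log inverse document frequency, and the example messages are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        if not tokens:
            weights.append({})  # an empty message has no terms to weight
            continue
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical messages.
documents = [
    "win free money now",
    "free offer win prize",
    "meeting agenda for monday",
    "project meeting notes",
]
weights = tf_idf(documents)
print(weights[0])
```

A term that occurs in fewer documents ("money") receives a higher weight than one shared by several documents ("free"), and a term present in every document would receive a weight of zero under this unsmoothed variant.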

Spam is a major concern on today's Internet. Reshma Varghese [6] classified spam emails using four feature extraction techniques together with machine learning algorithms: Bag-of-Words (BoW), Bigram Bag-of-Words, Part-of-Speech (PoS) Tag, and Bigram PoS Tag are recommended for extracting the features. The Naïve Bayes score is used to eliminate rare features. The Enron dataset is taken as the input. Features are selected by Information Gain, and a feature matrix is formed using Term Frequency-Inverse Document Frequency (TF-IDF). For classification, AdaBoostJ48, Random Forest, and a popular linear Support Vector Machine (SVM) called Sequential Minimal Optimization (SMO) are used, yielding accuracies of 0.932, 0.911, and 0.750, respectively. AdaBoost, an ensemble model, provides the best results.
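The difference between a Bag-of-Words and a Bigram Bag-of-Words representation can be shown with a short sketch. It assumes simple lowercase whitespace tokenization, which is a simplification and not necessarily the preprocessing used in [6]; the example sentence is invented.

```python
from collections import Counter

def bag_of_words(text):
    """Count individual tokens, ignoring their order."""
    return Counter(text.lower().split())

def bigram_bag_of_words(text):
    """Count pairs of adjacent tokens, which preserves some local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

message = "free money free money now"  # hypothetical email text
unigrams = bag_of_words(message)
bigrams = bigram_bag_of_words(message)
print(unigrams)
print(bigrams)
```

Bigram features capture phrases such as "free money" that single words miss, at the cost of a much larger and sparser feature space, which is one reason feature selection is applied afterwards.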

Email is one of the fastest modes of communication and is used daily by millions of people. However, as the number of email users has grown, spam has increased dramatically over the past few years. P.U. Anitha [7] proposed an efficient spam classification technique using a Naïve Bayes classifier and Compact Genetic Algorithm (CGA) on the SpamBase and LingSpam datasets. The technique consists of training and testing phases. During the training phase, the best features are selected using a hybrid of Cuckoo Search and Genetic Algorithm. After the best features have been selected, classification is performed using the Naïve Bayes algorithm. Performance was compared with existing techniques such as Particle Swarm Optimization (PSO). The comparison indicates that the proposed system using hybrid optimization provides better accuracy.

Email is one of the important ways of exchanging information, and spam is a serious concern on today's Internet, so spam emails need to be filtered. Issam Dagher [7] recommended spam filtering using Kernel Principal Component Analysis (Kernel PCA). It is implemented using a public corpus from the University of California-Irvine Machine Learning Repository. The best features are extracted using PCA. For classification using Support Vector Machine and Naïve Bayes, different training and testing sets are used. Spam mails are correctly classified over a larger number of trials, and the method takes less time compared to PCA. Kernel PCA provides the best performance in terms of accuracy. The accuracy of the Bayes detector was high, but it takes more time when classifying a large number of features.

Spam is a major problem faced by most email users, as it consumes a large amount of email storage and steals users' personal data. Therefore, a filter is needed to block these spam emails. Not all features in a dataset are relevant for spam classification, so suitable features should be extracted for further processing. Mehdi Zekriyapanah Gashti [8] chose the SpamBase, LingSpam, and PU1 datasets and applied the Harmony Search Algorithm (HSA) to select the best features. The selected features help to improve prediction accuracy. A Decision Tree is then used to classify emails based on the selected features. On the SpamBase dataset, the proposed model provides an accuracy of about 95.25%, which is better than SVM, J48, MLP, and NB. On the LingSpam and PU1 datasets, the proposed model likewise provides better results than LR, NB, and SVM.

Email has established a significant role in exchanging information because it is a fast and cost-effective means of communication. It plays a vital role in both personal and business communication. The rapid growth of email has, however, generated several issues. Over the past decades, spam emails have spread tremendously. They spread malware to users' systems and steal personal data through phishing. An efficient filter is therefore needed to separate spam from ham mails. Harjot Kaur [9] proposes MLP for classifying spam emails. MLP takes more execution time, which degrades the performance of the algorithm. As future work, a refined MLP combined with N-gram-based feature selection is proposed to remove noise and outliers from the dataset and to select the best features from the corpus.

Email as a communication tool is attacked by intruders who send unwanted spam. Several spam filtering techniques exist, but the problem persists. Masoome Esmaeili [10] addresses this issue by implementing the Bayesian method and PCA to filter spam emails from the user's inbox. Forty spam and fifty non-spam emails are considered in the training step; their features are extracted and saved with their frequencies in a local dictionary. The emails were then classified using the Bayesian method, and the result was compared across various feature selection techniques. In the preprocessing step, the ratio method was applied to the original dictionary to remove irrelevant features. A Genetic Algorithm (GA) was then applied to the modified dictionary and obtained 97.76% with 3400 features.

In this digital world, spam causes serious problems for Internet users. T. Kumaresan [11] suggested a modified Cuckoo Search called Stepsize Cuckoo Search (SCS) combined with a Support Vector Machine for spam email classification. SCS not only speeds up the convergence of the algorithm but also finds the optimal features from the SpamBase dataset. Classification is then performed using the Support Vector Machine. To assess the effectiveness of classification, three different kernels (linear, polynomial, and quadratic) are used. The proposed system is evaluated using metrics such as precision, recall, and accuracy, and the results show that it outperforms the existing classification technique.

Because it is cost-effective, email is widely used for transferring messages in personal and business communication. Spam has become a major problem because it causes unnecessary traffic and security threats. Several techniques have been deployed to block spam emails. Shashi Kant Rathore [12] proposes a hybrid of the Bayesian algorithm and swarm intelligence for recognizing spam mails. The best features from the LingSpam dataset are selected using swarm intelligence, and classification is performed using the Naïve Bayes algorithm. This approach uses static probability values for each token. An automatically trained filter could therefore also be maintained by including nature-based optimization techniques such as Artificial Bee Colony and Spider Monkey Optimization, with which the best tokens can be identified for recognizing spam mails.

In this e-world, email stands out as a means of communication on the Internet. Because of its popularity, it is misused for sending unwanted messages to large numbers of recipients. These emails are called spam emails. Spam lessens productivity, consumes extra mailbox storage, takes up a lot of time in opening and deleting mails, spreads viruses, and steals users' data through phishing emails. Spam mails therefore need to be blocked from entering the user's inbox. Shradhanjali [13] suggested a novel method using Support Vector Machine and feature extraction. The proposed system obtains an accuracy of about 98% on the test datasets.

Spam email is a severe problem for the Internet community, threatening network bandwidth and user productivity. T. Kumaresan [14] recommended a framework using S-Cuckoo and a Hybrid Kernel-based Support Vector Machine (HKSVM). First, textual features are extracted from the LingSpam dataset using Term Frequency, and for images, correlogram and wavelet moment features are taken. The optimal features are then selected using hybrid S-Cuckoo Search. After feature selection, classification is performed using HKSVM. Performance is analyzed using evaluation metrics such as precision, recall, and accuracy. Experimental results show that the proposed HKSVM performs better than other SVM-based models (Table 86.2).

Table 86.2 Performance analysis of email spam detection using feature selection and classification techniques

86.4 Conclusion

This paper provides an overview of various feature selection techniques that can be used for email spam detection. Not all attributes of an input email dataset are relevant for detecting spam: some features are relevant, while others are irrelevant. Eliminating the irrelevant features reduces execution time and improves accuracy. Feature selection selects the best attributes from the spam email dataset, reducing the dimension of the input. Classification techniques are then applied to the reduced data to classify the spam emails, which helps to improve the performance of spam detection.