1 Introduction

Spam is generally an unsolicited electronic junk mail that includes text messages, images or videos and is sent without the consent of the recipients. Spam messages are broadcasted to large number of email users occupying larger bandwidth. It is not only obstructing the network traffic but also forms a base for email viruses and denial of service attacks. Moreover, spam messages contain mostly offensive and fraudulent texts that are unpleasant to the recipients. Email users are drowned with nearly 50% of spam messages with new content and new addresses in their inbox daily. Spam messages may destroy email servers with potentially harmful information and the users need to spend certain amount of time to identify and analyze spam messages in their inbox and delete them. Several spam filtering techniques are proposed to identify solicited and unsolicited messages; however, email spammers use dynamic new structures to thwart all the techniques and conceal email content. The main problem that arises with spam is that spammers devise new ways to attack the spam filters and thereby benefit from sending large amount of spams. The primary challenge is to develop a system that can deal with newly arising spams. Existing spam filtering techniques are either list-based filters or content-based filters or a collaborative response system which generally identify duplicate contents, fraudulent texts or the disreputable servers. Though these filters provide better accuracy rate in spam classification, they are prone to erroneous misclassification of hams as spams. This paper proposes a collaborative Spam detection system that uses email layouts and fingerprints to identify spam messages. Collaborative approach collects the feedback from the users regarding what mails are spams and consequently develops a model against it. The incoming emails are mapped to a known spam database using near duplicate matching scheme. Overall three key processes are involved in this spam detection approach. First, a layout abstraction set is generated from the HTML content of the ENTIRE email. Secondly, from the abstraction set, fingerprints are generated and weights are calculated for each fingerprint. Finally, the near duplicate matching process is carried out with the generated fingerprints and the detection of spam is carried out.

2 Related Work

Email spam filtering is an exploring area due to the increasing nature of spam emails. Various technologies [1, 2] have been proposed to classify legitimate mails and spam messages. Depending upon how the techniques work, previous spam detection mechanisms can be categorized under three divisions.

  1. (i)

    List Based Filtering

  2. (ii)

    Content Based Filtering

  3. (iii)

    Other Filtering methods.

List based filtering methods blocks or allows the email messages by categorizing email users as spammers or trusted users. Content based filtering works by evaluating words and sentences extracted from the email to classify under ham or spam category [3,4,5,6]. Such techniques include word-based filters, rule-based filters and probability-based filters. Naive Bayes and SVM classification methods falls under content-based filtering. Naïve Bayes [7,8,9,10,11,12] trains the filtering model using classified emails with probability value for each suspicious word. Support Vector Machine based models [10, 13, 14] are supervised learning models. Various other content-based classification techniques include Markov random field model [15,16,17], Regression models [18, 19], and Neural network models [20, 21].

Other filtering methods such as Collaborative Spam filtering [22,23,24,25,26,27,28,29] make use of users’ collective feedback reports to check for spam messages. It takes input from millions of email users. Every incoming mail is flagged as spam or ham and is reported to a central spam database. For every new spam encountered, it should be reported to the spam database by some users. Subsequent users receiving emails can query with the spam database to decide whether the message is already marked as spam. P2P-based architecture [30, 31], centralized server-based system [27, 32] are generally representatives of collaborative filtering methods.

In [22, 29, 30], digest technique generates a 32-byte code to represent the distribution of word trigrams in e-mail. Multiple digests produced from strings of fixed length sampled at random positions are discussed in [23, 28, 33]. A fingerprint-based feature vector obtained from set of checksums over a substring is proposed in [34, 35]. Collaborative method of spam filtering is given in [25]. Most of the methods generate email abstraction based on text content [36,37,38,39,40,41]. However, if a random paragraph is inserted, these methods will fail to capture such intentional spam mails. This raises the need to devise a better approach that withstands the cunning nature of spam messages.

3 Proposed Methodology

The proposed methodology introduces a new process of generating an abstraction from the email. Instead of generating the tag sequence from the message content as discussed in [25], the aim is to generate the abstraction for the entire email layout using the HTML tags. Both the < head > and < body > tags of the email content are processed to generate the abstract. Individual tags are processed to a new tag. The generated new tags are then used to generate the fingerprints. Fingerprint generation serves two purposes.

  1. (i)

    To uniquely abstract the email tags and minimize the storage space.

  2. (ii)

    For efficient duplication detection.

The fingerprints of the generated abstracts are manipulated for near duplicate matching process. The classification of Spam and Ham is finally done in the Spam Detection process. The proposed model of the system is illustrated in Fig. 1.

Fig. 1
figure 1

Abstraction based fingerprint model for Spam detection

3.1 Email Layout Abstraction

Each email is represented by its layout. Since all emails follow MIME format, HTML tags are available in MIME text/html format and thus adequate information related to structure of email can be obtained. In Email layout abstraction, each paragraph is represented as a sequence of tags. Duplicate emails are identified by comparing the HTML tag sequence of the two emails. As the initial process, tag preprocessing is done to identify and remove the tags that are in common. It also prevents spam insertion. This process eliminates mismatched non-empty tags and also tags with no corresponding end tags. The procedure for generating layout abstraction is as follows.

figure d

Mainly the emails fall under three categories. The first type of email is collected as feedback from the users and is named as reported spams. The second category emails are the one that are matched against the reported spams. These emails are called as testing emails. Finally, misclassified hams are also to be monitored and hence Email layout abstraction is done for all the three categories of emails.

3.2 Fingerprint and Weightage calculation

Fingerprints are technologies that are mainly used for biometric authorization. When used in data and information retrieval area, the fingerprints provide a precise short way of representing large information content. They are in fact tags of shorter length for a longer text. It has the advantage of near duplicate matching with minute variations. In spam detection, fingerprints are used as a digest value to represent large spam contents. After the generation of email layouts, fingerprints are generated over the segregated layout abstraction set. Fingerprints are generated functions denoted as \(f\left(x\right)=k\left(i\right)\to {\{\mathrm{0,1}\}}^{len}\), where \(k\left(i\right)\) denotes the tags of interest and \(len\) represents the length of a tag sequence. Rabin Fingerprinting scheme is used to generate the fingerprint from the abstraction. It works with an irreducible polynomial \(p\) of certain degree \(d\).Having \(p\), the fingerprint for a LAS will be computed with a function \(fun\left(LAS\right)=LAS\left(t\right)mod p(t)\).

The algorithmic steps for fingerprint generation are shown in Algorithm 2.

figure e

An email message is partitioned into various components (em) with individual weights (w). For each component, em, fingerprint is generated (fp(em)). The final weight (fw)for the entire email message (em) is computed as

$$fw\left(em\right)= \sum_{i=1}^{n}{w}^{i}\sum_{j=1}^{m}{fp(em)}^{i}$$

3.3 Near duplicate Matching Process

Matching process compares each testing email with a known spam database. An indicator score for each email is calculated and spam mails are identified with a threshold value.

$$Spam\left(em\right)= \left\{\begin{array}{c}Spam ,\, if\, fw\left(em\right)>t\\ Ham ,\, if\, fw\left(em\right)\le t\end{array}\right\}$$

The fingerprints generated from the email components are maintained in a tree data structure. Here, we define a tree to be an ordered data structure that stores the fingerprints as dynamic sets with leaf nodes as strings. Initially the tree will be empty. After the extraction of fingerprints, insertion happens if the fingerprints are not already in the tree. The tree is built in the training phase with the known spam database fingerprints. Fingerprints of each abstraction is collected and put into the leaf nodes to the root nodes of the tree (ie) from low level of the tree to the higher level. A traversal from the root node to a leaf node demonstrates one single abstraction fingerprint. Matching of the email abstraction fingerprint is done from the root node to the leaf node. Matching process is also checked with the cumulative weights of the fingerprints compared with a threshold value.

3.4 Reputation Evaluation

The main goal of collaborative email spam detection is to build a known spam database collecting feedback from different users. This known spam database is used to match the incoming mails and further block the near duplicate spams. To validate the feedback of the users regarding the truthfulness of spams reputation evaluation is carried out for each reporter.

  • A score is assigned to each reporter when he reports a spam fingerprint for the first time.

  • For each reporter, current reputation score is calculated with respect to the last score and current feedback multiplied with a weight.

  • If the feedback is found to be true, the reputation score is incremented by a small value.

  • If the feedback if found to be false, the reputation score is decremented by half of its value.

Considering the multiplication factor as β, the reputation is calculated as follows

$$Repute\left(em\right)= Repute\left(em-1\right)+\beta \times feedback(em)$$

To avoid errors in reputation evaluation, the score is incremented in a minimum value and for false positive errors it is dropped drastically.

4 Experiments and Results

The proposed system model is experimented with a well-known spam system, SpamAssassin. It is assumed that there are about 45,000 spams per day and 5,000 legitimate mails in the data set. The model is compared with density based, digest based and collaborative methods. The authors adopt the email representation method with different parts of the email. Time of execution of the model and the accuracy in detecting spams are evaluated. Execution time of generating the email abstraction layout is compared with the other methods and the results are shown in Fig. 2.

Fig. 2
figure 2

Execution Time analysis with varied number of emails

The detection accuracy of spam and non spam is also experimented and is compared with the other two approaches. The experiment is set to check whether the proposed spam detection system is withstanding intended malicious attacks. In collaborative approach a set of already reported spams are inserted in to the database and the near duplicate matching scheme is processed. The model is evaluated for 10 days spam and the True Positive (which is a real spam) rate and False Positive (a misclassified ham as spam) rate is observed. The table values in Fig. 3 lists the TP and FP rates for 10 days.

Fig. 3
figure 3

Accuracy Evaluation of Spam detection

As seen in Fig. 3, the proposed model reports on average 95.13% of True Positive rate and 0.511% of False Positive rate. This shows an efficient performance when compared to the other two schemes. Though Collaborative Reputation method has reported highest True Positive rate, its highest False Positive rate of 83.184% is highly unacceptable. When compared with COSDES, the proposed model achieves better performance both in True Positive rate and False Positive rate.

5 Conclusion

The field of collaborative spam detection represents reported spams and the near duplicate matching process for efficient spam detection and filtering. Email content alone should not be taken into consideration for near duplicate matching as the evolving nature of spams varies with respect to email layouts. This paper explores the enhanced version of email layout abstraction method combined with fingerprint generation for near duplicate matching. This enhanced feature can more effectively capture the cunning spams. The spam database is also updated with the newly evolving spams and is kept up-to-date for blocking subsequent spams. Experimental results prove the effectiveness of the proposed method recommending this approach for efficient spam detection.