1 Introduction

In the past few years, short message communication media, such as mobile phones and microblogging social networks, have become an essential part of many people's daily routines. Mobile devices offer a plethora of textual communication options and provide convenient platforms for users to carry out different activities, such as accessing resources on the Internet, e-banking transactions, entertainment, instant messaging, and the Short Message Service (SMS). The number of mobile users has increased dramatically in recent years, with an estimate of over 7 billion subscriptions globally [19]. The common form of textual communication between mobile devices is SMS, which uses standardized communication protocols to enable mobile phones to exchange text messages of up to 160 characters [6]. On the other hand, microblogging online social networks, such as Twitter and Sina Weibo, are used for a range of social activities, including posting interesting content about past experiences, locating long-lost friends, posting photos and videos, and building communities of family, acquaintances, and friends. Microblogging social networks have been in existence for almost a decade; the launch of Twitter in 2006, for instance, was followed by a rise in the number of microblogging platforms [2, 44]. The common characteristic of microblogging networks is that they allow users to share short messages, usually called microposts or tweets, with a maximum of 140 characters. These distinguishing characteristics of SMS and microblogging messages have forced users to introduce many domain-specific words. As a consequence, the performance of traditional semantic analysis approaches for spam detection has degraded [6, 24]. The increasing popularity of microblogging and mobile communication media has attracted the attention of spammers, who use these platforms to spread bogus content [1, 6, 13]. Despite the various benefits offered by mobile and microblogging platforms, they have become popular media for distributing spam messages [26, 46].

Spamming is the spreading of bulk unsolicited content, usually for the purpose of advertising, promoting pornographic websites, fake weight-loss products, bogus donations, fake news, online job scams, and a host of other malicious intents perpetrated by spammers. The rise in spamming activities on various communication media has long been investigated. For instance, between 2009 and 2012, Akismet identified over 25 billion comment spams in WordPress blogs, and the proportion of email traffic that was spam in 2013 was about 69.6% [12]. The problem of spam distribution has spread beyond email and blog platforms. The increasing rate of mobile SMS spam was analyzed in a Cloudmark report [6], which revealed that the distribution rate of mobile SMS spam varies across regions. For instance, in parts of Asia, spam accounted for about 30% of mobile messages, and an estimated 400% increase in unique SMS spam campaigns was witnessed in the U.S. during the first half of 2012 [6]. According to a Nexgate report in 2013, social spam grew by almost 355%, and for every seven new social media accounts created, at least five were spammers' accounts [34]. As a result, mobile and social networks are becoming targets for spam distribution. Aside from using microblogging social networks to spread spam content, spammers also create fake profiles to mislead legitimate users. They engage in underground market services where they can purchase fake followers to boost their profile reputation. This illegal behavior undermines the reliability of information generated on microblogging social networks and negatively affects systems that use follower and friend connections to predict a user's influence [25].

Unlike the email spam domain, with its rich contextual information and large public datasets, mobile and microblogging messages are usually shorter and permit the inclusion of entities such as abbreviations, bad punctuation, shortened URLs, and emoticon symbols. These characteristics degrade the performance of email spam detection filters when they are used to identify spam content in short message communication media [6, 12]. In addition, traditional content-based analysis using the bag-of-words model has produced low detection accuracy [17]. Moreover, the majority of studies on spam message detection have focused on email and webpage spam filtering [32, 36]. Recently, short message spam identification has attracted growing interest in the research community and several approaches have been studied. For instance, Almeida et al. [6] introduced a raw, non-encoded SMS spam collection known to be the largest public SMS spam dataset in the literature. The authors proposed several classification models to benchmark the dataset and found that the Support Vector Machine (SVM) outperformed the other classifiers. However, the accuracy of these baseline models still needs to be improved, and further analysis of the proposed corpus is a welcome development. Martinez-Romo and Araujo [31] proposed a statistical language model and content analysis techniques to detect spam messages in trending topics on the Twitter microblog. Chan et al. [12] studied adversarial attacks on short message spam detection filters and proposed a reweighting method with a new rescaling function to combat evasion of spam filters. Although the proposed model increases the security of the spam detection system, its classification accuracy on untainted samples drops significantly. El-Alfy and AlHasan [19] proposed a Dendritic Cell Algorithm (DCA), inspired by danger theory and immune-based systems, to detect email and SMS spam messages. In this paper, we consider both the spam message and spam account detection problems within a single framework in order to provide a more efficient and compact method to combat evasion of spam filters.

Existing studies on spam account detection have used different detection approaches [21, 26]. For example, Ghosh et al. [21] applied social network analysis to distribute trust values using both known spammers and legitimate accounts as initial seeds. Their algorithm, Collusionrank, assigns trust and distrust values to the neighbors of the selected seeds; the value assigned to each account indicates the strength of trust and is used to identify other spammers on the network. Since the number of seeds is very limited relative to the overall size of the Twitter microblogging network, the initial scores of the original seeds can dilute easily. This may propagate imprecise scores to many accounts on the network, which makes the approach less effective at ranking unknown users as spammers or legitimate [29]. Another line of research has focused on identifying features for spammer detection that can be used to train machine learning algorithms. For instance, Lee and Kim [26] proposed five name-based features from Twitter account groups. The problem with this approach is the evasion of name-based features: spammers can break the detection method by using different character combinations to generate account names that mimic the characteristics of legitimate accounts. In addition, the use of underground markets to purchase fake followers and tweets further limits the capability of existing solutions that rely on the number of followers and tweets. Hence, it is important to investigate the different features that can be used to identify spam messages and spam accounts in order to provide a compact and more secure spam detection system.

This paper proposes a unified framework that can detect both spam messages and spam accounts in short message communication media. By exploring five categories of features with a bio-inspired evolutionary search algorithm, a compact model for spam account detection in microblogging social networks is proposed. The paper further identifies a minimal set of features that can be used to detect spam messages on both mobile and microblogging platforms. Through rigorous experiments using ten (10) machine learning algorithms, the best classifier for the unified approach is identified. In particular, the contributions of this paper are summarized as follows:

  1. Propose a unified framework for spam message and spam account detection (SMSAD), which explores a minimal number of features to provide an effective spam filter for short message communication media.

  2. Apply a bio-inspired evolutionary search algorithm to identify a reduced feature set for spam account detection on the Twitter microblogging social network.

  3. Introduce a set of unique features to complement the features proposed in related studies.

  4. Train and test ten (10) classification algorithms to identify the best classifier for the proposed unified framework.

  5. Propose the Random Forest classifier as the best algorithm for spam message detection and the LogitBoost classifier as the best algorithm for spam account detection; based on the results of the various experiments conducted, these classifiers are incorporated in the machine learning phase of the unified framework.

The remainder of this paper is organized as follows. Section 2 discusses related work on spam message and spam account detection. Section 3 presents a detailed discussion of the proposed method. Section 4 highlights the results obtained from the different experiments conducted. Section 5 compares the results of the proposed unified spam message and spam account detection framework with related studies. Finally, Section 6 concludes the paper and highlights future directions.

2 Related work

Research on spam message and spam account detection in communication media has received growing interest in recent years. Spam message detection studies the textual information posted by spammers using techniques such as natural language processing with machine learning [9, 12, 31]. The majority of studies on spam message detection focus on content-based analysis and treat textual content as a collection of documents, where each message is preprocessed and represented using the vector space model (VSM). The VSM is a widely used method for text representation; each vector is described using the bag-of-words model, in which a document is represented as the bag of the words it contains, neglecting grammar and word order. Individual documents can be represented by the Boolean occurrence of each word or by the frequency of each word [17, 48]. A more sophisticated scheme, Term Frequency-Inverse Document Frequency (TF-IDF), has been studied to establish the importance of a word in a document [37]. Bayesian models have been proposed for SMS spam classification using content analysis techniques [11, 48]. Yoon et al. [45] combined content analysis with challenge-response to provide a hybrid model for mobile spam detection: the content-based spam filter first classifies a message as spam, legitimate, or unknown, and an unknown message is further authenticated using a challenge-response protocol to determine whether it was sent by a human or an automated program. El-Alfy and AlHasan [19] introduced a DCA algorithm to improve the performance of anti-spam filters using email and SMS data. Chan et al. [12] investigated the capability of existing spam filters to defend against adversarial attacks and introduced a reweighting method with a new rescaling function to prevent adversarial attacks on a linear SVM classifier. Although the proposed model increases the security of the spam filter, its classification accuracy on untainted samples drops significantly.
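As an illustration of these text representations, the snippet below builds Boolean, term-frequency, and TF-IDF vector space models with scikit-learn. The two example messages are placeholders, and this is only a sketch of the general technique, not the setup used in any of the cited studies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

messages = ["Win a FREE prize now, call 0800123456",   # toy spam-like message
            "Are we still meeting for lunch today?"]   # toy legitimate message

# Bag-of-words: Boolean occurrence or raw term frequency per document
bool_vsm = CountVectorizer(binary=True).fit_transform(messages)
freq_vsm = CountVectorizer(binary=False).fit_transform(messages)

# TF-IDF weighting down-weights words that appear in many documents
tfidf = TfidfVectorizer(ngram_range=(1, 1))            # unigram features
tfidf_vsm = tfidf.fit_transform(messages)
print(sorted(tfidf.vocabulary_))                       # the VSM dimensions (words)
```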

Research on spam account detection in social networks has used three major approaches: blacklists, graph-based methods, and machine learning. The first blacklist-based analysis of Twitter was carried out by Grier et al. [23]. The authors demonstrated that 8% of 25 million links shared on Twitter point users to phishing, malware, and other scam websites listed on the most popular blacklists. They also found that a large proportion of the accounts used for spamming on Twitter were hijacked from legitimate users. A further analysis of clickstream data of users' activities confirmed that Twitter is a successful platform for distributing spam messages. Grier et al. [23] also investigated the effectiveness of using blacklists to reduce spamming activities and discovered that the blacklist method is too slow in detecting new social threats, exposing more than 90% of legitimate users to spam risk: it takes a long period before a newly identified malicious link is flagged by the popular blacklists. In addition, the blacklist-based approach is sometimes platform-dependent. For instance, a malicious link caught by Google Safe Browsing may not be identified by the URIBL blacklist, making a spam account detection filter depend on many external resources.

In the graph-based method, a social network is modeled as a graph consisting of nodes (users) and edges (connections), and the connections between nodes are analyzed in order to detect accounts with unusual characteristics [2]. This method has proved suitable for separating spam accounts from legitimate ones. For example, Ahmed and Abulaish [4] applied the Markov clustering (MCL) algorithm to group a set of profiles into spam and non-spam. The MCL algorithm takes a weighted graph as input and uses a random walk approach to assign probabilities to each node in the network; based on the assigned probabilities, the algorithm clusters sets of profiles using the Frobenius norm. Ghosh et al. [21] analyzed link farming activities on Twitter and proposed the Collusionrank algorithm, which penalizes users that connect with spammers on the network. This approach discourages link farming by lowering users' scores for establishing suspicious connections to malicious accounts. The algorithm assigns trust and distrust values to the neighbors of the accounts chosen as initial seeds, and the value assigned to each account indicates the strength of trust and is used to identify other spammers. Since the number of seeds is very limited relative to the overall size of the OSN, the initial scores of the original seeds may propagate imprecise scores, which are less effective for ranking unknown users as spammers or legitimate [29]. While the graph-based method provides a suitable approach to identify spammers on social networks, one of its notable weaknesses lies in the computational complexity of dealing with a large social network graph.

The machine learning (ML) approach plays a significant role in spam account detection on microblogging social networks. ML incorporates two main methods: supervised and unsupervised learning. Supervised learning analyzes training samples and generates a classification model for predicting the class of a new user. Unsupervised learning, also known as clustering, differs in that no labeled data is present during the training stage; the algorithm learns from the data itself by identifying similarities among the instances. One advantage of supervised and unsupervised machine learning approaches is that they provide the opportunity to study different features for spammer detection, which are encoded to train the learning algorithms. However, the major challenge with these methods is the evasion tactics used by spammers to avoid detection.

Using supervised learning, Chu et al. [15] combined content and behavioral features to distinguish spam from legitimate campaigns using the Random Forest algorithm. Aggarwal et al. [3] combined different categories of features based on content and user profile information to build a tool called PhishAri, which is capable of identifying tweets with malicious URLs on Twitter. Martinez-Romo and Araujo [31] combined language- and content-based features to train an SVM classifier. Liu et al. [28] and Zheng et al. [49] studied spam account detection on the Sina Weibo microblogging social network, the most popular microblogging network in China with more than 500 million users; the authors studied different content and user profile information to train machine learning algorithms. Benevenuto et al. [10] applied the SVM algorithm to detect spammers on Twitter, identifying spammer characteristics related to tweet content and user behavior to separate spam and legitimate accounts.

In the unsupervised ML method, Egele et al. [18] developed COMPA, a system that exploits a statistical model and anomaly detection techniques to identify compromised accounts on social networks. The system extracts different features from a user's messages, such as time of day, message source, message text, message topic, message links, and direct user interaction. A behavioral model is built for each feature category and a global threshold value is computed for all the models; any new message from the same user that violates these behavioral characteristics is considered malicious. Lee and Kim [26] proposed a hierarchical clustering approach to initially group spammers with malicious profile names. They trained a Markov chain model with valid account names identified from Twitter, so that an account is flagged as malicious if its name deviates from the identified patterns of legitimate account names. In addition to the unsupervised learning approach, the researchers trained an SVM algorithm using different name-specific features based on the clusters identified by the hierarchical model. The main issue with this approach is that spammers can use evasion tactics to generate account names that mimic legitimate accounts. To evade existing spam account detection models that rely on the number of followers and tweets, spammers purchase fake followers and tweets from different underground markets [44]. For instance, a platform such as Intertwitter (http://intertwitter.com/) offered 10,000 fake follower accounts at a rate of $79, giving spammers the opportunity to embed themselves within the network of legitimate users. Fake accounts are now offered in large volumes, varying from thousands to millions [47]. These bogus accounts and their fake links undermine normal social network trust and disrupt the media as a means of effective social communication.

Motivated by the challenges inherent in spam message detection filters due to the limited message length, and by the evasion of features identified for spam account detection, this paper proposes a unified framework that is capable of detecting spam messages and spam accounts in short message communication media using a minimal number of features. To achieve these objectives, the proposed framework takes a slightly different approach from traditional content-based spam analysis based on the bag-of-words model. The next section provides a detailed discussion of the proposed method.

3 Methodology

The proposed unified framework targets spam message and spam account detection using Twitter and mobile data as test samples. Twitter is a microblogging online social networking service that enables users to post and read short messages usually known as tweets. Tweets can embed entities such as hashtags, mentions, and shortened URLs [14]. Users on the Twitter microblog use hashtags to group tweets by topic, as in the case of #RioOlympics2016, a popular topic discussed on Twitter during the 2016 Rio Olympic Games. A topic is categorized as trending if it receives much attention from users on the network; for example, #JustinBieber was one of the popular topics on Twitter in 2011. The mention feature uses the "@" symbol to indicate the users who receive a tweet directly on their timelines. Studies have shown that spammers employ the mention tool for targeted attacks, since the Twitter microblog features unidirectional user binding [24]. Although Twitter has introduced features to deactivate unsolicited mentions, a majority of users still use the default account settings. The visibility of a tweet on the network is increased through re-tweeting, and re-tweeting a user's tweet has been identified as another strategy used by spammers to keep their accounts running [27]. In addition, spammers' accounts exhibit automated posting patterns, since spammers need to reach a large number of users on the network [14]. Even though the Twitter microblogging social network has become an important platform for real-time communication [5], it has gone through several forms of abuse at the hands of social spammers. This is evident in the rules introduced by Twitter to suspend accounts with abusive behavior [41].

On the other hand, mobile users can communicate using short text messages, which are delivered by the message center. The proposed framework can be implemented at the message center to provide a central SMS spam filter, and it can also be implemented on microblogging social networks to provide a robust classification system. The framework rests on the assumption that spammers will find it difficult to evade both the spam message and the spam account detection models at the same time; combining these two classification models will therefore provide an efficient spam filter. To reduce the spread of spamming activities on Twitter and mobile communication media, we propose the unified framework shown in Fig. 1.

Fig. 1 Proposed unified framework for spam and spammer detection

The input to the framework can originate either from a microblogging network, in which case the user's screen name or ID is provided to the proposed system, or from a mobile phone. The system collects both content and network information around the user's social connections, and both types of data are passed to the processing and feature extraction phase. The system extracts features from five categories: user profile, content, network, timing, and automation, which are used to detect spam accounts on the Twitter microblog. For spam message detection on Twitter, the content data is passed to the preprocessing module, where the text is represented using a minimal number of features; this representation stage is discussed further in the feature analysis section. For input from the mobile platform, the text message is passed to the preprocessing module and represented using a minimal number of features, as in the Twitter case. The features extracted from the various components are passed to the machine learning predictive models, pre-trained for both spam message and spam account detection, to identify the class category. During the training phase of spam account detection, the framework incorporates an evolutionary search algorithm to provide a minimal number of features for the spam account detection model. The results obtained from the different experiments reveal that the proposed unified framework is promising for detecting both spam messages and spam accounts in short message communication media.

3.1 Data collection

In order to evaluate the proposed unified framework for spam message and spam account detection, the data collection stage is divided into two parts: the first deals with the data used for spam message detection, and the second focuses on how we collected data for spam account detection.

3.1.1 Datasets for spam message detection

Table 1 shows the statistics of the three datasets used to evaluate the proposed framework for spam message detection. The first two datasets, SMS Collection V.1 and SMS Corpus V.0.1 Big, hereafter referred to as Dset1 and Dset2, are publicly available and have been used in [16, 19], respectively. The third dataset, the Twitter Spam Corpus, hereafter referred to as Dset3, is a collection of messages selected from the tweets posted by over 7000 users identified based on the Twitter suspension algorithm. Each of these datasets is discussed further in the subsequent sections.

Table 1 Dataset statistics for spam message detection

SMS spam message datasets

The first dataset, Dset1, is a corpus of spam and ham messages publicly available as raw messages at http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/. The corpus contains 747 spam messages and 4827 ham (legitimate) messages, a total of 5574 messages, and is freely available for research purposes. All 5574 messages are composed in English. The corpus includes 425 SMS spam messages extracted manually from the Grumbletext website, a UK forum at http://www.grumbletext.co.uk/ where mobile phone users can make public claims about SMS spam messages they have received; a list of 450 legitimate SMS messages collected from Caroline Tagg's PhD thesis at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf; and a collection of 3375 SMS ham messages drawn from the roughly 10,000 legitimate messages obtained from the National University of Singapore (NUS). In addition, the corpus contains 1002 SMS ham messages and 322 spam messages extracted from the SMS spam corpus collected by José María Gómez Hidalgo, which is also freely available and can be downloaded at http://www.esp.uem.es/jmgomez/smsspamcorpus/.

The second dataset for SMS spam message detection, Dset2, is a collection of 1002 ham messages and 322 SMS spam messages in English, collected by José María Gómez Hidalgo and Enrique Puertas Sanz. The corpus is freely available at http://www.esp.uem.es/jmgomez/smsspamcorpus/. It contains a list of 202 legitimate messages from Jon Stevenson and randomly selected ham messages from the NUS SMS corpus, a corpus of about 10,000 legitimate messages collected at NUS in Singapore. The raw messages were collected from volunteers who agreed that the corpus be made publicly available. In addition to the legitimate messages, the dataset contains a collection of 322 SMS spam messages extracted manually from the Grumbletext website.

Twitter spam message dataset

To the best of our knowledge, no public dataset is available for Twitter spam message and spam account detection, owing to the fear of violating users' privacy and the Terms of Use of the Twitter API, which prevent researchers from sharing tweet data. Therefore, we developed a crawler in Python that takes advantage of the Twitter REST API [42]. We executed eight crawlers using seven Amazon Web Services (AWS) instances and a desktop computer in our security lab. Amazon AWS is a secure cloud service that provides computing power, data storage, content delivery, and a host of other scalable functionalities [8]. The crawling process covered a period of 24 days, from 20 March to 12 April 2016, during which the crawlers collected all the tweets posted by the users in our dataset. The statistics of the tweets collected, as well as the number of spam accounts identified from the dataset, are shown in Table 2; Section 3.1.2 discusses how the spam accounts were identified. To build a collection of spam messages, we randomly selected 8000 spam tweets posted by spammers in the dataset. In addition, 10,000 tweets were randomly selected from the accounts identified as legitimate, giving a total of 18,000 tweets used to test the proposed framework for spam message detection on the Twitter microblog, as shown in Table 1.
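The snippet below is a minimal sketch of such a timeline crawler, assuming the tweepy client for the Twitter REST API v1.1; the credentials and screen names are placeholders, and the crawler actually used in this study may differ.

```python
import tweepy

# Credentials below are placeholders for illustration only.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def crawl_user_tweets(screen_name, limit=3200):
    """Collect up to `limit` of a user's most recent tweets (the API caps this at ~3200)."""
    tweets = []
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name=screen_name, count=200).items(limit):
        tweets.append({"id": status.id,
                       "text": status.text,
                       "created_at": status.created_at})
    return tweets

collected = crawl_user_tweets("example_user")   # hypothetical screen name
```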

Table 2 Summary of the data collected from Twitter

3.1.2 Dataset for Twitter spam account detection

One of the most important stages in developing a reliable classification model is the identification of the labeled samples to be used for both training and validation. Several techniques have been employed in the related studies to achieve this objective. These include honeypot, blacklist, and the Twitter suspension algorithm.

The honeypot approach was originally proposed by Lee et al. [27]. This approach uses social honeypots to harvest deceptive spam accounts from Twitter and MySpace. The social honeypot logs users' activities, such as content posting patterns, friendship requests, and profile information, in a database. All accounts that send unsolicited friend requests are analyzed to find evidence of spamming before they are added to the spammer list. The goal of this approach is to reduce the challenge of manually identifying spam accounts on social networks, and Yang et al. [44] adopted it to identify spammers in their dataset. One issue with the honeypot approach is that the honeypot needs to collect a large amount of data for behavioral analysis before suspected accounts can be categorized as spammers or legitimate. In addition, the approach is slow to acquire a significant proportion of spam accounts. To obtain more spammers for developing a classification model, Yang et al. [44] combined the honeypot and blacklist approaches to detect 2000 spam accounts in their dataset.

The second approach involves the use of popular blacklist APIs, such as PhishTank, Google Safe Browsing, and URIBL [22, 35, 43]. As discussed in Section 2, the goal of the blacklist-based approach is to identify accounts whose tweets include malicious links flagged by the blacklist APIs. These accounts are marked as spammers and added to the list of labeled samples. This approach was employed to identify spammers in the work of Aggarwal et al. [3] and Yang et al. [44].

The third approach relies on the Twitter suspension algorithm [41]. Twitter suspends accounts once it detects abnormal behavior in their posting patterns, such as spreading malware, pornographic content, harassment, invitation spam, and other abusive behaviors [1, 41]. Thomas et al. [40] and Hu et al. [24] applied this technique to identify spammers. Since the suspended accounts come from the target microblogging social network, we use this labeling approach to identify spammers in this study. We ran a batch script in Python after a period of six months to identify the accounts that had been suspended by Twitter. In total, the script returned 3648 suspended accounts. We selected a total of 4000 accounts from the unsuspended users, giving 7648 labeled samples, as shown in Table 2. Based on the identified spammers and legitimate accounts, we introduce a set of unique features in addition to the features previously identified in related studies. We also construct the mention graph from the users' tweets; the goal is to extract graph-based features around the users' mention behavior to classify an account as spammer or legitimate.
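A minimal sketch of such a suspension check is shown below, assuming tweepy 3.x; the use of API error code 63 ("User has been suspended") as the suspension signal is our assumption about the Twitter API v1.1 behavior, and the credentials are placeholders.

```python
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def find_suspended(user_ids):
    """Return the subset of user_ids that Twitter reports as suspended."""
    suspended = []
    for uid in user_ids:
        try:
            api.get_user(user_id=uid)          # succeeds for active accounts
        except tweepy.TweepError as e:
            if e.api_code == 63:               # suspended by Twitter -> candidate spammer
                suspended.append(uid)
    return suspended
```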

3.2 Feature analysis

A critical stage in developing an effective classification model is the identification of features that can separate one class from another. The use of a machine learning approach to identify spam depends on many factors, the most important of which is the identification of features that help distinguish the spam class from the legitimate class. This section discusses the features used for both spam message and spam account detection.

3.2.1 Spam message detection features

This paper adopts a slightly different approach to extract features from the three spam message datasets discussed in the previous section. Motivated by the feature extraction process in [19], we extract eighteen (18) features from the three datasets, as shown in Table 3. The reason for adopting this feature extraction method is to provide a compact representation of each of the collections used for spam message detection. Unlike the VSM and bag-of-words models, where features for spam message detection are represented using the words present in each corpus through a unigram, bigram, or n-gram approach, this paper proposes a method that provides a more compact representation of the text messages. Each message in the corpus is passed through a preprocessing module.

Table 3 List of features extracted for spam message detection

The preprocessing module first converts the message to lower case and then proceeds to the tokenization phase, which separates the message into the words it contains using the unigram approach. The tokenized collection is then processed by removing stop words that make no significant contribution to the final representation. The stop word removal stage is followed by a stemming process, which generates the base or root form of each word in the corpus. For instance, the word “buying” is reduced to “buy”, and the words “credited” and “crediting” are both reduced to “credit”. The stemming stage was implemented using the Porter stemming algorithm embedded in the NLTK package in Python.
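The snippet below is a minimal sketch of this preprocessing pipeline, assuming NLTK with its punkt tokenizer and English stop-word list installed; it illustrates the general steps rather than the exact implementation used in this study.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# one-time downloads: nltk.download('punkt'); nltk.download('stopwords')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(message):
    tokens = word_tokenize(message.lower())              # lower-case and tokenize (unigrams)
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return [stemmer.stem(t) for t in tokens]             # Porter stemming

print(preprocess("He is buying gifts after being credited"))   # ['buy', 'gift', 'credit']
```

After completing these preprocessing steps, we extract the 18 features, discussed as follows: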

  • Frequency of spam triggered words: We collected a list of 257 spam-trigger words and phrases, such as urgent, call now, and free access, from the Comm100 website (https://emailmarketing.comm100.com/email-marketing-ebook/spam-words.aspx). Comm100 is an establishment that provides global enterprise-level customer service and communication solutions. It has been shown that spammers tend to use more spam words and phrases when composing spam messages [19]. Similarly, we collected a list of 393 spam words and phrases described on the HubSpot blog (http://blog.hubspot.com/blog/tabid/6307/bid/30684/The-Ultimate-List-of-Email-SPAM-Trigger-Words.aspx#sm.00000h35svjkfxez7rh42q7pa3mpp), a list of 438 spam words and phrases on the Automational blog (http://blog.automational.com/2016/03/08/spam-trigger-words-to-avoid/), and a list of 100 spam-trigger words and phrases on the Benchmark blog (http://www.benchmarkemail.com/blogs/detail/the-100-worst-spam-words-and-phrases). We compute the frequency of spam words that appear in each message as a feature. For instance, the frequency of Comm100 spam words, frequency of ultimate spam words, frequency of 438 spam words, and frequency of 100 worst spam words presented in Table 3 are calculated from the spam words and phrases collected from Comm100, the HubSpot blog, the Automational blog, and the Benchmark blog, respectively. In addition to these lists, we selected a list of spam words from the spam messages that appear in Dset1, Dset2, and Dset3, respectively; the frequency of combined spam words is calculated from this list. A sketch of how several of these counting features can be computed is given after this list.

  • Message length in character: This is the length of each message based on the number of characters present in the message.

  • Number of words: This feature represents the total number of words in the message. For instance, the message "Act now to win cash price" contains six (6) words.

  • Frequency of money words: In some situations, spammers try to entice legitimate users by sending unsolicited messages that request money. For this reason, we collected a list of money words such as thousand, million, and trillion, and we compute the frequency of money words that appear in each message as a feature.

  • Frequency of money symbols: The value of this feature is calculated using a regular expression, which identifies occurrences of money symbols in each message and counts the total number of times a money symbol is used.

  • Number of words in capital: We applied a regular expression to compute the number of words that appear in capital letters in each message.

  • Frequency of function words: Similar to the approach used in [19], the frequency of function words that appear in each message is computed and used as a feature. These are words with little or ambiguous lexical meaning, which are used to express structural relationship with other words in a sentence. A comprehensive list of function words can be found at (www2.fs.u-bunkyo.ac.jp/~gilner/wordlists.html#functionwords).

  • Number of special character: The number of special characters in each message is computed using a regular expression, and this value is used as a feature.

  • Number of emoticon symbol: Emoticon symbols such as sad, sigh, and happy are mostly used by legitimate users to express mood in a message. This feature is extracted using a regular expression that counts the number of emoticon symbols that appear in each message.

  • Number of links: Studies have shown that spammers can redirect their victims to phishing websites where sensitive information is collected and subsequently used for malicious purposes [13]. For this reason, the number of links that appear in each message is computed using a regular expression.

  • Frequency of phone number: This feature represents the number of times a phone number appears in each message. Almeida et al. [6] have shown that a large proportion of SMS spam messages contain phone numbers, which are intentionally added by spammers to lure their victims. This feature is extracted using a regular expression.

  • Average number of words: The average number of words in each message is calculated as the ratio of the number of words to the message length in characters.

  • Number of sentence: This feature represents the total number of sentences in each message, computed using the sentence tokenizer in the Python NLTK package.

  • Sentence ratio: This is the ratio of the number of sentences to the message length in characters.
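As referenced above, the snippet below is a minimal sketch of how several of these counting features can be computed with regular expressions; the patterns and the tiny spam-word list are illustrative placeholders rather than the exact expressions used in this study.

```python
import re

MONEY_RE = re.compile(r'[$£€]')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
PHONE_RE = re.compile(r'\+?\d[\d\-\s]{7,}\d')
CAPS_RE = re.compile(r'\b[A-Z]{2,}\b')
EMOTICON_RE = re.compile(r'[:;=8][\-^]?[)(DPp]')

SPAM_WORDS = {"free", "winner", "urgent", "cash"}   # placeholder spam-word list

def message_features(msg):
    words = msg.split()
    return {
        "spam_word_freq": sum(w.lower().strip('.,!?') in SPAM_WORDS for w in words),
        "char_length": len(msg),
        "num_words": len(words),
        "money_symbols": len(MONEY_RE.findall(msg)),
        "caps_words": len(CAPS_RE.findall(msg)),
        "num_links": len(URL_RE.findall(msg)),
        "phone_numbers": len(PHONE_RE.findall(msg)),
        "emoticons": len(EMOTICON_RE.findall(msg)),
        "avg_words": len(words) / max(len(msg), 1),   # words per character
    }

print(message_features("URGENT! Call 0800 123 4567 now to claim your free $1000 prize :)"))
```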

Figures 2, 3, and 4 show the empirical cumulative distribution function (ECDF) of the length of spam and legitimate messages for each corpus, as well as the distribution of the number of words in each class. The figures show that spam messages are longer than legitimate messages and contain more words. These findings suggest that a majority of spammers exploit the maximum message length to further deceive their victims and, as a result, tend to use more words when composing messages.

Fig. 2 ECDF of: (a) message length; (b) word distribution for Dset1

Fig. 3 ECDF of: (a) message length; (b) word distribution for Dset2

Fig. 4 ECDF of: (a) message length; (b) word distribution for Dset3

3.2.2 Spam account detection features

To detect spam accounts on the Twitter microblogging network, this paper focuses on five main categories of features. We propose a number of unique features to complement existing features for spam account detection from related studies. This section discusses the five categories of features used to train and test ten (10) classification algorithms: user profile, content, mention network, timing, and automation.

  • User profile features

User profile features were considered for spam account detection in the work of Yang et al. [44]. These features capture the basic profile information of an account, such as the number of followers and the number of friends, and their values are extracted from the meta-data returned by the Twitter microblog. They capture behavioral differences between accounts based on their profile contents; for instance, Lee and Kim [26] established that the screen names of spammers are usually longer than those of legitimate users. Table 4 shows the user profile features used in this study together with the additional features introduced to complement the existing ones; the newly introduced features are highlighted in bold.

Table 4 Description of user profile features
  • Content-based features

Content-based features study the behavioral patterns of social network accounts around the tweets posted by the users. Studies have shown that spammers lure their victims into clicking malicious links embedded within tweets, and the victims' accounts are compromised upon visiting the malicious website [23, 44]. Many social spammers dedicate their efforts to posting duplicate tweets, and they employ automated tools to post tweets with very similar semantics [44]. We design a set of statistical features, shown in Table 5, to evaluate the classification results of the selected classifiers; the newly introduced features are highlighted in the table.

Table 5 Description of content-based features
  • Network-based features

These features capture the connections or interactions among users on the Twitter microblog. We model users' mentions as a graph G = (V, E), where V represents the vertices and E the edges corresponding to the mention links between users. If a user u mentions user v in a tweet, we construct an edge u → v, which indicates a directed link from u to v; the graph G is therefore a directed graph that models users' mention patterns. We extract a set of graph-based network features, shown in Table 6, from graph G, together with some neighborhood-based network features defined in the work of [44]. The newly introduced features are highlighted in the table, and a sketch of how these graph metrics can be computed is given at the end of this subsection.
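For illustration, the snippet below is a minimal sketch of how such a mention graph can be built with the networkx library from (author, tweet text) pairs; the example tweets and screen names are placeholders.

```python
import re
import networkx as nx

MENTION_RE = re.compile(r'@(\w+)')

def build_mention_graph(tweets):
    """tweets: iterable of (author_screen_name, tweet_text) pairs."""
    G = nx.DiGraph()
    for author, text in tweets:
        for mentioned in MENTION_RE.findall(text):
            G.add_edge(author, mentioned)   # directed edge author -> mentioned user
    return G

G = build_mention_graph([
    ("alice", "Lunch later @bob?"),
    ("spam_bot", "Win big now @alice @bob @carol http://spam.example"),
])
```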

Table 6 Description of network features
  (a) Local clustering coefficient of mention

We extract the local clustering coefficient feature, a useful metric for determining how close a vertex's neighbors are to forming a clique (a small group of people with shared interests). As opposed to the work of [44], we focus on extracting graph-based features around the mention network, which enables us to study the mention relationships among users in the dataset. For each vertex u in the mention graph G, the local clustering score can be computed with Eq. (1), where K_u is the sum of the in-degree and out-degree of the vertex and e^u is the total number of edges built by all of u's neighbors. We noticed that the local clustering coefficient of spammers based on the mention network is smaller than that of legitimate users; the reason may be that spammers mention target users randomly, and these accounts may not know each other in reality.

$$ LCC(u)=\frac{2\left|{e}^u\right|}{K_u\left({K}_u-1\right)} $$
(1)
  (b) Betweenness centrality of mention

Betweenness is a centrality measure that uses shortest paths to compute the strength of a vertex in the graph. The metric is obtained using Eq. (2), where σ_st is the total number of shortest paths from node s to node t, σ_st(u) is the number of those paths that pass through the vertex u, and n is the total number of nodes in graph G. Similar to the behavior observed for the local clustering coefficient of the mention network, we noticed that the betweenness centrality of spammers is smaller than that of legitimate users.

$$ {C}_B(u)=\sum_{s\ne u\ne t\in V(G)}\frac{\sigma_{st}(u)}{\sigma_{st}} $$
(2)
  (c) Bidirectional link of mention

The bidirectional link of the mention network is the total number of links reciprocated by the users mentioned in the tweets. Because spammers randomly mention users in their tweets to launch targeted attacks, they tend to receive fewer bidirectional links from the mentioned accounts than legitimate users do.

  (d) Bidirectional link ratio of mention

The bidirectional link ratio is the ratio of the number of bidirectional links of a vertex to its out-degree. The value is usually low for spammers and high for legitimate users.

  (e) In-degree of mention

This feature is the total number of edges that enter a node, computed using Eq. (3). The value is low for spammers and high for legitimate users.

$$ {d^{in}}_u={\sum}_{\left[v,u\right]}G\left(v,u\right) $$
(3)
  (f) Out-degree of mention

This feature is the total number of edges that leave a node, computed using Eq. (4). The value is high for spammers and low for legitimate accounts, because spammers tend to mention more users for targeted attacks than legitimate users do.

$$ {d^{out}}_u={\sum}_{\left[u,v\right]}G\left(u,v\right) $$
(4)
  (g) Degree reputation of mention

This is the in-degree of a vertex normalized by its total degree (the sum of the in-degree and out-degree), as shown in Eq. (5). The degree reputation of mention for spammers is low compared to that of legitimate users.

$$ dr(u)=\frac{{d^{in}}_u}{{d^{in}}_u+{d^{out}}_u} $$
(5)
  (h) Degree centrality of mention

This feature is the sum of the in-degree and out-degree of a vertex, as shown in Eq. (6). In our analysis, the degree centrality of spammers based on the mention network is low compared to that of legitimate accounts.

$$ \deg (u)={d^{in}}_u+{d^{out}}_u $$
(6)
  (i) Closeness centrality of mention

The closeness centrality metric measures the importance of a vertex based on how close it is to the other vertices in the graph. The most central vertices are important because they can reach the whole network more quickly than non-central vertices, so the metric can be used to measure the quality of a node's connections within the network. Closeness centrality is obtained using Eq. (7), where d(v,u) is the distance between vertices v and u and n is the number of vertices. We noticed that the closeness centrality of spammers based on the mention network is low compared to that of legitimate accounts.

$$ C(u)=\frac{n-1}{\sum_{v\in V(G)}d\left(v,u\right)} $$
(7)
  (j) Eigenvector centrality of mention

Eigenvector centrality measures how the centrality of a node depends on the centralities of its neighbors. The metric captures not only how the vertex is positioned within the network but also the quality of the links built with its neighbors. Eq. (8) shows how the eigenvector centrality is computed from the mention graph, where EC(v_j) is the eigenvector centrality of the vertex v_j connected to u, A = [a_ij] is the adjacency matrix, and λ is a constant. The eigenvector centrality of one vertex thus depends on the eigenvector centralities of the vertices it is connected to, and EC(v) is calculated by finding the eigenvector associated with the largest eigenvalue according to the Perron-Frobenius theorem [20]; the i-th entry of this vector is the eigenvector centrality score of the i-th vertex. The eigenvector centrality of spammers is low compared to that of legitimate users.

$$ EC\left({u}_i\right)=\frac{1}{\lambda}\sum_{j=1}^N{a}_{ij}EC\left({v}_j\right) $$
(8)
  (k) PageRank of mention

Google PageRank is a modified version of the eigenvector centrality metric: the PageRank of a vertex u depends on the PageRank of the vertices it is connected to in the graph. Eq. (9) shows how the PageRank score is obtained, where d = 0.85 is the damping factor, N is the total number of vertices in the mention graph, PR(v_j) is the PageRank of vertex v_j, M(u) is the set of vertices that link to vertex u, and L(v_j) is the number of outbound links of vertex v_j. We found that spammers have lower PageRank scores than legitimate accounts.

$$ PR(u)=\frac{1-d}{N}+d{\sum}_{v_j\in M(u)}\frac{PR\left({v}_j\right)}{L\left({v}_j\right)} $$
(9)
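As referenced earlier in this subsection, the snippet below is a minimal sketch of how these mention-graph features can be computed with networkx for a single user u. The clustering coefficient follows Eq. (1) literally, the degree reputation follows the in-degree/total-degree reading of Eq. (5), and the remaining metrics use networkx's built-in centrality functions rather than the exact implementation used in this study.

```python
import networkx as nx

def mention_network_features(G, u):
    """Graph-based features for user u on the directed mention graph G (illustrative sketch)."""
    neighbors = set(G.predecessors(u)) | set(G.successors(u))
    k_u = G.in_degree(u) + G.out_degree(u)
    # e^u in Eq. (1): edges among u's neighbours
    e_u = sum(1 for v in neighbors for w in neighbors if v != w and G.has_edge(v, w))
    lcc = 2 * e_u / (k_u * (k_u - 1)) if k_u > 1 else 0.0          # Eq. (1)

    in_deg, out_deg = G.in_degree(u), G.out_degree(u)               # Eqs. (3)-(4)
    bidirectional = sum(1 for v in G.successors(u) if G.has_edge(v, u))

    return {
        "lcc": lcc,
        "betweenness": nx.betweenness_centrality(G)[u],             # Eq. (2)
        "in_degree": in_deg,
        "out_degree": out_deg,
        "bidirectional_links": bidirectional,
        "bidirectional_ratio": bidirectional / out_deg if out_deg else 0.0,
        "degree_reputation": in_deg / k_u if k_u else 0.0,          # Eq. (5)
        "degree_centrality": in_deg + out_deg,                      # Eq. (6)
        "closeness": nx.closeness_centrality(G, u),                 # networkx variant of Eq. (7)
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000).get(u, 0.0),  # Eq. (8)
        "pagerank": nx.pagerank(G, alpha=0.85)[u],                  # Eq. (9)
    }
```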
  • Timing-based features

Timing-based features deal with the tweeting rate and following rate of an account; they examine the posting and following patterns of users on the Twitter microblog. These features have been studied in the work of [38, 44], and we adopt them from the related studies. Table 7 describes the two features in this category. Spammers follow a large number of users and generate more tweets than legitimate users.

Table 7 Description of timing-based features
  • Automation-based features

Similar to the timing-based features, we adopted the automation features used in [44]. Yang et al. [44] established that spammers resort to automation techniques for posting tweets because of the high cost of manually maintaining many spam accounts. The technique relies on using the API to post a large number of spam tweets on the network, so spammers' accounts exhibit a high rate of automation. In this regard, a higher API ratio implies automated behavior, which provides an indicator for flagging the account as suspicious. Table 8 describes the automation-based features adopted in our study.

Table 8 Description of automation-based features

4 Experimental setup

In this paper, the performance of ten (10) machine learning algorithms is evaluated to identify the best classifier for the proposed unified framework. The classifiers selected for the machine learning phase are Random Forest, J48, ADTree, SVM (sequential minimal optimization), Multilayer Perceptron (MLP), AdaBoost, Decorate, LogitBoost, Bayes Network, and Random Committee. The aim is to find the classifiers that perform best across the datasets used in this study. The parameter configuration of each classifier is shown in Table 9. This study used the popular WEKA machine learning tool to evaluate the performance of each classifier. The algorithms are grouped into four (4) main categories according to the WEKA machine learning package: Bayes, function, meta/ensemble, and tree-based. We used the same parameters throughout the experiments conducted on each dataset. The classification algorithms were run on a computer running the Ubuntu 14.04 operating system with 20 GB of random access memory (RAM) and a 3.40 GHz Intel Core i7 CPU.

Table 9 Parameter configurations of the selected classifiers

4.1 Evaluation measure

This section provides the details of the evaluation metrics employed in this study. Performance metrics provide a practical way to check the efficiency of a model. The classification performance of a model can be measured using a confusion matrix, a table that summarizes how well a classifier separates one class from the other. The general structure of the confusion matrix for a binary classification problem is shown in Table 10. In this table, True Positive (TP) and True Negative (TN) refer to the number of correctly classified spam and legitimate instances, respectively. False Positive (FP) is the number of non-spam samples classified as spam, while False Negative (FN) is the number of spam instances classified as non-spam.

Table 10 Confusion matrix for a binary class problem (spam and non-spam)

The parameters TP, TN, FP, and FN can be used to derive standard metrics such as the True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR), shown in Eqs. 10, 11, 12, and 13, respectively. TPR is also called the detection rate (DR), sensitivity, or recall, and indicates the accuracy of a classification model on the positive labeled samples. A combined metric known as the F-measure or F1-score has been widely used to measure the performance of a classification system; it is calculated as the harmonic mean of precision (Eq. 14) and recall, as shown in Eq. (15).

$$ TPR/DR/Sensitivity/Recall\ (R)=\frac{TP}{TP+FN} $$
(10)
$$ TNR/ Specificity=\frac{TN}{TN+FP} $$
(11)
$$ FPR=\frac{FP}{TN+FP}=1-\frac{TN}{TN+FP} $$
(12)
$$ FNR=\frac{FN}{TP+FN} $$
(13)
$$ Precision\ (P)=\frac{TP}{TP+FP} $$
(14)
$$ F- measure=\frac{2PR}{\left(P+R\right)} $$
(15)
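For concreteness, the helper below computes these metrics directly from the four confusion-matrix counts; the example call uses the Dset1 confusion-matrix values reported in Section 5.1 and is only a sketch of Eqs. (10)-(15).

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics derived from a binary confusion matrix (Eqs. (10)-(15))."""
    tpr = tp / (tp + fn)                # detection rate / sensitivity / recall
    tnr = tn / (tn + fp)                # specificity
    fpr = fp / (tn + fp)
    fnr = fn / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
            "Precision": precision, "F-measure": f_measure}

# Example with the Dset1 confusion matrix reported in Section 5.1
print(confusion_metrics(tp=707, tn=4817, fp=10, fn=40))
```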

4.2 Results and discussions

We conducted different experiments to evaluate the performance of the proposed unified framework. As discussed earlier, the performance of ten (10) machine learning algorithms was examined on the four datasets. The first experiment evaluates the performance of the classifiers for SMS spam message detection, and the second deals with microblog spam message detection; the eighteen (18) features extracted for spam message identification were used for both tasks. The third experiment evaluates the performance of the selected classifiers for spam account detection, and the fourth experiment uses a bio-inspired evolutionary algorithm to find a minimal set of features for spam account detection on the Twitter microblogging social network.

4.2.1 SMS spam message detection

The performance of the selected classification algorithms is examined on the two SMS spam datasets, Dset1 and Dset2, using the 18 features discussed in Section 3.2.1. This experiment is based on 10-fold cross-validation: the labeled samples are divided into 10 subsets of equal size, and in each round of training one of the 10 subsets is held out as the testing set to validate the classifier, while the remaining nine subsets are used to train the classification algorithm. The Random Forest classifier achieves the best results in the two experiments on SMS spam message detection. As shown in Table 11 and Fig. 5, Random Forest produces the best accuracy, F-measure, and AUC-ROC of 0.992, 0.991, and 0.997, respectively (see the highlighted row). The worst-performing classifier on Dset1 is the Bayesian network.
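The snippet below is a minimal sketch of this evaluation protocol using scikit-learn's 10-fold cross-validation with a Random Forest classifier; the random feature matrix is a placeholder for the 18-feature representation, and the study itself used WEKA rather than scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: n_messages x 18 feature matrix, y: 0 = ham, 1 = spam
rng = np.random.default_rng(42)
X = rng.random((100, 18))
y = rng.integers(0, 2, 100)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')   # 10-fold cross-validation
print(scores.mean())
```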

Table 11 Classification results with Dset1
Fig. 5 Performance of the selected classifiers using Dset1

Similarly, in Table 12 and Fig. 6, Random Forest produces an accuracy, F-measure, and AUC-ROC of 0.991, 0.991, and 0.999, respectively (see the highlighted row). As observed in the previous results on Dset1, the Bayesian network also achieves the lowest accuracy on Dset2.

Table 12 Classification results with Dset2
Fig. 6 Performance of the selected classifiers using Dset2

4.2.2 Microblog spam message detection

This section presents the experiment conducted to evaluate the performance of the selected classifiers for microblog spam message detection. Table 13 shows the results obtained. As observed for SMS spam detection, Random Forest also outperformed the other classifiers in this experiment. The accuracy is lower than the results obtained on the SMS spam datasets, which is a result of the limited size of tweet content and its domain-specific characteristics: Twitter permits a tweet length of 140 characters, and a majority of the tweets posted on Twitter include many abbreviations. Nevertheless, our approach is promising when applied to the Twitter spam dataset, producing an accuracy of 0.932 and an AUC-ROC of 0.983 with the Random Forest algorithm, as highlighted in the table. In this experiment, the worst-performing classifier in terms of detection rate (DR) and F-measure is the Bayesian network.

Table 13 Classification results with Dset3

4.2.3 Microblog spam account detection

To detect spam accounts on the Twitter microblogging social network, we first conduct an experiment using all 69 features extracted as discussed in the previous section. The result of this experiment shows that the Random Forest classifier performs best, achieving an accuracy of 0.932 and an AUC-ROC of 0.977, as shown in Table 14 (see the highlighted row). It is followed by the Decorate and LogitBoost ensemble classifiers, each achieving an F-measure of 0.929 and an AUC-ROC above 0.970. This result indicates that the proposed method can achieve an AUC-ROC above 0.970, showing the suitability of the framework for spam account detection on Twitter.

Table 14 Classification results for microblog spam account detection using 69 features

4.2.4 Evolutionary search method for feature reduction

Feature reduction is an important aspect of machine learning, as it reduces the number of features used for classification and, at the same time, the classifier's processing time. The method chooses a subset of the features and determines their discriminative power in distinguishing spam accounts from legitimate users. Since the goal is to develop a compact model for the simultaneous detection of spam messages and spam accounts on the Twitter microblog, we investigate the applicability of a bio-inspired evolutionary algorithm to produce a reduced number of features for spam account detection.

An evolutionary algorithm (EA) is a generic meta-heuristic optimization search approach that concurrently explores numerous points in a search space and navigates the space stochastically in order to prevent the search from being trapped in local maxima [30]. An EA uses biologically inspired evolution mechanisms, such as recombination, mutation, fitness, and selection. The generic structure of an EA is described as follows:

  Step 1: Initialization

For time t = 0, initialize a population P(t) such that P(t) = (x_1^t, x_2^t, ..., x_n^t). These are the initial points from which the EA explores the search space. In the case of feature selection, the population corresponds to different feature subsets selected from the original features.

Step 2: Evaluation

At this stage, each solution in the population is evaluated by measuring its fitness.

Step 3: Selection

This step creates a new population by stochastically selecting individuals from $P(t)$.

Step 4: Evolution

At this stage, the algorithm transforms some members of the new population created in Step 3 using genetic operators, such as crossover and mutation, to form new solutions.

Step 5: Termination test

Steps 2 to 4 are repeated until the termination condition is satisfied. The EA may terminate when a given number of iterations is reached, when a target fitness value has been achieved, or when the algorithm converges to a near-optimal solution. Figure 7 shows the parameter configuration of the EA used to perform the feature reduction, and a minimal sketch of the search procedure is given after the figure.

Fig. 7

Parameters configuration for evolutionary algorithm
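To make the steps above concrete, the following is a minimal sketch of a genetic-style evolutionary search over feature subsets, written in Python with scikit-learn. The population size, number of generations, mutation probability, tournament selection scheme, and the Random Forest fitness evaluator are illustrative assumptions and do not reproduce the configuration reported in Fig. 7.

```python
# Minimal sketch of an evolutionary (genetic) feature-subset search,
# mirroring Steps 1-5 above. All parameter values are illustrative
# assumptions, not the configuration shown in Fig. 7.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

def fitness(mask, X, y):
    """Step 2: score a feature subset by cross-validated accuracy."""
    if mask.sum() == 0:                          # guard against empty subsets
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

def evolve(X, y, pop_size=20, generations=30, p_mut=0.05):
    n_features = X.shape[1]
    # Step 1: random initial population of 0/1 feature masks
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    scores = np.array([fitness(ind, X, y) for ind in pop])
    for _ in range(generations):                 # Step 5: fixed-iteration stop
        new_pop = []
        for _ in range(pop_size):
            # Step 3: tournament selection of two parents
            a, b = rng.choice(pop_size, 2, replace=False)
            p1 = pop[a] if scores[a] >= scores[b] else pop[b]
            a, b = rng.choice(pop_size, 2, replace=False)
            p2 = pop[a] if scores[a] >= scores[b] else pop[b]
            # Step 4: uniform crossover followed by bit-flip mutation
            cross = rng.integers(0, 2, n_features).astype(bool)
            child = np.where(cross, p1, p2)
            flip = rng.random(n_features) < p_mut
            new_pop.append(np.where(flip, 1 - child, child))
        pop = np.array(new_pop)
        scores = np.array([fitness(ind, X, y) for ind in pop])
    best = scores.argmax()
    return pop[best].astype(bool), scores[best]
```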

To reduce the number of features for spam account detection, we first apply the popular Chi-squared feature selection evaluator with the ranker search method to select 60 features. The Chi-squared statistic (χ2) tests the independence of two events A and B, where independence is defined as P(AB) = P(A)P(B), or equivalently P(A|B) = P(A) and P(B|A) = P(B). For feature selection, the two events are the occurrence of the feature and the occurrence of the class. The features are ranked using Eq. (16):

$$ {\chi}^2\left(D,f,c\right)=\sum_{e_f\in \left\{0,1\right\}}\sum_{e_c\in \left\{0,1\right\}}\frac{{\left({N}_{e_f{e}_c}-{E}_{e_f{e}_c}\right)}^2}{E_{e_f{e}_c}} $$
(16)

where N is the observed frequency of the feature–class combination in D and E is the corresponding expected frequency. For a detailed discussion of the Chi-squared statistic for feature selection, we refer the reader to [39]. The EA is then executed on the 60 features selected by the Chi-squared test. Table 15 shows the eighteen (18) features obtained by the evolutionary algorithm.
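Under the assumption that the features are encoded non-negatively (as scikit-learn's chi2 scorer requires), the two-stage reduction described above might be sketched as follows; k = 60 follows the text, and evolve() is the helper from the previous sketch.

```python
# Two-stage reduction: chi-squared ranking (Eq. 16) keeps the top 60
# features, then the evolutionary search refines that subset.
from sklearn.feature_selection import SelectKBest, chi2

def chi2_then_ea(X, y, k=60):
    # Stage 1: univariate chi-squared ranking (features must be non-negative)
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    X_top = selector.transform(X)
    # Stage 2: evolutionary subset search over the k retained features
    mask, score = evolve(X_top, y)               # evolve() from the EA sketch
    kept = selector.get_support(indices=True)[mask]
    return kept, score                           # indices into the original features
```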

Table 15 Eighteen (18) features selected by evolutionary search algorithm

After reducing the 69 features to 18 using the combination of Chi-squared feature selection and the evolutionary search algorithm, the next step is to evaluate the reduced data on the selected classifiers. The results of this experiment are shown in Table 16 and Fig. 8. Although the performance of the Random Forest classifier drops slightly, the results of seven (7) of the ten (10) selected classifiers improve when only the 18 features identified by the evolutionary search are used. These results are highlighted in bold. With the reduced feature set, the LogitBoost classifier achieves a result close to that of Random Forest trained on all 69 features, producing an F-measure of 0.930 and an AUC-ROC of 0.977. With this result, a compact model for spam account detection is obtained by incorporating either LogitBoost or Random Forest in the machine learning phase of the proposed unified framework.
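A sketch of how such a re-evaluation on the reduced feature matrix might look is given below. scikit-learn does not ship LogitBoost, so GradientBoostingClassifier is used here only as a rough stand-in; the 10-fold protocol and metric names mirror the quantities reported in Table 16 rather than the authors' exact WEKA setup.

```python
# Re-evaluate classifiers on the reduced (e.g., 18-column) feature matrix.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_validate

def compare_on_reduced(X_reduced, y):
    models = {
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
        "LogitBoost-like": GradientBoostingClassifier(random_state=0),  # stand-in
    }
    scoring = ["accuracy", "f1", "roc_auc"]      # metrics reported in Table 16
    results = {}
    for name, clf in models.items():
        cv = cross_validate(clf, X_reduced, y, cv=10, scoring=scoring)
        results[name] = {m: cv[f"test_{m}"].mean() for m in scoring}
    return results
```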

Table 16 Classification results with reduced features
Fig. 8

Performance of the selected classifiers on the features reduced by the EA algorithm

5 Model evaluation

This section discusses how the results obtained for the proposed unified framework are evaluated and compared with related studies.

5.1 Performance of SMS spam message detection models

The confusion matrices obtained for SMS spam message detection are shown in Table 17. Based on dataset Dset1 (SMS Collection V.1), the proposed Random Forest model detects 707 of the 747 spam messages using the 18 features, achieving a TPR of 94.65%. For non-spam identification, the model recognizes 4817 of the 4827 legitimate messages, a TNR of 99.79%. Based on dataset Dset2 (SMS Corpus V.0.1 Big), the Random Forest model detects 314 of the 322 spam messages, a TPR of 97.52%, and correctly identifies 998 of the 1002 legitimate messages, a TNR of 99.60%. These results further demonstrate the capability of the proposed SMS spam detection models to separate spam from legitimate messages.

Table 17 Confusion matrices for Random Forest classifiers based on SMS spam detection
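The quoted rates follow directly from the counts in Table 17; the small helper below, which treats spam as the positive class, reproduces them.

```python
# Recompute TPR/TNR from the Table 17 counts (spam = positive class).
def rates(tp, fn, tn, fp):
    tpr = tp / (tp + fn)      # detection rate on spam messages
    tnr = tn / (tn + fp)      # detection rate on legitimate messages
    return tpr, tnr

print(rates(707, 747 - 707, 4817, 4827 - 4817))  # Dset1: ~(0.9465, 0.9979)
print(rates(314, 322 - 314, 998, 1002 - 998))    # Dset2: ~(0.9752, 0.9960)
```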

5.2 Performance of Twitter spam message and spam account detection models

To evaluate the performance of the proposed Random Forest models in distinguishing spam from non-spam, we constructed the confusion matrices shown in Table 18. On dataset Dset3, the proposed Random Forest correctly identifies 7399 of the 8000 spam messages, a TPR of 92.49%, and 9371 of the 10,000 legitimate messages, a TNR of 93.71%. On dataset Dset4 (spam accounts), the Random Forest model with 69 features correctly identifies 3288 accounts as spam, a TPR of 90.13%, while achieving a TNR of 95.95% on legitimate accounts. This analysis indicates that the proposed models are promising for separating spam messages and spam accounts on the Twitter microblog.

Table 18 Confusion matrices for Random Forest classifiers based on Twitter spam message and spam account detection

5.3 Comparison of spam message detection models

For SMS spam message detection, we compare the results of the Random Forest classifier with related studies. El-Alfy and AlHasan [19] and Almeida et al. [6] both implemented their approaches on Dset1 (SMS Collection V.1). On this dataset, the proposed model improves on El-Alfy and AlHasan [19] in precision and F-measure, although their method slightly outperforms ours in detection rate, recall, and AUC-ROC, as shown in Table 19 (see the highlighted row). The proposed model nevertheless delivers promising performance. Compared with Almeida et al. [6] and the Bag-of-words model, our approach shows significant improvement across all evaluation metrics. The Bag-of-words model was implemented using the NaiveBayesMultinomialText classifier in WEKA, which is designed specifically for text classification; the default parameters used for this classifier are shown in Fig. 9. On Dset2 (SMS Corpus V.0.1 Big), our model matches El-Alfy and AlHasan [19] in F-measure and AUC-ROC and improves in precision, as shown in Table 20. It also outperforms the Bag-of-words model.

Table 19 Comparison of models on Dset1
Fig. 9

Default parameters used for NaiveBayesMultinomialText classifier
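As a rough illustration of what such a bag-of-words baseline looks like, the pipeline below pairs a term-count vectorizer with multinomial Naive Bayes in scikit-learn; it is only an analogue of WEKA's NaiveBayesMultinomialText and does not reproduce the default parameters shown in Fig. 9.

```python
# Illustrative bag-of-words baseline (not the WEKA configuration of Fig. 9).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def bow_baseline(texts, labels):
    # labels are assumed binary, with spam encoded as 1
    pipeline = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
    return cross_val_score(pipeline, texts, labels, cv=10, scoring="f1").mean()
```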

Table 20 Comparison of models on Dset2

Since the Twitter spam message dataset is private, we benchmark only against the Bag-of-words model, as shown in Table 21. This evaluation shows that the proposed model significantly outperforms the popular content-analysis approach based on bag of words (see the highlighted row).

Table 21 Comparison of models on Dset3

5.4 Comparison of spam account detection models

Several approaches have been studied in the literature for spam account detection on Twitter, each using a different private dataset. The use of differing datasets stems from the Terms of Use of the Twitter API, which forbid researchers from sharing tweet data.

Therefore, to benchmark against the existing approaches selected from the literature, we extract the features used in those studies from our dataset and run the data through the classifiers employed in the related works considered for comparison. For instance, Yang et al. [44] evaluated their features using four classifiers: Random Forest, Decision tree, BayesNet, and Decorate. The results obtained for each experiment are presented in Tables 22 and 23, with the proposed methods highlighted.

Table 22 Comparison using Yang et al. [44] features
Table 23 Comparison using Shyni et al. [38] features

These results show that our proposed models, with both the 69-feature and the 18-feature sets, outperform the approaches of the two related studies selected for comparison on spam account detection in microblogging social networks. Thus, the proposed unified framework is promising for developing spam filters for mobile and microblogging social networks.

6 Conclusion

Spam detection is a continuous fight between spammers and spam filters. The increasing rate of evasion tactics against current detection filters has signaled the need to investigate state-of-the-art features that can separate spam from legitimate messages as well as reveal accounts utilized for malicious activities on microblogging social networks. Existing studies on spam detection in short message communication media focus on identifying spam messages and spam accounts using separate frameworks. Due to the inherent characteristics of SMS and microblogging messages, detecting spam messages and spam accounts has been quite challenging. This paper proposes a unified framework that can identify both spam messages and spam accounts. The performance of the proposed framework is investigated using four datasets, two of which come from the SMS spam detection domain and are publicly available for research purposes. The remaining two datasets were collected from the Twitter microblog to investigate the capability of the proposed framework for spam message and spam account detection on Twitter. In contrast to the traditional content-based method for spam message detection, 18 features were identified for detecting both SMS and microblog spam messages. In addition to content- and behavior-based features for spammer detection on Twitter, we study the mention network of spammers by extracting different graph-based features from the constructed mention graph. We observed that the mention behavior of spammers, as captured by these features, differs from that of legitimate users.

The contribution of this paper centers on addressing spam message and spam account detection within a single framework. To achieve this objective, the study introduced a unique set of features to complement those in related studies, which in turn improves the performance of the proposed framework. In addition, the study investigated the application of evolutionary computation to identifying discriminative features for spam account detection. Based on the classification performance of the different classifiers selected, the proposed framework has demonstrated the capability to detect both spam messages and spam accounts in short message communication media.

To identify the best classifier for the proposed unified framework, the performance of ten (10) classification algorithms was investigated. The results of the various experiments revealed that the Random Forest classifier produced the best models for spam message and spam account detection using all the extracted features. The SMS spam message detection model achieved accuracy, F-measure, and AUC-ROC above 99%, which shows the applicability of the framework to SMS spam detection. The microblog spam message detection model produced an accuracy of 93.2% and an AUC-ROC of 98.3%, while the spam account detection model achieved an accuracy of 93.2% and an AUC-ROC of 97.7%. A further investigation was carried out into obtaining a minimal number of features for spam account detection using a bio-inspired evolutionary search algorithm. After applying the evolutionary search method for feature reduction, the performance of seven of the ten selected classifiers improved significantly. The LogitBoost ensemble classifier produced the best result using the 18 features identified by the evolutionary search, with accuracy and F-measure of 93.0% and AUC-ROC of 97.7%.

In future work, we aim to combine the outputs of the spam message and spam account detection models using an ensemble-based method in order to identify the risk level associated with accounts used by spammers to post malicious content. Building such a risk assessment model could help identify the categories of accounts that pose a serious threat to legitimate users on the network.