1 Introduction

Spam became prevalent in the late 1990s and early 2000s, when email was considered the primary tool for information exchange among individuals and firms. With the introduction of email spam filters, spammers started looking at other platforms for better payoffs. One of these "money-making" platforms for spammers is social media sites (SMSs), which play an increasingly important role in our daily lives [1]. Nowadays, online social media data exhibits the 4V characteristics often used to describe Big Data, namely volume, velocity, variety, and veracity [2]. In terms of volume, the number of active users on Facebook and Twitter reached 1.55 and 0.32 billion, respectively, in November 2015, and over 500 million tweets are generated on a daily basis. Beyond their importance in daily life, the features of online social media (i.e., creation and exchange of user-generated content, support for collective action, and facilitation of diverse social interactions) make it an indispensable business tool for promoting e-commerce products and services. Thus, a growing number of e-commerce retailers choose online social media as a main marketing platform or as a new "social CRM" tool that fosters instant interactions with potential consumers.

SMSs likewise provide spammers with unprecedented opportunities to launch various attacks. Spammers perform deceptive acts [3, 4], conduct unfair trading activities [5, 6], and even make illegal profits [7] by posting social spam on SMSs. Social spam refers to low-quality information that users do not ask for or specifically subscribe to [8]. Social spam is used to launch phishing attacks [9], promote adverse websites [10], distribute malware [11], and spread adverse messages [12, 13]. Embedded URLs in social spam direct users to adverts, malware, or pornographic websites (see Fig. 1). According to Nexgate's state of social media spam report, social spam grew by 355 % in the first half of 2013 [14]. It was also found that Facebook and YouTube host more spammers than any other SMS. Grier et al. [15] reported that 8 % of the 25 million URLs posted on Twitter pointed to phishing, malware, or scams.

Fig. 1 Snapshot of social spam about Starbucks found on YouTube

The embedded URLs or deceptive contents in social spam gradually compromise consumers' trust, patience, and satisfaction and, even worse, lead to leakage of personal information or monetary loss [16]. A survey conducted by Maritz Research revealed that of 3400 individuals who referred to review sites, such as Yelp and TripAdvisor, only half trusted the reviews they had read [17]. Moreover, the Federal Bureau of Investigation and the American National White Collar Crime Center reported that monetary losses from scam websites reached $240 million in 2008 [18]. Consequently, consumers' trust in user-generated content on SMSs is decreasing.

Businesses leveraging social media to promote their products or services lose potential sales when competitors exploit social spam to inflate their own brand popularity. According to research conducted at Harvard Business School, a one-star increase in a restaurant's Yelp rating leads to an increase of around 9 % in that restaurant's revenue [19]. In fact, generating social spam has already proven to be an effective revenue stream for search engine optimization and public relations firms. For example, they intentionally post positive reviews or comments on SMSs to improve the reputation of their clients and, conversely, post negative content to damage the reputation of their clients' competitors.

SMSs, such as Yelp, Facebook, and YouTube, carry the burden of filtering and preventing social spam so that consumers and firms can find accurate information and extract accurate market intelligence. Recently, CNN reported that TripAdvisor (Italy) was levied a fine of €500,000 by the Italian Competition Authority (ICA) for unfair trade activity and misleading consumers, despite the fact that certain review filters had been implemented. In 2013, Twitter spent approximately $700,000 to prevent social spam. However, the existing spam filtering methods seem ineffective given the large number of spamming cases reported in the press in recent years; they are far from perfect. Furthermore, social spam leads to a huge waste of system resources, such as bandwidth and disk space, which is a severe problem in this era of Big Data.

Previous studies use different features (e.g., user-, text-, graph-, and social network-related attributes) and classification algorithms (e.g., Naïve Bayesian and Bayesian network) to design frameworks for detecting and reducing social spam on many social media platforms (e.g., Facebook, Twitter, Sina Weibo, Myspace, YouTube, and Flickr). However, to the best of our knowledge, previous studies have not exploited both probabilistic topic modeling and incremental learning for detecting social spam on SMSs. Our main contribution is the design of a novel methodology that integrates word-, topic-, and user-based features and applies labeled latent Dirichlet allocation (L-LDA) and incremental learning to improve the accuracy of social spam detection on SMSs. More specifically, the proposed computational method is underpinned by incremental learning and L-LDA [20] to mine latent topics describing the inherent semantics of social spam. L-LDA is a supervised variant of latent Dirichlet allocation (LDA), a topic modeling method originally developed by Blei et al. [21]. This probabilistic topic model assumes that each document is a mixture of various topics and that each topic is characterized by a set of words with a high probability of co-occurrence. The latent topics can be transformed into topic-based features for classification and combined with word-based features.

Our proposed methodology applies the most discriminative words identified by a χ2 test to label topics. We compare the three types of features, namely word-, topic-, and user-based features, through rigorous empirical experiments on a dataset of YouTube comments. Furthermore, we empirically verify the interpretability and discriminatory power of each spam topic extracted through the L-LDA model, which has not been reported in the literature before.

The rest of the paper is organized as follows. In section two, we review previous studies on features and classifiers for social spam detection, topic modelling for spam detection, and incremental learning, and identify the research gap. Sections three and four illustrate the proposed methodology for social spam detection and present the research hypotheses, respectively. Section five describes our experiments, which were conducted on a dataset of YouTube comments. In section six, we discuss the results. Finally, in section seven, we conclude the paper by elaborating on our contributions and proposing directions for future research.

2 Related work

2.1 Features and classifiers for social spam detection

State-of-the-art machine learning algorithms, especially supervised learning techniques, are the most common approach to detecting social spam. The general procedure is to: (1) extract features from spam messages; (2) use the extracted features to train a classifier; and (3) perform classification with the trained classifier. Previous studies have adopted a variety of features and classification algorithms for social spam detection on SMSs. Markines et al. [22] developed six kinds of features for detecting social spam: (1) TagSpam, which measures tag use and combinations that are statistically unlikely to appear in legitimate posts; (2) TagBlur, which measures semantic blur, such as the number of high-frequency tags; (3) DomFp, which estimates the likelihood that the content of a tagged page is generated automatically, through structural similarity with other pages whose annotations were manually labeled as spam; (4) NumAds, which measures the number of times an ad server appears in a Web page tagged by a user; (5) Plagiarism, which measures the number of results returned by a search engine, excluding the originating resource's URL; and (6) ValidLinks, which measures the number of user profiles created for spam purposes. These features were combined with an AdaBoost classifier to capture the properties of social spam in a public dataset from BibSonomy.org.

Lee et al. [23] proposed a honeypot-based approach for detecting social spammers on SMSs. They first studied spammer behaviors on Myspace and Twitter to develop a set of features (e.g., tweet similarity, marital status, and number of friends), which were then empirically tested using 60 different classifiers with a bag-of-words model representing the text-based features. Finally, the developed classifiers were applied to in-the-wild datasets, which supported the effectiveness of social honeypots as social spam detectors. Wang et al. [8] developed a social spam detection framework for multiple SMSs, such as Facebook, Myspace, and Twitter. The framework includes: (1) mapping techniques for converting network-specific objects to the framework's standard object model; (2) fast-path techniques, such as blacklists, hashing, and similarity matching, for pre-filtering incoming objects against known spam; and (3) classification techniques, such as Bayesian classifiers, for labeling objects as spam or non-spam. Associative classification was adopted to strengthen cross social-corpora classification. Jin et al. [24] designed a social spam detection framework for Facebook that uses the GAD clustering algorithm for large-scale clustering and an integrated active learning algorithm for scalability and real-time spam detection. This framework uses three types of features: (1) image content, such as color histogram, color correlogram, CEDD, Gabor features, edge histogram, and SIFT; (2) text, such as captions, descriptions, comments, and URLs; and (3) social network information, such as user characteristics and behaviors.

Lin and Jia [25] adopted three types of features to detect social spam on Sina Weibo: (1) lexical features, which measure the difference in behaviors between spammers and legitimate users; (2) status features, which measure outlink URLs, length of login, nature of topics, use of emoticons, and reposting patterns; and (3) user features, which measure the number of a user's followers and followees. The developed classifiers incorporated the Naive Bayesian algorithm, logistic regression, and support vector machines (SVM). Dae-Ha et al. [26] adopted social network features, such as request rejection ratio, request acceptance ratio, personality commonness, same community, and friend's friend, to train a Bayesian network classifier for detecting social spam on SMSs. Po-Ching and Po-Min [27] applied a J48 decision tree algorithm to analyze features, such as URL rate and interaction rate, for detecting spam accounts on Twitter. Sureka [28] proposed an effective method for detecting social spam in YouTube comments, which mines users' activity logs to extract patterns such as the average time difference between comments, percentage of comments, comment repeatability across videos, and comment repetition and redundancy.

2.2 Topic modelling and spam detection

Previous studies adopted various types of features (user-, text-, graph-, and social network-related attributes) to detect social spam. These are low-level features compared with high-level features such as topic-based features. Topic-based features generated from topic models serve as a more abstract representation of documents; however, their application to social spam detection has received relatively little attention from researchers. Indeed, spam detection performance might be improved by leveraging a high-level representation. Topic models, especially variants of the LDA model, are becoming increasingly popular in research areas such as information retrieval [29], bioinformatics [30], and image classification [31]. Accordingly, some researchers have attempted to apply LDA to spam detection and to the extraction of user-generated opinions from SMSs. Bíró et al. [32] adopted a multi-corpus LDA, an extension of the classical LDA model, for web spam detection. LDA has also been used to track trends in online opinions: Cui et al. [33] proposed an incremental Gibbs sampling algorithm to train the LDA model and thereby track the trends of topics being discussed online. Sizov [34] developed a framework for Web 2.0 content characterization with spatial awareness, which integrates Bayesian statistical models to describe spatial coordinates jointly with tag co-occurrence patterns. The proposed model extends classical LDA: besides the LDA-like tag generation process, it integrates topic-specific normal distributions to describe location (i.e., latitude and longitude). Its content categorization, clustering, and tag recommendation capabilities were tested on a dataset from Flickr.

2.3 Incremental learning as a new learning paradigm for social spam detection

Social spam detection is an evolving problem: the underlying dataset and the extracted features change constantly. As new types of social spam emerge, classifiers must be retrained on existing and new training examples. Such retraining is time-consuming, and keeping all the available training data wastes a large amount of storage space [35]. Consequently, traditional machine learning techniques without retraining may not be effective for detecting social spam on SMSs. A different paradigm, incremental learning, might be a better solution: a classification model is refined with new training examples rather than retrained on the entire dataset. There are several popular incremental learning algorithms, such as candidate elimination [36, 37], Cobweb [38], ID5 [39], and ILA [40]. Furthermore, some traditional algorithms, such as SVM [41] and logistic regression [42], can be extended into incremental classification models to meet the requirements of social spam detection.

2.4 Research gap

Previous studies use various features (e.g., user-, text-, graph-, and social network-related attributes) and classification algorithms (e.g., Naïve Bayesian and Bayesian network) to design frameworks for detecting social spam on SMSs (e.g., Facebook, Twitter, Sina Weibo, Myspace, YouTube, and Flickr). However, to the best of our knowledge, none of the previous work reported in the literature has exploited both probabilistic topic modeling and incremental learning to detect social spam on SMSs. Thus, in our work, we integrate word-, topic-, and user-based features and apply L-LDA and incremental learning to enhance the performance of social spam detection. We also conduct a rigorous experiment on a dataset of YouTube comments to test the effectiveness of the proposed framework for detecting social spam.

3 The proposed methodology for social spam detection

In this section, we illustrate the proposed methodology for social spam detection on SMSs (see Fig. 2). The methodology leverages three types of features, namely word-, topic-, and user-based features. A χ2 test and the tf-idf scheme are adopted for feature extraction and selection of the word-based features. Discriminative words with high χ2 scores are regarded as topic labels and form part of the input to the L-LDA model, which is then applied to extract latent topics from the user-contributed comments (treated as documents). The normalized topic frequencies are subsequently extracted as the proposed topic-based features. User behavior analysis is conducted to compute each user's average time interval of posting (ATI) comments and the average similarity (AS) between two adjacent comments, which serve as the proposed user-based features. Finally, a series of incremental classifiers is built to classify comments as spam or ham (i.e., legitimate user comments).

Fig. 2 The proposed methodology for social spam detection

3.1 Feature engineering

Feature engineering is crucial to the performance of a classification task; therefore, we carefully construct our feature set (see Table 1).

Table 1 Proposed feature set

3.1.1 Word-based features

In the field of information retrieval, researchers attempt to identify effective means of representing documents from a collection and of efficiently processing large collections. Accordingly, we adopt the well-known vector space model as the foundation for several text processing tasks such as document clustering and classification. Meanwhile, we exploit the internal statistical information of user comments as much as possible, because this information is crucial for tasks such as clustering, classification, information retrieval, and summarization (see Fig. 2).

To meet the aforementioned requirements, the popular tf-idf scheme [43] is adopted for document representation. The tf-idf scheme has some appealing properties: it offers a simple representation of documents by computing the weights of the words appearing in a document, and, as a typical feature extraction method in text categorization [44, 45], it produces features that are discriminative and powerful for many classifiers, such as KNN, SVM, and Rocchio [46, 47]. Therefore, we apply the tf-idf weighting scheme as modified by Singhal et al. [48] as the basis for representing word-level features:

$$w_{i,d} = \frac{1 + \ln\left[1 + \ln\left(tf_{i,d}\right)\right]}{1 - b + b \times \frac{|d|}{avdl}} \times \log\frac{N + 1}{df_{i}}$$
(1)

where b is set to a default of 0.20, and \(tf_{i,d}\) is the term frequency of term i in document d. |d| is the document length; avdl is the average document length across the corpus. \(df_{i}\) is the number of documents that contain term i, and N is the size of the corpus.
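For illustration, the following minimal Python sketch computes the pivoted-normalization weights of Eq. (1) over a tokenized corpus; the function and variable names are our own and not taken from the implementation used in the experiments.

```python
import math
from collections import Counter

def pivoted_tfidf(docs, b=0.20):
    """Sketch of Eq. (1): pivoted-normalization tf-idf weights.

    docs: list of tokenized documents (lists of terms).
    Returns one {term: weight} dict per document.
    """
    N = len(docs)
    avdl = sum(len(d) for d in docs) / N               # average document length
    df = Counter(t for d in docs for t in set(d))      # document frequencies
    weights = []
    for d in docs:
        norm = 1 - b + b * (len(d) / avdl)             # pivoted length normalization
        w = {t: (1 + math.log(1 + math.log(f))) / norm * math.log((N + 1) / df[t])
             for t, f in Counter(d).items()}
        weights.append(w)
    return weights
```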

Although the tf-idf scheme has been successfully applied to document representation, a huge corpus vocabulary tends to produce many sparse vectors in the Big Data environment. These sparse vectors incur unnecessary computational costs and may hamper classification accuracy because of the large number of missing values across the high-dimensional feature space. By utilizing an effective feature selection method to eliminate uninformative words (features), we can improve the computational efficiency and potentially the classification performance. Accordingly, we apply the χ2 test to conduct feature selection on the word set.

The χ2 test is one of the most widely used metrics for feature selection in text classification [49–51]. It is used as "(1) a goodness-of-fit test between a group of data and a specific probability distribution, or (2) a test for the degree of dependence or association between two factors or variables" [52]. The χ2 test is grounded in information theory, which "tries to capture the intuition that the best terms for the class c are the ones distributed most differently in the sets of positive and negative examples of c" [53, 54]. After running a χ2 test, a χ2 score is computed for each word; this score represents the degree of association or dependency between the word and the spam class. The higher a word's χ2 score, the more discriminative the word is. Since the corpus vocabulary is very large, we select only the most distinctive words according to the χ2 score to reduce the dimensionality of the feature space, which is one way to alleviate the computational costs of Big Data.
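As an illustration, the sketch below scores each word against the spam class using the standard 2 × 2 contingency form of the χ2 statistic; the names and the binary-label convention (1 = spam, 0 = ham) are our own assumptions.

```python
from collections import Counter

def chi2_scores(docs, labels):
    """Per-word chi-square score against the spam class (2x2 contingency).

    docs: tokenized documents; labels: 1 for spam, 0 for ham.
    """
    N, n_spam = len(docs), sum(labels)
    df_all, df_spam = Counter(), Counter()
    for d, y in zip(docs, labels):
        for t in set(d):
            df_all[t] += 1
            df_spam[t] += y
    scores = {}
    for t, n_t in df_all.items():
        A = df_spam[t]              # spam documents containing t
        B = n_t - A                 # ham documents containing t
        C = n_spam - A              # spam documents without t
        D = N - n_spam - B          # ham documents without t
        denom = (A + B) * (C + D) * (A + C) * (B + D)
        scores[t] = N * (A * D - B * C) ** 2 / denom if denom else 0.0
    return scores

# keep only the most discriminative words, e.g. for the topic labels:
# top_k_words = sorted(scores, key=scores.get, reverse=True)[:k]
```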

3.1.2 Topic-based features

The proposed feature selection method can overcome some of the shortcomings of the tf-idf document representation, such as very sparse document vectors. However, it fails to reveal the intra-document statistical structure. The "bag-of-words" assumption views and decomposes documents at a micro level, which implies that the semantic information among words may be lost. Therefore, a macro-level view of the documents is also needed to address this problem.

Systemic functional linguistic theory (SFLT) is a mechanism for representing texts; in it, language has three meta-functions, namely ideational, interpersonal, and textual [55]. The ideational meta-function represents a theory of human experience that pertains to aspects of the "mental world", including attitudes and desires [56], and can be applied to several types of information such as topics, emotions, and opinions [57]. When the representation of texts reaches this level, the semantic information among words can be extensively retained. Blei et al. [21] made a noteworthy improvement in this area by proposing one of the earliest probabilistic topic models, namely LDA, which can identify how certain topic patterns are mixed in a document. LDA assumes that the entire corpus has k topics and that the content of each document focuses on these k topics. A document is regarded as a mixture of topics with different probabilities, and a topic is a distribution over words based on their probabilities of occurrence (see Fig. 3). \(\theta_{d,z}\) denotes the probability that topic z occurs in document d, and \(\beta_{z,w}\) denotes the probability that word w occurs in topic z.

Fig. 3 An example of the hierarchical organization of a document

L-LDA is a supervised variant of the LDA probabilistic graphical model [20]. It differs from LDA in that it can learn the latent topics in the documents of a training set based on given topic labels, and predict the occurrences of the defined topics in previously unseen documents of the test set. Assuming a total of k topics in the corpus, for each document d a topic label \(\Lambda_{d} = (l_{1}, l_{2}, \ldots, l_{k})\) is generated in a certain manner, where each \(l_{k} \in \{0, 1\}\) indicates whether the document is related to the kth topic. Changing the value of the topic label for each document alters the relevance between the document and the topics of interest; therefore, the topics obtained from running L-LDA are guaranteed to be relevant to the documents. To help interpret the topic model, each topic label can be regarded as the centroid of a topic cluster, and L-LDA automatically clusters words around these centroids to form interpretable topics (see Fig. 4). LDA and L-LDA differ in the topic mixture distribution θ, which in L-LDA is determined by both the binary topic presence indicators \(\Lambda_{d}\) and the topic prior α: \(\Lambda_{d}\) filters the topics sampled from a Dirichlet distribution over α and ensures that the learned topics can be mapped to the topic labels.

Fig. 4 Graphical model of L-LDA

The L-LDA model characterizes a document as a mixture of topics, where each topic has its own probability and word distribution. Hence, two distributions exist, a document-topic distribution and a topic-word distribution, and both are assumed to have Dirichlet priors. For every document \(d \in D\), the distribution \(\theta_{d}\) over the topic set Z is sampled from Dir(α). For each topic \(z \in Z\), the distribution \(\beta_{z}\) over the vocabulary set V is sampled from Dir(η). Hence, for the nth word in document d, denoted \(w_{d,n}\), the topic assignment \(z_{d,n}\) can be iteratively computed with Gibbs sampling [58]. After obtaining the topic assignments z, the document-topic distribution \(\theta_{d,z}\) and the topic-word distribution \(\beta_{z,w}\) are estimated as follows:

$$\theta_{d,z} = \frac{N_{d,z} + \alpha}{\sum_{z' = 1}^{|Z|} N_{d,z'} + |Z|\alpha}$$
(2)
$$\beta_{z,w} = \frac{N_{z,w} + \eta}{\sum_{w' = 1}^{|V|} N_{z,w'} + |V|\eta}$$
(3)

where \(N_{d,z}\) represents the number of words assigned to topic z in document d, and \(N_{z,w}\) represents the frequency of word w assigned to topic z in the corpus.
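Given the count matrices produced by Gibbs sampling, Eqs. (2) and (3) reduce to simple smoothed normalizations, as in the NumPy-based sketch below; argument names are illustrative.

```python
import numpy as np

def estimate_distributions(N_dz, N_zw, alpha, eta):
    """Smoothed point estimates of Eqs. (2)-(3).

    N_dz: |D| x |Z| counts of words per document assigned to each topic.
    N_zw: |Z| x |V| counts of each word assigned to each topic.
    """
    Z, V = N_dz.shape[1], N_zw.shape[1]
    theta = (N_dz + alpha) / (N_dz.sum(axis=1, keepdims=True) + Z * alpha)  # Eq. (2)
    beta = (N_zw + eta) / (N_zw.sum(axis=1, keepdims=True) + V * eta)       # Eq. (3)
    return theta, beta
```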

However, generating the topic label \(\Lambda_{d}\) for each document and defining the topics of interest remain challenging, which is why L-LDA has not been widely used in prior research. We believe some topics should differentiate spam from ham in our corpus; that is, topics should have a certain level of discriminative power. The χ2 score can be treated as a measure of the discriminative power of a word, so a document's topic label is generated based on the occurrence of the most discriminative words in that document. For document d, each \(l_{n}\) in its topic label \(\Lambda_{d} = (l_{1}, l_{2}, \ldots, l_{k})\) is given by

$$l_{n} = \begin{cases} 1, & \exists\, i \in d:\ Rank(\chi_{i}^{2}) = n \le k \\ 0, & otherwise \end{cases}$$
(4)

where \(Rank(\chi_{i}^{2})\) denotes the rank of the χ2 score of word i over the corpus vocabulary. We can assume that the latent topics generated by L-LDA tend to be as discriminative as their centroids for spam detection.
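A minimal sketch of the labeling rule in Eq. (4) follows: a document's label vector has a 1 in position n exactly when the word globally ranked n by χ2 (with n ≤ k) occurs in that document. The `rank` mapping is assumed to be derived from the χ2 scores computed earlier.

```python
def topic_label(doc, rank, k):
    """Eq. (4): binary topic-label vector for one tokenized document.

    rank: {word: 1-based chi-square rank over the corpus vocabulary}.
    """
    present = {rank[t] for t in doc if t in rank and rank[t] <= k}
    return [1 if n in present else 0 for n in range(1, k + 1)]

# e.g. rank = {w: i + 1 for i, w in enumerate(top_k_words)}
```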

Taking the discriminative words as topic labels for each document, a multi-labeled corpus can be generated as input to the L-LDA model. The model then generates a topic assignment \(z_{d,n}\) for each word in document d, which yields \(N_{d,z}\), the frequency of topic z in document d. We select the normalized topic frequency as the feature for spam detection. For document d, the normalized frequency of topic z is computed as follows:

$$t_{d,z} = \tau + \left(1 - \tau\right)\frac{N_{d,z}}{\max_{z' \in Z} N_{d,z'}}$$
(5)

where τ is a value between 0 and 1. For document d, the topic-based feature vector is \(T_{d} = [t_{d,1}, t_{d,2}, \ldots, t_{d,k}]\).
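Eq. (5) then turns a document's topic counts into its topic-based feature vector, as sketched below; the default value of τ is our own assumption, since the exact setting is not stated.

```python
def topic_features(n_dz_row, tau=0.5):
    """Eq. (5): normalized topic frequencies for one document.

    n_dz_row: list of per-topic word counts N_{d,z} for document d.
    """
    m = max(n_dz_row)
    if m == 0:                         # no topic assignments: flat vector
        return [tau] * len(n_dz_row)
    return [tau + (1 - tau) * n / m for n in n_dz_row]
```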

3.1.3 User-based features

Identifying spammers can also contribute to detecting social spam, since the majority of spammers share certain behavior patterns. Recent studies have found that spammers tend to produce deceptive contents within relatively short time intervals, and that their contents tend to be similar [28, 59–61]. Spam accounts exhibit a "bursty" property: spam comments can be densely populated within a short period of time [60]. One explanation for this phenomenon is that spam messages are automatically created via bot accounts that repost similar contents. Building on this observation, Gao et al. [61] extracted the absolute time interval between consecutive wall events to detect social spam on Facebook, and Chen et al. [62] found that the average interval between posts achieves high precision in detecting sock puppets on Chinese news websites. Content similarity has also been found effective in capturing fake reviewer groups on Amazon [63] and spammers on Twitter [23, 64, 65].

Accordingly, we take the ATI and the AS of two adjacent comments posted by the same user as our user-based features. For a user u who posts \(N_{u}\) comments, the ATI of user u is computed as follows:

$${\text{ATI}}(u) = \frac{\sum_{k = 2}^{N_{u}} (t_{u,k} - t_{u,k - 1})}{N_{u}}$$
(6)

where \(t_{u,k}\) represents the posting time of the kth comment of user u. Similarly, the AS of user u is defined as follows:

$${\text{AS}}(u) = \frac{\sum_{k = 2}^{N_{u}} sim(c_{u,k}, c_{u,k - 1})}{N_{u}}$$
(7)

where \(c_{u,k}\) is the kth comment of user u and sim() is a similarity function between two comments. We use an approximate string matching algorithm that identifies the smallest number of edits required to change one string into another [66, 67].
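The two user-based features can be computed directly from a user's time-ordered comment history, as in the minimal sketch below. We use difflib's similarity ratio as a stand-in for the edit-distance-based sim() of [66, 67]; both functions keep the \(N_{u}\) denominator of Eqs. (6) and (7).

```python
from difflib import SequenceMatcher

def ati(post_times):
    """Eq. (6): average time interval between a user's consecutive comments.

    post_times: time-sorted posting timestamps (e.g., seconds since epoch).
    """
    n = len(post_times)
    return sum(b - a for a, b in zip(post_times, post_times[1:])) / n

def avg_similarity(comments):
    """Eq. (7): average similarity of a user's adjacent comments.

    difflib's ratio stands in for the edit-distance-based sim().
    """
    n = len(comments)
    return sum(SequenceMatcher(None, a, b).ratio()
               for a, b in zip(comments, comments[1:])) / n
```

Both functions assume at least two comments per user, which holds after preprocessing (users with six or more comments are retained; see section five).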

3.2 Incremental learning

Incremental learning, also called online learning, is a popular machine learning method that allows classifiers to learn newly emerging training instances without going through the whole training set. It does not require a large number of training examples at the outset, and the performance of a classifier can continuously improve as new training instances are learned. Moreover, when the learning target keeps evolving in a dynamic environment, an incremental learning approach can better adapt to the changes and capture new trends. It is therefore particularly suitable for complex or evolving tasks such as fraud detection [68] and spam detection [69, 70].

Incremental learning is performed sequentially: at round t, the learner is given an instance \(x_{t}\) from the dataset and predicts an outcome \(p_{t}\), a machine-generated spam indicator for \(x_{t}\). Once the prediction is made, the learner obtains feedback by computing the loss function \(l(p_{t}, y_{t})\), which measures the discrepancy between \(p_{t}\) and the manual spam label \(y_{t}\). Based on this loss, the classification hyperplane \(w_{t}\) is updated (see Fig. 5). It is worth emphasizing that when the classifier makes a prediction at the tth trial, it uses only the instance \(x_{t}\) to adjust the classifier obtained in the previous trials. This makes it possible for the classifier to evolve together with new batches of data instances. The updating rules for the hyperplane \(w_{t}\) differ across algorithms.

Fig. 5 An incremental learning paradigm

In our work, we adopt four incremental algorithms to accomplish the learning task: SVM [41], the classical perceptron [71], the relaxed online maximum margin algorithm (ROMMA) [72], and logistic regression [42]. For the perceptron, SVM, and logistic regression, the updating rule is based mainly on stochastic gradient descent (SGD) [73, 74]. SGD has several advantages over the batch gradient descent (GD) method: it uses only one training example to estimate the gradient of the target function in each iteration and thus converges toward the optimum faster, which is crucial for a large training dataset. For ROMMA and passive-aggressive perceptron algorithms [75], the updating rule is to obtain a hyperplane that correctly classifies the previously seen examples with a maximum margin, again using one example at a time. In both cases, updating on one example at a time makes incremental learning better suited to this task than traditional batch algorithms: it can easily capture spammers' new spamming patterns and tendencies as social media evolves, it allows the proposed detection method to adapt continuously to spammers' changing behavior, and it alleviates the problem of training a classifier on an extremely large dataset in the Big Data environment.
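The sketch below illustrates the round-by-round protocol of Fig. 5 with an SGD update for logistic regression, one of the four incremental learners; the learning rate and the dense-vector representation are simplifying assumptions of ours.

```python
import math

def online_logistic(stream, dim, lr=0.1):
    """One-pass incremental learning: predict, receive label, update.

    stream: iterable of (x_t, y_t) with x_t a dense feature list and
    y_t in {0, 1} (the manual spam label).
    """
    w = [0.0] * dim
    for x_t, y_t in stream:
        score = sum(wi * xi for wi, xi in zip(w, x_t))
        score = max(-30.0, min(30.0, score))       # clamp to avoid overflow
        p_t = 1.0 / (1.0 + math.exp(-score))       # machine-generated spam indicator
        g = p_t - y_t                              # gradient of the log loss l(p_t, y_t)
        w = [wi - lr * g * xi for wi, xi in zip(w, x_t)]   # update using x_t only
        yield p_t
```

Note that the hyperplane is adjusted from one instance at a time, so the model never revisits the full training set, which is the property the text above relies on.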

To extract topic-based features for new data, we adopt the "folding-in" heuristic proposed by Hofmann [76], which is in line with the core idea of incremental learning. For each new comment, we first assign a random topic to each word while keeping the topic assignments of the old corpus unchanged. We then run Gibbs sampling on this new comment alone and update the document-topic distribution \(\theta_{d,z}\) and the topic-word distribution \(\beta_{z,w}\) accordingly. In this way, we can generate topic-based features for new data promptly and update the L-LDA model without retraining it from scratch. Thus, the proposed framework operates under the core logic of incremental learning.
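A minimal sketch of this folding-in step follows, under the assumption that an unseen comment carries no label, so all k topics are candidates; the sampling weights follow the usual collapsed-Gibbs conditional with the trained counts held fixed.

```python
import random

def fold_in(new_doc, N_zw, N_z, alpha, eta, n_iter=50):
    """Gibbs-sample topic assignments for one new comment only.

    new_doc: word ids; N_zw (|Z| x |V|) and N_z (|Z|) are the trained
    model's counts and stay fixed apart from this document's own counts.
    Returns the per-topic counts n_dz, which feed Eq. (5).
    """
    Z, V = len(N_z), len(N_zw[0])
    z_assign = [random.randrange(Z) for _ in new_doc]   # random initial topics
    n_dz = [0] * Z
    for z in z_assign:
        n_dz[z] += 1
    for _ in range(n_iter):
        for i, w in enumerate(new_doc):
            n_dz[z_assign[i]] -= 1                      # remove current assignment
            weights = [(n_dz[z] + alpha) *
                       (N_zw[z][w] + eta) / (N_z[z] + V * eta)
                       for z in range(Z)]
            z_assign[i] = random.choices(range(Z), weights=weights)[0]
            n_dz[z_assign[i]] += 1
    return n_dz
```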

4 Research hypotheses

Little research has investigated how the aforementioned features affect classification performance. The feature set adopted in our framework includes three types: word-, topic-, and user-based features. Therefore, we present hypotheses to evaluate these features' influence on classification performance measured in terms of accuracy, precision, and recall. The overall accuracy measures the percentage of correctly classified spam and ham. Precision measures the percentage of correctly classified spam out of the spam set identified by the classifier. Recall assesses the detection rate of the spam class, that is, the percentage of correctly classified spam out of the true spam set.

The topic-based features, which are obtained via the L-LDA model, can reveal the latent semantic structure of each social spam. We believe that adding this type of feature can increase the capability of the classifiers to detect spam based on the inherent semantics of spam contents. Therefore, we construct the following hypotheses:

H1a

Incorporating topic-based features into the original feature set improves the accuracy of classifiers.

H1b

Incorporating topic-based features into the original feature set improves the precision of classifiers.

H1c

Incorporating topic-based features into the original feature set improves the recall of classifiers.

Here, the original feature set refers to the set without the aforementioned features. The user-based features reflect the behavior of spammers to some extent, and this type of feature is objective. Thus, user-based features may enhance the performance of the classifiers. Hence, we hypothesize:

H2a

Incorporating user-based features into the original feature set improves the accuracy of classifiers.

H2b

Incorporating user-based features into the original feature set improves the precision of classifiers.

H2c

Incorporating user-based features into the original feature set improves the recall of classifiers.

The information provided by user-based features is typically external to the user comments, whereas topic-based features are internal to the user comments. Internal features might be more effective in social spam detection given the definition of social spam as low-quality information that users do not ask for or specifically subscribe to on SMSs. This definition implies that social spam is best characterized at the content level. Hence, the topic-based features should outperform the user-based features. That is:

H3a

Compared with user-based features, the topic-based features will help the classifier achieve better accuracy.

H3b

Compared with user-based features, the topic-based features will help the classifier achieve better precision.

H3c

Compared with user-based features, the topic-based features will help the classifier achieve better recall.

5 Experimental evaluation

5.1 Dataset

The dataset for experimental evaluation consists of millions of YouTube comments on the most popular video clips. The comments were labeled as spam or not, either by YouTube's official spam filter or manually via the "Flag for spam" button available above each comment posted on a video page. We extracted the features described above to classify the labeled comments into spam or ham using incremental classifiers. Data collection began on October 31, 2011 [77, 78] (see Table 2).

Table 2 Properties of the dataset

The most appealing feature of the dataset is that the labels were created by the YouTube audience while browsing the comments; YouTube's spam filter or administrative staff would then verify the reliability of these labels. Although some spam comments were not tagged as spam because manual verification of such a large volume of comments is infeasible, these errors are acceptable given the amount of traffic on YouTube every day. Moreover, besides the label and the comment, each row of the dataset also includes the posting time, video number, and user ID (see Table 3). Spam is regarded as low-quality information that the user has not subscribed to, such as adversarial content for online shopping sites or phishing sites.

Table 3 Spam examples in the dataset

5.2 Experimental design

5.2.1 Pre-processing

YouTube is a popular video streaming website on which people from different countries share video clips and post comments. According to the New York Times, YouTube is the second most searched website in the world [79], and approximately 100 h of video are uploaded every minute [80]. User comments are written in multiple languages, and our focus is on detecting spam written in English. Hence, the first step in the pre-processing stage was to filter out non-English comments, which left a total of 3,492,590 English comments, including 304,092 labeled spam and 3,188,498 ham.

Next, comments from users whose total number of posted comments fell below a threshold were removed from the dataset. The extracted features include comparisons of two adjacent comments by the same user, which to some degree reflect the long-term activity of a user; if a user posted too few comments, the discriminative power of the corresponding features tends to be low. Moreover, data preprocessing is a way to identify task-relevant data and to reduce noisy, low-quality data [81]. For user group Un (the group of users who each posted exactly n comments), a small n provides less information than a larger one, so we regard such groups as low-quality data to be eliminated during preprocessing. Table 4 shows the dataset statistics before and after preprocessing. To use only high-quality data, we retained the comments generated by user group U6+, in which each user posted at least six comments. The cleansed dataset contains 78,965 users and 6240 videos.

Table 4 Dataset statistics before and after preprocessing

The training and test sets were separated because we adopt a machine learning method for classifying spam. The split follows the time sequence of each user's comments: based on the timestamp of each comment, the first two-thirds of a user's comments form the training set and the remaining one-third forms the test set. In other words, the test set incorporates a time lag behind the training set. Under this setting, it is convenient to measure how well the proposed framework predicts spam comments in a subsequent period.
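A minimal sketch of this time-based split follows; the field layout and names are illustrative.

```python
def split_by_time(user_comments, train_frac=2 / 3):
    """Per-user chronological split: first two-thirds train, rest test.

    user_comments: {user_id: [(timestamp, comment, label), ...]}.
    """
    train, test = [], []
    for user, items in user_comments.items():
        items = sorted(items, key=lambda it: it[0])    # order by posting time
        cut = int(len(items) * train_frac)
        train += [(user,) + it for it in items[:cut]]
        test += [(user,) + it for it in items[cut:]]
    return train, test
```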

Tokenization and stemming were also performed before generating the feature set. Stemming, one of the most common techniques in information retrieval, eliminates the basic variations of words; in linguistic morphology, it is defined as the process of reducing inflected (or sometimes derived) words to their stem, base, or root form. The efficiency of a content-based spam filter can be significantly improved by stemming [82], which effectively eliminates the influence of different forms of the same word when word statistics are produced.

5.2.2 Feature extraction and selection

As previously mentioned, our feature set contains three types of features: word-, topic-, and user-based. For the word-based features, we adopt the tf-idf scheme to calculate the weight of each word in each comment. For the user-based features, we compute the ATI and AS of each user. The topic-based features are slightly more complicated because the topic labels are generated via the χ2 test.

However, the cardinality of the word set after the χ2 test is 211,278, and the set contains many misspelled words that appear fewer than 10 times among the one million comments. If every word were selected as a feature, the feature vector of each comment would be very sparse, which would hurt both efficiency and detection performance. Thus, we set a minimum corpus frequency of 50 for each word and removed any word below that minimum from the feature set, which decreased the vocabulary to 7921 words.
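This frequency cut-off amounts to a single pass over the corpus, as sketched below; the threshold of 50 is the one used in our experiments.

```python
from collections import Counter

def prune_vocabulary(docs, min_freq=50):
    """Drop words occurring fewer than min_freq times in the corpus."""
    freq = Counter(t for d in docs for t in d)
    keep = {t for t, c in freq.items() if c >= min_freq}
    return [[t for t in d if t in keep] for d in docs]
```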

If the number of latent topics to be learned via L-LDA is set to k, the top k discriminative words generated by the χ2 test are selected as topic labels. After applying L-LDA to analyze the latent topic structure of each comment, the normalized topic frequency of each comment is obtained. Including the user-based features ATI and AS, we obtain a total of k + 7923 features in our feature set.

5.2.3 Classification

We utilized sofia-ml, a toolbox from Google, to implement the incremental SVM, perceptron, ROMMA, and logistic regression algorithms [83]. Four feature combinations were created to assess the contribution of each feature type: (1) word features only (W); (2) word + user features (WU); (3) word + topic features (WT); and (4) word + topic + user features (WTU). We measured how the performance of our methodology improved as features were added. Since the detection results are affected not only by the feature-set combinations but also by the number of latent topics, the number of latent topics was set to k = 6, 10, 20, 50, 80, and 100. In total, besides the W and WU groups, 2 × 6 experiment groups were established.

5.2.4 Evaluation metrics

The standard metrics used to evaluate the performance of the experimental models were precision (PRE), accuracy (ACC), recall (REC), F1-measure (F1), and the receiver operating characteristic (ROC) curve together with the area under the ROC curve (AUC). We introduced statistical significance testing to compare the performance of classifiers and avoid stochastic fluctuations in the evaluation: paired t-tests were adopted to test the statistical differences among the performance scores achieved via the cross-validation procedure.
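For reference, a sketch of the basic metric definitions and the paired t-test (via SciPy's `ttest_rel`) follows, under the convention that spam is the positive class.

```python
from scipy.stats import ttest_rel

def metrics(tp, fp, fn, tn):
    """PRE, REC, ACC, and F1 with spam as the positive class."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    acc = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * pre * rec / (pre + rec)
    return pre, rec, acc, f1

def compare(scores_a, scores_b):
    """Paired t-test over matched per-fold performance scores."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return t_stat, p_value
```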

6 Experimental results and analysis

6.1 Overall performance

We trained and tested 14 groups of classifiers using incremental learning methods and measured the overall performance of each group with the five basic performance metrics mentioned above (see Tables 5, 6, 7, and 8). For the WT and WTU groups, detection performance was better when the topic quantity was k = 10 or 20. In terms of accuracy, precision, and recall, the groups containing topic-based features performed better than those using only word- or user-based features.

Table 5 Performance (%) of the W group
Table 6 Performance (%) of the WU group
Table 7 Performance (%) of the WT group
Table 8 Performance (%) of the WTU group

Groups with topic-based features often achieved higher precision because topic-based features reveal a comment's latent semantics, which the classifier can exploit to distinguish spam from ham. User-based features, in contrast, are external and sometimes unreliable, because spammers can deliberately control and alter their behavior (e.g., writing spam in different ways to reduce AS). However, although the way a spam message is generated may change, the inherent connection between spam topics and spam comments remains; the topic model can identify a new spam topic by exploiting the inherent semantics of spam contents and the previously trained topic-word distribution.

We found that SVM classifiers outperformed the other classifiers in most cases, and logistic regression came closest to SVM, ahead of the perceptron and ROMMA (see Fig. 6). SVM performed best in the WTU group with topic quantity k = 10, achieving an accuracy of 91.17 % and an F1-measure of 78.43 %. Comparing all SVM classifiers in the WTU group reveals that k = 10 is the optimal topic quantity (see Fig. 7): detection performance tended to improve as the topic quantity increased from the minimum and decreased slightly as it approached the maximum. Fixing k = 10, the ROC curves in Fig. 8 show that the WTU group outperforms the other groups. Moreover, classifiers with and without topic-based features perform very differently: the W and WU groups resemble each other, as do the WT and WTU groups, but the two pairs are dissimilar.

Fig. 6 A comparison of classification performance with topic quantity k = 10

Fig. 7 ROC curves for SVM classifiers with the WTU feature set

Fig. 8 ROC curves for SVM classifiers with topic quantity k = 10

6.2 Hypothesis testing

For hypothesis testing, we used stratified sampling [84] to divide the test dataset into 20 subsets and performed classification on each subset. Stratified sampling ensures that the spam-to-ham ratio in these subsets is consistent with that of the training set. Paired t-tests were then performed on the results from the 20 subsets to evaluate the performance of classifiers with various combinations of feature sets and thereby verify the hypotheses set out in section four (see Table 9).

Table 9 P-values for paired t-tests against ACC, PRE, and REC

In terms of accuracy, the results for all pairs are significant. H1a and H3a are supported at p < 0.001, whereas H2a is only marginally supported because ACC(WTU) > ACC(WT) is significant only at p < 0.10. Comparing the two groups related to H3a, we conclude that once topic-based features are introduced, the effect of user-based features on accuracy is weakened. Topic-based features are thus more effective than user-based features in improving detection accuracy, as stated in H3a.

For precision, only the result for PRE(WTU) > PRE(WT) is non-significant, whereas all the others are significant. Thus, H2b is only partly supported, whereas H1b and H3b are supported at p < 0.001. Although PRE(WTU) > PRE(WT) is non-significant, the opposite direction, PRE(WTU) < PRE(WT), is significant at p < 0.001. This can be regarded as evidence that user-based features are sometimes unreliable, because spammers may deliberately alter their behavior, which can undermine the contribution of topic-based features when the two feature sets are combined.

For recall, REC(WU) > REC(W) and REC(WTU) > REC(WT) are both supported at p < 0.001, which supports H2c: user-based features improve the recall of classifiers. H1c and H3c are rejected, with the opposite directions supported instead. The results indicate that user-based features can improve recall to some extent, at the cost of precision: classifiers with user-based features are more inclined to judge a comment as spam, which decreases precision and increases recall. Misclassification may occur partly because the spam labels in the dataset are content-level judgments provided by YouTube users. Another reason might be that spammers use their accounts to post legitimate comments as well as spam, so some legitimate comments are judged as spam because of the suspicious users who posted them. A further reason why H1c and H3c are supported in the opposite direction is that the unbalanced proportion of spam to ham can produce large numbers of previously unseen legitimate words; these words are more likely to receive a ham topic assignment from the topic model because they are usually surrounded by legitimate words. The topic model can still identify a new example of spam based on the previously learned topic-word distribution and the connections between spam topics.

In sum, the groups containing topic-based features outperform those with only word- or user-based features in terms of accuracy and precision, whereas the groups with user-based features tend to detect more spam and achieve higher recall. Our experimental results offer considerable insights to social marketers seeking a robust strategy for filtering spam on SMSs. Marketers who want their SMSs to be as free of spam as possible, and who prefer a higher detection rate, should rely more on user-based features; in contrast, marketers who want to reduce the chance of misclassifying legitimate users as spammers should rely more on topic-based features.

6.3 Feature analysis

To examine the discriminative power of the proposed feature sets, the Fisher score [85] is applied. The Fisher score measures the degree to which an independent feature can distinguish two classes, although it ignores the mutual information between features. The features with the highest Fisher scores are AS and the discriminative words with high χ2 scores (the feature descriptions follow the χ2 score ranking); ATI and several topic-based features are likewise in the top tier (see Table 10). The extremely high score of AS suggests that it is very common for spammers to keep posting near-duplicate comments on YouTube.

Table 10 F-scores of discriminative features
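As an illustration of how these scores can be computed, the sketch below uses a common two-class form of the Fisher score (between-class mean separation over within-class variance, per feature); this is one standard formulation and not necessarily the exact variant of [85].

```python
import numpy as np

def fisher_scores(X, y):
    """Two-class Fisher score per feature.

    X: (instances x features) array; y: 0/1 labels (1 = spam).
    """
    X0, X1 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X0.mean(axis=0)) ** 2      # between-class separation
    den = X1.var(axis=0) + X0.var(axis=0)               # within-class variance
    return num / np.where(den == 0, np.finfo(float).eps, den)
```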

To further analyze these features, we plot the cumulative distribution function (CDF) of each selected feature (see Fig. 9). All the selected word-based features (W1–W8, W10) are spam-related: the larger their values, the more likely the corresponding comments are spam. This is understandable because they are typical spam-related words according to the spam examples highlighted earlier (see Table 3). AS, T7, and T8 are likewise spam-related, whereas ATI, T3, and T6 are ham-related. For the topic-based features, we list words with relatively high probabilities under the selected topics to verify this claim: the words under T3 are quite neutral, and those under T6 seem common in chatting or gossiping, which also tends to be irrelevant to spam (see Table 11). The words under T7 and T8 are likely to appear in adversarial messages for online shops; the names of celebrities, such as Jordan and Kobe, are captured by T7 and T8 because many promoters leverage celebrities to attract attention. A POS-tag analysis of the high-probability words under the selected topics in Table 11 shows that around 16 % of the words under T6 are interjections, supporting the view that T6 is about online chatting or gossiping. We also observed that, compared with T3 and T6, T7 and T8 contain on average around 20 % more common nouns (NN), 12 % more comparative (JJR) and superlative (JJS) adjectives, and 6 % more proper nouns (NNP). These findings are largely in line with Ott et al. [86], who assert that deceptive writing is usually done in exaggerated language. The POS-tag analysis yielded further findings, but they were not relevant to our study.

Fig. 9 CDFs of features with high Fisher scores

Table 11 Highly probable words of discriminative topics

According to our analysis, T7 and T8 are spam-related, which is consistent with the aforementioned explanations (see Fig. 9). Finally, by directly inspecting the high-probability words of a topic, it is straightforward to infer the spam class of that topic, which suggests that the topics extracted via L-LDA are highly interpretable.

7 Conclusions

With the ubiquity of the social web, there has been explosive growth in user-contributed comments, and with it a growing concern about the wide spread of social spam embedded in those comments. Given the large volume of user-contributed comments on SMSs, there is a pressing need to develop novel methodologies and techniques to tackle social spam.

Previous studies use various features (e.g., user-, text-, graph-, and social network-related attributes) and classification algorithms (e.g., Naïve Bayesian and Bayesian network) to design frameworks for detecting social spam on SMSs (e.g., Facebook, Twitter, Sina Weibo, Myspace, YouTube, and Flickr). However, to the best of our knowledge, previous studies have not exploited both probabilistic topic modeling and incremental learning for detecting social spam on SMSs. Thus, the main contributions of our research are the design and evaluation of a novel social spam detection methodology underpinned by the L-LDA model and incremental learning. More specifically, we exploit word-, topic-, and user-based features to better represent social spam, and we leverage incremental classifiers, namely SVM, logistic regression, perceptron, and ROMMA, to enhance spam detection performance. Based on several million user comments posted to YouTube, our experimental results show that the proposed methodology achieves an accuracy of 91.17 % and an F1-measure of 78.43 %. According to our paired t-tests, topic-based features improve the overall accuracy and precision but may hurt the recall of spam detection; conversely, user-based features enhance recall but may hurt precision.

The managerial implication of our research is that e-commerce managers can apply the proposed methodology to alleviate the interference of social spam, and hence discover more accurate business intelligence from the large volume of user-contributed comments posted to SMSs. Moreover, our methodology can improve the daily administration of SMSs and enhance the hygiene and efficiency of these sites in the era of Big Data. For social media marketers, the proposed methodology is a powerful tool for ensuring that their marketing campaigns are free from the interference of social spam. Finally, the proposed methodology helps users identify relevant information on SMSs and improves their loyalty to these sites.

In future research, we will refine the existing topic-based features because some latent topics mined using L-LDA may be similar to each other; further clustering of these latent topics should improve the quality of the topic-based features. We will also take the videos themselves into consideration to design more effective features for detecting spammers who post few comments, thereby addressing a limitation of our work. Additional features, such as spam diffusion patterns in a social network, will be explored to enhance social spam detection performance. Moreover, crowdsourcing will be explored to improve the quality of our evaluation dataset, because we found occasional mistakes in the "Spam Hint" judgments provided by YouTube. Furthermore, deep learning methods will be examined to learn other high-level features apart from topic-based features.