1 Introduction

The diffusion of information on social media is often supported by automated accounts, controlled totally or in part by computer algorithms and called bots. Unfortunately, a dominant and worrisome use of automated accounts is far from benign: malicious bots are purposely created to distribute spam, sponsor public figures and, ultimately, bias public opinion [13]. Their malicious activities are especially effective when directed at a targeted audience [4, 5], e.g., to generate misconceptions or encourage hate campaigns [7]. Recent work [24, 29] demonstrates that bots are particularly active in spreading low-credibility content and amplifying its significance. Moreover, human-operated accounts contribute to the diffusion of disinformation by, e.g., retweeting and/or liking fake content.

In a previous work [3], the authors shed light on so-called credulous Twitter users, using the term, with a harmless abuse of language, to refer to human-operated accounts with a high percentage of bots as friends. Unlike [3], where the analysis involved crawling the friends of a set of human-operated accounts - a highly time-consuming task - here we design and develop a classifier to identify credulous Twitter users based on features that do not take friendship with bots into account. Starting from a set of features commonly employed in the literature to detect bots [10, 26], we end up with a classifier that is lightweight in terms of the cost of gathering the data needed for the feature engineering phase. The classification performance is very encouraging: an accuracy of 93.27% and an AUC (Area Under the ROC Curve) of 0.93.

We believe that automatically detecting credulous users is a promising line of research. Such an investigation could help researchers to: 1. better understand the characteristics of users who are more polarized and/or more easily influenced; 2. unveil low-credibility and/or deceptive content and limit its online diffusion; 3. devise alternative bot detection strategies that concentrate the analysis on the friends of credulous users; 4. improve users' awareness of threats to data trustworthiness.

The following section presents our approach for the automatic detection of credulous Twitter users, while Sect. 3 presents the experimental results. Section 4 discusses the outcome and suggests further investigations. Section 5 presents related work in the area, discussing the differences, contributions and novelty of our work with respect to it. Section 6 concludes the paper.

2 The Approach

2.1 Datasets

We consider three publicly available datasets: CR15 [10], CR17 [12], and VR17 [26]. By merging these three datasets, we obtain a single labeled dataset (human-operated/bot) of 12,961 accounts - 7,165 bots and 5,796 humans. We use this dataset to train a bot detector, as described in Sect. 2.2. To this end, we use the Java Twitter API and, for each account, we collect: tweets (up to 3,200), mentions (up to 100), and IDs of friends and followers (up to 5,000 each).
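
For concreteness, the sketch below shows how the per-account crawling step could look, assuming Twitter4J as the underlying Java Twitter client (the paper does not name the library) and OAuth credentials configured in twitter4j.properties; the paging limits mirror those stated above.

```java
import twitter4j.*;

import java.util.ArrayList;
import java.util.List;

public class AccountCrawler {
    // Assumes OAuth credentials in twitter4j.properties.
    private final Twitter twitter = TwitterFactory.getSingleton();

    // Collect up to 3,200 tweets (Twitter's timeline cap), 200 per page.
    List<Status> collectTweets(String screenName) throws TwitterException {
        List<Status> tweets = new ArrayList<>();
        for (int page = 1; page <= 16; page++) { // 16 pages * 200 = 3,200
            List<Status> batch =
                twitter.getUserTimeline(screenName, new Paging(page, 200));
            if (batch.isEmpty()) break;
            tweets.addAll(batch);
        }
        return tweets;
    }

    // Collect up to 5,000 friend IDs (the first cursor page).
    long[] collectFriendIds(String screenName) throws TwitterException {
        return twitter.getFriendsIDs(screenName, -1).getIDs();
    }
}
```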

The identification of credulous users follows the approach presented in [3]. To this end, we need to determine the number of bots among the friends of the 5,796 human-operated accounts. Due to the rate limits of the Twitter APIs and the potentially huge number of friends of these accounts, we consider only accounts with at most 400 friends [3]. This leads to a dataset of 2,838 human-operated accounts, hereafter called Humans2Consider. By crawling the data related to their friends, we acquire information on 421,121 Twitter accounts overall.
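
The filter that yields Humans2Consider needs only the declared friend count from the profile; a minimal sketch, again assuming Twitter4J:

```java
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.User;

class FriendFilter {
    // Keep only accounts whose declared friend count is at most 400,
    // the threshold used to build the Humans2Consider dataset.
    static boolean isConsidered(Twitter twitter, String screenName)
            throws TwitterException {
        User user = twitter.showUser(screenName);
        return user.getFriendsCount() <= 400;
    }
}
```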

2.2 Bot Detection

A bot detection phase is required to discriminate bots from genuine accounts among the selected friends. The literature offers a plethora of successful approaches [13]; however, also due to the capability of evolved spambots to evade detection [12], the performance of the various techniques degrades over time [19]. Furthermore, some bot detectors are available online but not fully usable, due to restrictions in their terms of use, see, e.g., [8]. To overcome these issues, we design and develop a supervised approach that mixes features from popular scientific work with novel features introduced here.

Regarding the features, we consider two sets. The first derives from Botometer [26], a popular bot detector. In addition to the original Botometer features [26], we also include the CAP (Complete Automation Probability) score, the Scores, and the numbers of tweets and mentions; we call this augmented set Botometer+. The second feature set is inherited from [10], where a classifier was designed to detect fake Twitter followers. We use almost all of their ClassA features, except the one about duplicated pictures, because it was not possible for us to verify whether the same profile picture was used twice; we call this reduced set ClassA−. The union of the two sets is referred to in the following as ALL_features.
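
To give an idea of how cheap profile-level features are to compute, the sketch below derives a few illustrative features from a single Twitter4J User object; the exact feature list of [10] differs, so this is an approximation in the spirit of ClassA−, not the actual set.

```java
import twitter4j.User;

class ProfileFeatures {
    // Illustrative profile-level features in the spirit of ClassA-:
    // everything is computable from the profile alone, no timeline crawl.
    static double[] extract(User u) {
        String bio = u.getDescription();
        return new double[] {
            u.getFollowersCount(),
            u.getFriendsCount(),
            u.getStatusesCount(),
            u.getFavouritesCount(),
            u.isDefaultProfileImage() ? 1 : 0,      // still the default picture?
            (bio == null || bio.isEmpty()) ? 0 : 1, // has a biography?
            u.getURL() == null ? 0 : 1,             // has a profile URL?
            u.getFriendsCount() == 0
                ? 0 : (double) u.getFollowersCount() / u.getFriendsCount()
        };
    }
}
```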

We use 19 learning algorithms to train our classifier (with 10-fold cross-validation) and compare their classification capabilities over the three feature sets (Botometer+, ClassA− and ALL_features). The classification performance is evaluated according to: accuracy, precision, recall, F-measure (F1), and Area Under the ROC Curve (AUC). Hyper-parameter tuning is then performed on the most accurate classifier. The tuned classifier is used to label the friends of the Humans2Consider dataset (see Sect. 2.1).
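
Since the experiments run on Weka (see Sect. 3), the evaluation loop can be reproduced through Weka's Java API; a sketch with two of the 19 algorithms and a hypothetical ARFF export of one feature set:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BotDetectionEval {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file holding one feature set (e.g., ALL_features)
        // with a binary human/bot class as the last attribute.
        Instances data = DataSource.read("all_features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] algos = { new RandomForest(), new OneR() };
        for (Classifier c : algos) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold CV
            System.out.printf("%s: acc=%.2f%% F1=%.2f AUC=%.2f%n",
                    c.getClass().getSimpleName(),
                    eval.pctCorrect(),
                    eval.weightedFMeasure(),
                    eval.weightedAreaUnderROC());
        }
    }
}
```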

2.3 Identification of Credulous Twitter Users

The identification of credulous users can be performed with multiple strategies, since various aspects may contribute to spotting the users most exposed to the malicious activities of bots. In our previous work [3], we introduced a set of rules to discern whether a genuine user is a credulous one. These rules rank users by the ratio of bots over the user's list of friends. Here, we inherit these rules to rank the users in our dataset (see Sect. 2.1), but further ranking strategies could also be considered. Our goal is to build a ground truth of credulous users from which to derive an assessed characterization of these accounts. Applying the approach defined in [3], we identified 316 credulous users in Humans2Consider. This constitutes the input data for the next step. We note that the approach in [3] is very expensive in terms of data gathering: for the investigated dataset, it requires the account information of 421k users and 833 million tweets.
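
The core of the ranking is the bot ratio over each user's friend list; a minimal sketch of that step follows (the actual rules of [3] also apply thresholds that we do not reproduce here):

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical pairing of a user with the labels of its crawled friends.
record UserFriends(String screenName, int botFriends, int totalFriends) {
    double botRatio() {
        return totalFriends == 0 ? 0.0 : (double) botFriends / totalFriends;
    }
}

class CredulousRanker {
    // Rank human-operated accounts by the ratio of bots among their friends,
    // highest ratio first; the rules of [3] then select the credulous ones.
    static void rankByBotRatio(List<UserFriends> users) {
        users.sort(Comparator.comparingDouble(UserFriends::botRatio).reversed());
    }
}
```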

2.4 Classification of Credulous Twitter Users

The goal of this phase is to build a decision model to automatically classify a Twitter account as credulous or not. As ground truth, we consider the 316 accounts identified as credulous according to the process described in Sect. 2.3.

We experiment with the same learning algorithms and the same feature sets considered in Sect. 2.2, with 10-fold cross-validation. However, for credulous user classification, the learning algorithms take as input a very unbalanced dataset: of the 2,838 human-operated accounts (see Sect. 2.1), only 316 have been identified as credulous (see Sect. 2.3). To avoid working with unbalanced data, we split the set of non-credulous users into smaller portions, each equal in size to the number of credulous users. We randomly select as many non-credulous users as there are credulous ones and unify these instances in a new dataset (hereinafter referred to as a fold); we then repeat this process on the previously unselected instances, until no non-credulous instances remain. This procedure is inspired by the under-sampling iteration methodology for strongly unbalanced datasets [18], as sketched below. Each learning algorithm is trained on each fold. To evaluate the classification performance on the whole dataset, and not just on individual folds, we average the individual performance values for each evaluation metric.
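
A minimal sketch of the fold construction, assuming the instances are already separated by class; note that the last fold may be slightly smaller when the majority class is not an exact multiple of the minority size:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class UnderSampler {
    // Split the non-credulous majority into chunks the size of the credulous
    // minority and pair each chunk with all credulous instances ("folds"),
    // in the spirit of the under-sampling iteration of [18].
    static <T> List<List<T>> buildFolds(List<T> credulous,
                                        List<T> notCredulous, long seed) {
        List<T> shuffled = new ArrayList<>(notCredulous);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> folds = new ArrayList<>();
        int k = credulous.size();
        for (int i = 0; i < shuffled.size(); i += k) {
            List<T> fold = new ArrayList<>(credulous);
            fold.addAll(shuffled.subList(i, Math.min(i + k, shuffled.size())));
            folds.add(fold);
        }
        return folds;
    }
}
```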

3 Experimental Results

All the experiments are performed with Weka [28], a tool providing implementations of several machine learning algorithms. In the following, we present the main results obtained for bot detection and credulous user classification; all the details are publicly available at https://tinyurl.com/y4l632g5.

The first column of Tables 1 and 2 shows the feature set used for learning (i.e., ALL_features, Botometer+, ClassA−, see Sect. 2.2). The second column reports a subset of the adopted machine learning algorithms, whose names are abbreviated according to Weka's notation, as listed in the following:

(Figure a: abbreviations of the adopted learning algorithms, in Weka's notation.)

The remaining columns report the evaluation metrics mentioned above.

Table 1. Results for bot detection
Table 2. Results for credulous detection

Regarding bot detection, Table 1 shows that all the machine learning algorithms behave well, regardless of the feature set. Random Forest performs best: with the ALL_features set (see the shaded line in Table 1), accuracy = 98.33%, F1 = 0.98 and AUC = 1.00; after the tuning phase, we obtain a final accuracy of 98.41%.

Table 2 shows that ALL_features and ClassA− achieve good and quite similar classification performance, unlike Botometer+. Both ALL_features and ClassA− prove effective at discriminating credulous users, whereas the Botometer+ features work properly only for bot detection. In more detail, Table 2 shows that the 1R algorithm obtains the best accuracy (93.27% with \(\sigma = 3.22\)) and F-score (0.93), but not the highest AUC (0.93). It is worth noting that the values of the 1R algorithm are exactly the same for ALL_features and ClassA−. This means that the algorithm selects ClassA− features only; the Botometer+ features are useless in this case. This is a relevant result, since ClassA− features refer to the account profile and are less expensive to collect.
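
That 1R relies on a single feature can be verified directly: Weka's OneR builds its rule on exactly one attribute, so printing the trained model reveals which feature it picked. A sketch, with a hypothetical ARFF export of the credulous dataset:

```java
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectOneR {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file with the ALL_features set for the
        // credulous/not-credulous task.
        Instances data = DataSource.read("credulous_all_features.arff");
        data.setClassIndex(data.numAttributes() - 1);
        OneR oner = new OneR();
        oner.buildClassifier(data);
        System.out.println(oner); // prints the single selected attribute and its rule
    }
}
```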

4 Discussion

The results in Table 2 show the capability of our approach to automatically discriminate those Twitter users with a large number of bots as friends, namely credulous users, without explicitly considering the features of their friends, which would imply a very high cost in terms of data gathering. To better appreciate this point, recall that the approach in [3] for the identification of credulous users needs to crawl a large amount of data, due to the necessity of extending the analysis to the friends of a Twitter account. In the specific case under investigation, this means retrieving information for more than 400k user accounts, 11 million tweet mentions, and more than 820 million tweets. By contrast, the credulous detector proposed here only requires gathering the profile information of 2,838 accounts. The classification performance is really promising, with a best accuracy of 93.27%, best F1 of 0.93 and best AUC of 0.93. We remark that these results have been achieved by relying on the so-called ClassA− features only, i.e., features extracted from the account profile. It is noteworthy that the features useful to discriminate credulous genuine accounts belong to the account profile only. This preliminary result calls for three further investigations: 1. comparing the range of values assumed by these features when detecting credulous accounts with the range assumed when detecting social bots (as in [10]); 2. exploring why more complex features (such as Botometer's) do not seem to give good results in finding credulous users; 3. performing a deeper analysis of the importance of each specific feature in discriminating credulous users, by means of, e.g., Principal Component Analysis [28].
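
As a starting point for the third investigation, Weka itself ships a PCA-based attribute evaluator; a sketch of how it could be invoked on the (hypothetical) credulous dataset:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureImportance {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file, as in the earlier sketches.
        Instances data = DataSource.read("credulous_all_features.arff");
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new PrincipalComponents()); // PCA over the features
        selection.setSearch(new Ranker());                 // rank the components
        selection.SelectAttributes(data);
        System.out.println(selection.toResultsString());   // per-component loadings
    }
}
```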

Finally, even if the design of a bot detector is not the primary goal of this paper, but only a means through which we obtain the ground truth for training the credulous user classifier, we note that, compared to the performances reported in [12, 29], our bot detector achieves very good classification performance. This strengthens the robustness of the ground truth obtained in Sect. 2.3, since the nature of the friends is assessed by means of a very accurate classifier.

5 Related Work

Our work is related to all those approaches that investigate peculiar features of social network users. We discuss those we find most relevant to our approach, with the caveat that the presented literature review is far from exhaustive.

A survey on users' behaviour in social networks is proposed in [16]: it remarks that the recipients of shared information should be chosen in a more precautionary way, taking into account more real-life relationships and fewer virtual links. Our approach works exactly in this direction of enhancing users' awareness, by classifying those most exposed to the attacks of social bots.

Information spreading on Twitter is investigated in [20], where the authors demonstrate that the probability of spreading a given piece of information is higher when it is promoted by multiple sources. This supports our choice to analyze the percentage of bots among the friends of human-operated Twitter accounts as a symptom of being more exposed to disinformation.

In [2], human behaviour on Facebook is analyzed by building graphs that capture sequences of activities. Behavioural patterns that do not match any of the known benign models likely signal malicious objectives. Similarly, the realization of a classifier that automatically recognizes credulous users is the first step towards deriving their sequences of activities and, hopefully, peculiar behavioural patterns.

In [11, 14], behavioural analyses of bots and humans on Twitter are performed, to draw fundamental differences between the two groups. Specifically, the former demonstrates that, despite the higher level of synchronization characterizing bot accounts, human behaviour on Twitter is far from random. The latter defines a 'credibility score' measuring how many tweets by bots are present in the timeline of an account. Our work supports the discrimination of credulous users and may lead to a deeper characterization of human accounts.

To the best of our knowledge, little research explores ways to automatically recognize Twitter users susceptible to the attacks of social bots or exposed to disinformation. A notable example [27] builds on interactions (mentions, replies, retweets and friendships) between genuine and bot accounts to obtain a ground truth of users susceptible to social bots; then, similarly to our approach, different learning algorithms are adopted to train a classifier. Contrary to their approach, the current work is able to classify users close to social bots with lightweight features, all computed from data available in the user's profile. Another recent line of research is the detection of users susceptible to fake news. The work in [25] monitors the replies of Twitter users to a priori known fake news, in order to tag those users as vulnerable to disinformation or not. A supervised classification task is then launched to train a model able to classify gullible users according to content-, user-, and network-based features.

6 Conclusions

Inspired by recent literature showing that disinformation is not only promoted by social bots but also amplified by genuine peers, in this work we proposed a supervised classification engine to discriminate credulous users, i.e., human-operated accounts with a high percentage of bots as friends. The classifier achieves very good performance and avoids a heavy feature engineering and extraction phase. Further research efforts will be devoted to investigating the behaviour of credulous users, as well as their posted content, to learn more about their peculiarities and the quality of the information they help diffuse.