Abstract
During the past decade, microblog services have been extensively utilized by millions of business and private users as one of the most powerful information broadcasting tools. For example, Twitter has attracted many social science researchers due to its high popularity, constrained format of thought expression, and its ability to reflect actual trends. However, unstructured data from microblogs often suffer from a lack of representativeness due to the tremendous amount of noise. Such noise is often introduced by the activity of organizational and fake user accounts that may not be useful in many application domains. Aiming to tackle the information filtering problem, in this paper, we classify Twitter accounts into three categories: “Personal”, “Organization”, and “Personage”. Specifically, we utilize various text-based data representation approaches to extract features for our proposed microblog account type prediction framework “POP-MAP”. To study the problem at a cross-language level, we harvested and learned from a multi-lingual Twitter dataset, which allows us to achieve better classification performance as compared to various state-of-the-art baselines.
1 Introduction
Web scientists use social media as a rich source of information about users’ individuality, behavior, and preferences [9, 13, 15, 25]. It is used to recover user profiles [3, 10, 12] and make targeted recommendations [11, 19]. The availability of these personal user attributes allows web scientists to compete with traditional sociologists, epidemiologists, and political experts in such tasks as voting outcome prediction [14, 24], disease outbreak prediction [7, 17], or group population visualization [1]. However, the representativeness of the data in most web science studies is extremely low due to the significant level of noise.
The noise in social media is often related to the fact that not all accounts represent a real human. For example, this can be caused by specific bots that mimic human behavior while being governed by an algorithm or another human. Many works are devoted to detecting such accounts [4, 5, 6, 26]. At the same time, some microblog accounts may not represent a person, but be related to something else: accounts of corporations (Adidas), banks (DBS bank), museums (The State Hermitage Museum), animals (Grumpy Cat), or personages (such as Harry Potter). These accounts represent a certain subject that may or may not be equipped with the aforementioned personal user attributes (i.e. demographics). However, most of them are irrelevant to social studies.
Nevertheless, most of the existing social media analysis studies either do not perform irrelevant user account filtering [11, 12], perform it manually [16, 22], or do not utilize openly available user-generated data [20, 23]. For example, Tavares et al. [23] presented a method to classify personal and corporate accounts, which solved the problem with 84.6% accuracy. However, the authors did not use user-generated content, which may result in sub-optimal performance due to the lack of data representativeness. At the same time, Oentaryo et al. [20] utilized contextual, social, and temporal features, which allowed them to achieve 91% account type classification accuracy with a gradient boosting algorithm. However, the employed data types are often not available for public use, which constrains the applicability of the proposed approach to real-world scenarios.
In contrast, in our study, we perform the task of microblog user account type inference based on textual user-generated content only, which makes it applicable in real-world settings. We assume that textual data is sufficient for achieving high classification performance and train our proposed “POP-MAP” framework to perform “Person”-“Organization”-“Personage” Microblog Account Prediction.
2 On Microblog Account Typization
A microblog is a specific type of social media resource, which allows its users to share short status updates with their subscribers. One of the most well-known microblogs is Twitter, where messages (statuses) are publicly accessible, in contrast to other big social networks such as Facebook, and the length of a message cannot exceed 140 symbols (280 since the end of 2017), which makes its posts standardized and rarely covering more than one topic [28].
According to Barone [2], each Twitter account belongs to one of the following five types:
1. Corporate Account, which is typically a company news feed: Facebook, Google, Yandex, and VKontakte.
2. Corporate-led Persona Account, which is a corporate account that includes both personal and business sides. For example, an account of online shop Zappos is Tony Hsieh’s account, in fact.
3. Strictly Personal Account is an account representing an individual microblog user.
4. Business/Personal Hybrid Account is a mixture of the personal account and professional account types, where most of the tweets contain information about its user, but also a considerable number of tweets is dedicated to the user’s professional interests. Accounts of famous people usually belong to this type, for example, Pavel Durov or Jimmy Wales accounts.
5. Personage Account, which is the personage-based account that typically is an animal, plant, or fictional hero.
In this paper, we adopt the three most popular account types from the above categorization: organization account, personal account, and personage account. The other two hybrid types are considered to be a part of the selected ones, so that all the Corporate-led Persona Accounts are treated as organization accounts, while Business/Personal Hybrid Accounts are considered to be personal accounts.
3 Feature Extraction
Classification algorithms strongly depend on the features that describe objects. Thus, feature engineering is a key step in solving most data mining problems. In this section, we define all the features we used to describe a Twitter account.
Words Frequency.
Individual users typically use everyday vocabulary in their tweets, while organizations may adopt a domain-specific vocabulary that can be a good indicator of the organization account type. In accordance with this assumption, we use the following features:
- average word frequency among all words in tweet;
- average word frequency among all words in all user’s tweets.
We utilized Sharov’s Frequency Dictionary and Word frequency data for obtaining general usage frequency of Russian and English words respectively.
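The two word-frequency features above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the `WORD_FREQ` table is a hypothetical toy sample standing in for a real frequency dictionary, the tokenizer is a simple letters-only regular expression, and treating unknown words as frequency 0 is an assumption.

```python
import re
from statistics import mean

# Hypothetical frequency table (occurrences per million words); a real run
# would load a full frequency dictionary instead of this toy sample.
WORD_FREQ = {"the": 50000.0, "a": 25000.0, "coffee": 60.0, "invoice": 8.0}

TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters, works for Cyrillic too


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def avg_frequency(tokens, freq=WORD_FREQ):
    """Mean dictionary frequency of the tokens; unknown words count as 0."""
    if not tokens:
        return 0.0
    return mean(freq.get(tok, 0.0) for tok in tokens)


def word_frequency_features(tweets, freq=WORD_FREQ):
    """Return (mean of per-tweet averages, average over all account tokens)."""
    per_tweet = [avg_frequency(tokenize(t), freq) for t in tweets]
    all_tokens = [tok for t in tweets for tok in tokenize(t)]
    per_tweet_mean = mean(per_tweet) if per_tweet else 0.0
    return per_tweet_mean, avg_frequency(all_tokens, freq)
```

The two values differ whenever tweets have different lengths, since the first weights every tweet equally while the second weights every word equally.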
Spelling Mistakes.
It is well known that individual user accounts tend to make more grammatical mistakes and misspellings as compared to properly maintained organizational accounts. Inspired by this phenomenon, we utilized LanguageTool to extract the number of mistakes/misspellings per account.
Hashtags.
Hashtags are often used for grouping microblog messages and improving Twitter search. Compared to corporate accounts, personal accounts are characterized by extensive use of hashtags to express thoughts and feelings. We thus extracted the following hashtag-based features:
- average number of unique hashtags per account;
- average number of hashtags per tweet;
- average length of hashtag per tweet.
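One possible reading of the hashtag features is sketched below. The function name, the regular expression, and the exact normalization (case-insensitive uniqueness, averaging hashtag length over all hashtags of the account) are assumptions made for illustration; the paper does not specify them.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")  # captures the tag text without the '#'


def hashtag_features(tweets):
    """Compute three hashtag features for one account's list of tweet texts."""
    per_tweet_tags = [HASHTAG_RE.findall(t) for t in tweets]
    all_tags = [tag.lower() for tags in per_tweet_tags for tag in tags]
    n_tweets = len(tweets) or 1  # avoid division by zero for empty accounts
    avg_length = sum(map(len, all_tags)) / len(all_tags) if all_tags else 0.0
    return {
        "unique_hashtags": len(set(all_tags)),
        "hashtags_per_tweet": len(all_tags) / n_tweets,
        "avg_hashtag_length": avg_length,
    }
```

Replacing `#` with `@` in the pattern would give an analogous sketch for user-mention counts.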
Users’ Mentions.
Similar to hashtags, user mentions are widespread in social networks. However, we cannot expect personage accounts to use them often due to the lower number of actual social ties between them and individual Twitter users. To incorporate this aspect, we extracted the following user mention features:
- average number of unique mentions per account;
- average number of mentions per tweet;
- average length of mention per tweet.
Tweet/Word Length.
Many acronyms (e.g. “gotcha” meaning “I got you”) are widespread among users of social networks. The reason is that they help to fit more information into a short Twitter message. These acronyms, however, are not popular among organizational Twitter accounts. Therefore, we extracted the following features representing text length:
- average length of word per account;
- average length of tweet per account.
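The two length features can be sketched as below, assuming whitespace tokenization and length measured in characters; both choices are illustrative assumptions rather than details given in the paper.

```python
def length_features(tweets):
    """Return (average word length, average tweet length) in characters."""
    words = [w for t in tweets for w in t.split()]
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    avg_tweet_len = sum(len(t) for t in tweets) / len(tweets) if tweets else 0.0
    return avg_word_len, avg_tweet_len
```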
Part of Speech (POS).
To reflect different styles of language use, we included features related to words’ POS. The following POS groups have been identified:
- noun;
- verb;
- personal pronoun;
- pronoun (others);
- adjective;
- adverb;
- preposition, conjunction, particle;
- adverb + adjective;
- adverb + adverb.
For each group, we then calculated the following features:
- average number of groups per account;
- average number of groups per tweet;
- average number of negative particles per account.
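The POS group counting can be sketched as follows for tweets that have already been tagged. The tag names (`NOUN`, `PRON_PERS`, `FUNC`, and so on) are hypothetical placeholders; the output of a real tagger would first have to be mapped onto the groups listed above, and the two-word groups are assumed here to mean adjacent tokens.

```python
from collections import Counter

# Hypothetical coarse tag names; a real tagger's tag set would need mapping.
SINGLE_GROUPS = {"NOUN", "VERB", "PRON_PERS", "PRON", "ADJ", "ADV", "FUNC"}
PAIR_GROUPS = {("ADV", "ADJ"), ("ADV", "ADV")}


def pos_group_counts(tagged_tweet):
    """Count POS groups in one tweet given as a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged_tweet]
    counts = Counter(tag for tag in tags if tag in SINGLE_GROUPS)
    # Two-word groups are counted over adjacent tag pairs.
    counts.update(pair for pair in zip(tags, tags[1:]) if pair in PAIR_GROUPS)
    return counts


def pos_features(tagged_tweets):
    """Return per-account totals and per-tweet averages for every group."""
    totals = Counter()
    for tweet in tagged_tweets:
        totals.update(pos_group_counts(tweet))
    n = len(tagged_tweets) or 1
    per_tweet = {group: count / n for group, count in totals.items()}
    return totals, per_tweet
```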
Personal Words.
Accounts belonging to people or personages can be easily identified by the so-called personal words. Inspired by this fact, we extracted “average number of personal words per account” feature.
Symbols.
Similarly to previous studies, for each symbol in Table 1, we calculated the following features:
- average number of signs per tweet;
- average number of unique signs per tweet;
- average number of tweets with a sign per account;
- average number of a sign per tweet;
- average number of tweets with signs per account;
- average number of unique signs per account.
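A rough sketch of how per-symbol and aggregate symbol features might be computed is given below. The symbol set `SYMBOLS` is a hypothetical stand-in for the contents of Table 1, and the feature names and normalizations are illustrative assumptions.

```python
SYMBOLS = "!?.,:;"  # hypothetical stand-in for the symbols of Table 1


def symbol_features(tweets, symbols=SYMBOLS):
    """Compute per-symbol and aggregate symbol statistics for one account."""
    n = len(tweets) or 1  # avoid division by zero for empty accounts
    features = {}
    for s in symbols:
        counts = [t.count(s) for t in tweets]
        features[s] = {
            "avg_per_tweet": sum(counts) / n,
            "share_of_tweets_with_symbol": sum(c > 0 for c in counts) / n,
        }
    # Set of tracked symbols that occur in each tweet.
    used = [set(t) & set(symbols) for t in tweets]
    total = sum(t.count(s) for t in tweets for s in symbols)
    features["aggregate"] = {
        "avg_symbols_per_tweet": total / n,
        "avg_unique_symbols_per_tweet": sum(len(u) for u in used) / n,
        "unique_symbols_per_account": len(set().union(*used)),
    }
    return features
```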
Emoticons.
Similar to the symbol features, for each group of emoticons in Table 2, we calculated the following emoticon features:
- average number of emoticons per tweet;
- average number of tweets with emoticon per account;
- average number of an emoticon per tweet;
- average number of unique emoticons per account.
Vocabulary Uniqueness.
Organization accounts on Twitter are often created to be used for specific applications. For example, Yandex.Taxi is designed to support taxi services, while Yandex.Market is related to e-commerce services aggregation. Every specific usage domain reduces the diversity of words in organizations’ microblog accounts. Based on this assumption, we extracted the following vocabulary-uniqueness features:
- average number of unique words per account;
- average number of words not from a vocabulary per account.
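The vocabulary-uniqueness features might be computed along the following lines. Normalizing both counts by the total number of tokens is an assumption made here so that accounts with different activity levels are comparable; the `vocabulary` argument is a hypothetical set of known words, for instance the keys of a frequency dictionary.

```python
import re

TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters only


def vocabulary_features(tweets, vocabulary):
    """Return (share of unique words, share of out-of-vocabulary words)."""
    tokens = [tok for t in tweets for tok in TOKEN_RE.findall(t.lower())]
    if not tokens:
        return 0.0, 0.0
    unique_ratio = len(set(tokens)) / len(tokens)
    oov_ratio = sum(tok not in vocabulary for tok in tokens) / len(tokens)
    return unique_ratio, oov_ratio
```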
Hyperlinks.
Users often post URLs to third-party resources, such as events, pictures, etc. The URL usage can be a good indicator of individual user accounts. Based on this assumption, we extracted the features below:
- average number of links per account;
- average number of tweets with links per account.
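A minimal sketch of the hyperlink features follows. The URL pattern is deliberately simple (anything starting with `http://` or `https://` up to the next whitespace) and is an assumption; shortened or scheme-less links would need a more careful pattern.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def link_features(tweets):
    """Return (average links per tweet, share of tweets containing a link)."""
    counts = [len(URL_RE.findall(t)) for t in tweets]
    n = len(tweets) or 1  # avoid division by zero for empty accounts
    return sum(counts) / n, sum(c > 0 for c in counts) / n
```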
Twitter-Specific Features.
Organization accounts are often characterized by a large number of subscribers (followers), but a relatively small number of subscriptions (following). This is also the case for popular personage accounts. It is also worth mentioning that corporate accounts are often verified, which often does not hold for personal accounts, while personage accounts are almost never verified. We therefore extracted the following features:
- number of subscribers;
- number of subscriptions;
- if the account is verified;
- average number of “favorite” tweets.
Overall, we propose 136 features for Twitter account type classification. It is worth mentioning that some of them (such as usage of hashtags, hyperlinks, and personal words) have never been used for this task before and, thus, they are among the contributions of this study.
4 Experiment Setup
4.1 Data Collection
Due to the lack of publicly available datasets on Twitter account type inference, we collected our own dataset. To do so, we developed a crawler for downloading the last n = 500 tweets of each specified user, where the list of account names was created manually.
4.2 Utilized Machine Learning Methods
We employed the following commonly utilized classification baselines that are implemented as part of the WEKA machine learning library: k-nearest neighbors, Naïve Bayes classifier, Support Vector Machines (SVM) classifier, Decision Trees (its C4.5 version), and Random Forest. These algorithms were applied to the profiles represented by the extracted POP-MAP features that were presented in Sect. 3.
We used several feature selection (FS) algorithms [27] to select only representative features:
- dependency-based elimination, such as: CFS-BiS, CFS-GS, CFS-LS, CFS-RS, CFS-SBS, CFS-SFS, CFS-SWS, CFS-TS;
- consistency-based elimination, such as: Cons-BiS, Cons-GS, Cons-LS, Cons-RS, Cons-SBS, Cons-SFS, Cons-SWS;
- Significant algorithm, which is based on estimating feature “significance”;
- ReliefF, which measures feature importance based on comparison to similar objects of the same class.
In addition, we utilized the well-known dimensionality reduction algorithm PCA that is also implemented in WEKA.
We evaluated the prediction performance using two widely adopted evaluation measures: accuracy and F-measure. Model evaluation was organized as 5-fold cross-validation.
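The evaluation protocol can be sketched as follows. The experiments in this paper were run in WEKA; the pure-Python code below is only an illustration of the two measures and of a fold split. Macro-averaging the F-measure over classes and using contiguous (unshuffled, unstratified) folds are assumptions, since the averaging scheme and fold construction are not stated.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

In practice the samples would be shuffled (and ideally stratified by class) before splitting, because the three account types are not equally represented.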
5 Experiments on Russian Text Corpora
We collected a sample consisting of 298 Russian personal accounts, 160 Russian organization accounts, and 151 Russian personage accounts using the tool and method described in the previous section.
5.1 Comparing Baselines
Since there are no existing solutions for the problem of microblog account type inference, we consider standard text classification techniques as our baselines:
- Naïve Bayes (NB) is a simple Naïve Bayes classifier with minor preprocessing (all hyperlinks are removed and letters are changed to lowercase) [8].
- Classifier with stemmer (Stemmer) is NB with Porter’s stemmer applied [21].
- Classifier with emoticons (Emoticon) is the classifier from Lin’s work [18], which determines chat users’ age and gender based on emoticons in users’ posts. To implement this method, we identified 500 different emoticons.
The baseline results are presented in Table 3. As we can see, stemming has expectedly improved NB but outperformed Emoticon. This is possibly because organizations use less formal language on Twitter than we expected.
5.2 Comparing Approaches Trained on POP-MAP Features
POP-MAP without Feature Selection.
We conducted experiments using the setup described in Sect. 4 on the collected dataset. The results are presented in Table 4. The best performance was shown by Random Forest, which is consistent with a previous study [12] and can be explained by its feature selection ability.
POP-MAP with Feature Selection.
To improve classification performance, we applied dimensionality reduction algorithms described in Sect. 4. First, we applied PCA. As we can see from Table 5, PCA did not improve the classification performance.
Then we picked the best feature selection algorithm for each classifier with respect to the resulting performance. The evaluation results are presented in Table 6. As can be seen, feature selection improved the performance of all the models. Nevertheless, Random Forest kept its position as the best classifier, which can be explained by its additional built-in feature selection ability.
5.3 Results Summary
From Table 6, it can be seen that the best performance was achieved by the Random Forest classifier on the CFS-TS-preprocessed data. The contingency matrix presented in Table 7 shows that the resulting classifier makes a small number of misclassifications, while the most complex task for it is to distinguish between personal accounts and personage accounts. This can be explained by the similar nature of these two types of accounts, which conforms well with manual comparison of such accounts.
We used the mutual information (MI) measure to estimate feature importance. The most valuable features are average number of personal words per account (0.679), average number of personal pronouns per tweet (0.633), average number of personal words per tweet (0.472), average number of links per account (0.402), and number of subscriptions (0.378). Among other features with MI greater than 0.2, seven are POS features, one is tweets with links per account, and two are tweet length features.
As we can see, the most important features are related to personality and references. We may expect the same situation for the English language.
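For reference, the mutual information between a discretized feature and the class label can be estimated as sketched below. The paper does not state how continuous features were discretized or which logarithm base was used, so the equal-width binning and the base-2 logarithm (bits) here are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Map numeric values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant features fall into bin 0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(feature_values, labels):
    """Mutual information (in bits) between a discrete feature and the class."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    p_x = Counter(feature_values)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi
```

A feature that is independent of the class yields a value of 0, while a feature that determines the class exactly yields the entropy of the class distribution.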
6 Experiments on English Text Corpora
6.1 Dataset
To perform evaluation on English corpora, we collected a sample consisting of 281 English personal accounts, 130 English organization accounts, and 130 English personage accounts using the tool and method described in Sect. 4.
6.2 Results
In this setup, we tested only Random Forest since it has shown the best performance for the Russian language. The best result was achieved after applying the Cons-GS algorithm, which selected 44 features and resulted in an accuracy of 0.894 and an F-measure of 0.879. The contingency table is presented in Table 8. The resulting classifier also makes only a small number of mistakes. As we can see, the classifier for the English corpora outperforms the best classifier for the Russian corpora.
The most valuable features with respect to the MI are: number of subscriptions (0.709), average number of personal words per account (0.516), if the account is verified (0.479), average number of tweets with links per account (0.290), and average number of unique signs per account (0.274). Among other features with MI greater than 0.2, four are symbol features, one is number of subscribers, one is average number of hyperlinks per tweet, and one is average length of tweets.
We can see that personal words are also a strong feature besides Twitter-specific features. However, POS features are not at the top as in Russian. Instead, symbol-specific features are useful for English.
6.3 Results for Binary Classification
We also compared our results with the results reported in [23], where the authors classified microblog accounts only into personal and corporate types. To do so, we selected only personal and organization accounts from the initial datasets and ran the best-built classifiers for English and Russian. The results of the comparison are presented in Table 9. As can be seen, the POP-MAP results on both the Russian and English corpora are similarly high and significantly surpass the behavior-based approach.
7 Conclusion
In this paper, we addressed the problem of Twitter account classification. We described 136 features, which we then used in different classification models. We ran experiments on corpora of Russian and English tweets and achieved similarly high classification performance for both languages with the Random Forest model.
However, we discovered that there is a difference in text feature importance between the two languages, while Twitter-specific features have the same importance. The only exception is the strong feature related to personal words, which is useful in both English and Russian.
The research is supported by the Government of the Russian Federation, Grant 08-08.
References
1. Aramaki, E., Maskawa, S., Morita, M.: Twitter catches the flu: detecting influenza epidemics using twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1568–1576. Association for Computational Linguistics (2011)
2. Barone, L.: Which type of twitter account should you create? (2010). http://smallbiztrends.com/2010/02/types-of-twitter-accounts.html. Accessed 15 Apr 2016
3. Bartunov, S., Korshunov, A., Park, S.-T., Ryu, W., Lee, H.: Joint link-attribute user identity resolution in online social networks. In: Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, Workshop on Social Network Mining and Analysis. ACM (2012)
4. Boshmaf, Y., Muslukhov, I., Beznosov, K., Ripeanu, M.: Design and analysis of a social botnet. Comput. Netw. 57(2), 556–578 (2013)
5. Cao, Q., Sirivianos, M., Yang, X., Pregueiro, T.: Aiding the detection of fake accounts in large scale social online services. In: Presented as Part of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, pp. 197–210 (2012)
6. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Who is tweeting on Twitter: human, bot, or cyborg? In: Proceedings of the 26th Annual Computer Security Applications Conference, pp. 21–30. ACM (2010)
7. Culotta, A.: Towards detecting influenza epidemics by analyzing twitter messages. In: Proceedings of the First Workshop on Social Media Analytics, pp. 115–122. ACM (2010)
8. Deitrick, W., Miller, Z., Valyou, B., Dickinson, B., Munson, T., Wei, H.: Gender identification on twitter using the modified balanced winnow. Commun. Netw. 4(3), 1–7 (2012)
9. Farseev, A., Akbari, M., Samborskii, I., Chua, T.-S.: 360° user profiling: past, future, and applications. ACM SIGWEB Newslett, (Summer), Article no. 4 (2016)
10. Farseev, A., Chua, T.-S.: TweetFit: fusing sensors and multiple social media for wellness profile learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI (2017)
11. Farseev, A., Kotkov, D., Semenov, A., Veijalainen, J., Chua, T.-S.: Cross-social network collaborative recommendation. In: Proceedings of the ACM Web Science Conference, p. 38. ACM (2015)
12. Farseev, A., Nie, L., Akbari, M., Chua, T.-S.: Harvesting multiple sources for user profile learning: a big data study. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pp. 235–242. ACM (2015)
13. Farseev, A., Samborskii, I., Chua, T.-S.: bBridge: a big data platform for social multimedia analytics. In: Proceedings of the 2016 ACM Conference on Multimedia, pp. 759–761. ACM (2016)
14. Filchenkov, A.A., Azarov, A.A., Abramov, M.V.: What is more predictable in social media: election outcome or protest action? In: Proceedings of the 2014 Conference on Electronic Governance and Open Society: Challenges in Eurasia, pp. 157–161. ACM (2014)
15. Hendler, J., Shadbolt, N., Hall, W., Berners-Lee, T., Weitzner, D.: Web science: an interdisciplinary approach to understanding the web. Commun. ACM 51(7), 60–69 (2008)
16. Kafeza, E., Kanavos, A., Makris, C., Vikatos, P.: T-PICE: Twitter personality based influential communities extraction system. In: 2014 IEEE International Congress on Big Data, pp. 212–219. IEEE (2014)
17. Lee, K., Agrawal, A., Choudhary, A.: Real-time disease surveillance using twitter data: demonstration on flu and cancer. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1474–1477. ACM (2013)
18. Lin, J.: Automatic author profiling of online chat logs. Ph.D. thesis, Monterey, California. Naval Postgraduate School (2007)
19. Lin, J., Sugiyama, K., Kan, M.-T., Chua, T.-S.: Addressing cold-start in app recommendation: latent user models constructed from twitter followers. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 283–292. ACM (2013)
20. Oentaryo, R.J., Low, J.-W., Lim, E.-P.: Chalk and Cheese in twitter: discriminating personal and organization accounts. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 465–476. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16354-3_51
21. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
22. Schwartz, H.A., et al.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS One 8(9), e73791 (2013)
23. Tavares, G., Faisal, A.: Scaling-laws of human broadcast communication enable distinction between human, corporate and robot twitter users. PLoS One 8(7), e65774 (2013)
24. Tsakalidis, A., Papadopoulos, S., Cristea, A.I., Kompatsiaris, Y.: Predicting elections for multiple countries using twitter and polls. IEEE Intell. Syst. 30(2), 10–17 (2015)
25. Varlamov, M.I., Turdakov, D.Y.: A survey of methods for the extraction of information from web resources. Program. Comput. Softw. 42(5), 279–291 (2016)
26. Wang, A.H.: Detecting spam bots in online social networking sites: a machine learning approach. In: Foresti, S., Jajodia, S. (eds.) DBSec 2010. LNCS, vol. 6166, pp. 335–342. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13739-6_25
27. Wang, G., Song, Q., Sun, H., Zhang, X., Xu, B., Zhou, Y.: A feature subset selection algorithm automatic recommendation method. J. Artif. Intell. Res. 47, 1–34 (2013)
28. Zhao, W.X., et al.: Comparing twitter and traditional media using topic models. In: Clough, P., et al. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20161-5_34
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Samborskii, I., Filchenkov, A., Korneev, G., Farseev, A. (2019). Person, Organization, or Personage: Towards User Account Type Prediction in Microblogs. In: Chugunov, A., Misnikov, Y., Roshchin, E., Trutnev, D. (eds) Electronic Governance and Open Society: Challenges in Eurasia. EGOSE 2018. Communications in Computer and Information Science, vol 947. Springer, Cham. https://doi.org/10.1007/978-3-030-13283-5_9
DOI: https://doi.org/10.1007/978-3-030-13283-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13282-8
Online ISBN: 978-3-030-13283-5
eBook Packages: Computer Science (R0)