1 Introduction

Web scientists use social media as a rich source of information about users’ individuality, behavior, and preferences [9, 13, 15, 25]. It is used to recover user profiles [3, 10, 12] and make targeted recommendations [11, 19]. The availability of these personal user attributes allows them to compete with traditional sociologists, epidemiologists, and political experts in such tasks as voting outcome prediction [14, 24], disease outbreak prediction [7, 17], or group population visualization [1]. However, in most web science studies the representativeness of the data is extremely low due to a significant level of noise.

The noise in social media is often related to the fact that not all accounts represent a real human. For example, this can be caused by bots that mimic human behavior while being governed by an algorithm or by another human. Many works are devoted to detecting such accounts [4,5,6, 26]. At the same time, some microblog accounts may not represent a person but be related to something else: accounts of corporations (Adidas), banks (DBS Bank), museums (The State Hermitage Museum), animals (Grumpy Cat), or personages (such as Harry Potter). These accounts represent a certain subject that may or may not be equipped with the aforementioned personal user attributes (i.e., demographics). However, most of them are irrelevant to social studies.

Nevertheless, most existing social media analysis studies either do not perform irrelevant account filtering [11, 12], perform it manually [16, 22], or do not utilize openly available user-generated data [20, 23]. For example, Tavares et al. [23] presented a method to classify personal and corporate accounts, which solved the problem with 84.6% accuracy. However, the authors did not use user-generated content, which may result in sub-optimal performance due to the lack of data representativeness. At the same time, Oentaryo et al. [20] utilized contextual, social, and temporal features, which allowed a gradient boosting algorithm to achieve 91% account type classification accuracy. However, the employed data types are often not available for public use, which constrains the applicability of the proposed approach to real-world scenarios.

In our study, we therefore perform the task of microblog account type inference based on textual user-generated content only, which makes it applicable in real-world settings. We assume that textual data is sufficient for achieving high classification performance and train our proposed POP-MAP framework to perform “Person”-“Organization”-“Personage” Microblog Account Prediction.

2 On Microblog Account Typology

A microblog is a specific type of social media resource that allows its users to share short status updates with their subscribers. One of the most well-known microblogs is Twitter, where messages (statuses) are publicly accessible, in contrast to other big social networks such as Facebook, and the length of a message cannot exceed 140 symbols (280 since the end of 2017), which makes its posts standardized and rarely covering more than one topic [28].

According to Barone et al. [2], each Twitter account belongs to one of the following five types:

  1. Corporate Account, which is typically a company news feed, for example Facebook, Google, Yandex, or VKontakte.

  2. Corporate-led Persona Account, which is a corporate account that includes both personal and business sides. For example, the account of the online shop Zappos is, in fact, Tony Hsieh’s account.

  3. Strictly Personal Account, which is an account representing an individual microblog user.

  4. Business/Personal Hybrid Account, which is a mixture of the personal and professional account types, where most of the tweets contain information about the user, but a considerable number of tweets is also dedicated to the user’s professional interests. Accounts of famous people, such as Pavel Durov or Jimmy Wales, usually belong to this type.

  5. Personage Account, which is a personage-based account typically representing an animal, plant, or fictional hero.

In this paper, we adopt the three most popular account types from the above categorization: organization account, personal account, and personage account. The other two hybrid types are considered to be a part of the selected ones, so that all Corporate-led Persona Accounts are treated as organization accounts, while Business/Personal Hybrid Accounts are considered to be personal accounts.

3 Feature Extraction

Classification algorithms strongly depend on the features that describe objects. Thus, feature engineering is a key step in solving most data mining problems. In this section, we define all the features we use to describe a Twitter account.

Word Frequency.

Individual users typically use everyday vocabulary in their tweets, while organizations may adopt a domain-specific vocabulary that can be a good indicator of the organization account type. In accordance with this assumption, we use the following features:

  • average word frequency among all words in a tweet;

  • average word frequency among all words in all of a user’s tweets.

We utilized Sharov’s Frequency Dictionary and the Word Frequency Data resource to obtain the general usage frequency of Russian and English words, respectively.
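
A minimal sketch of how these two features could be computed, assuming the frequency dictionary has been loaded into a Python dict mapping a word to its corpus frequency (words absent from the dictionary are treated as having zero frequency):

    import re

    WORD_RE = re.compile(r"[A-Za-zА-Яа-яЁё']+")

    def avg_word_frequency(text, freq_dict):
        """Average corpus frequency over the words of a single tweet."""
        words = [w.lower() for w in WORD_RE.findall(text)]
        if not words:
            return 0.0
        return sum(freq_dict.get(w, 0.0) for w in words) / len(words)

    def account_word_frequency(tweets, freq_dict):
        """Average corpus frequency over all words in all tweets of an account."""
        words = [w.lower() for t in tweets for w in WORD_RE.findall(t)]
        if not words:
            return 0.0
        return sum(freq_dict.get(w, 0.0) for w in words) / len(words)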

Spelling Mistakes.

It is well known that individual user accounts tend to contain more grammatical mistakes and misspellings than properly maintained organizational accounts. Inspired by this phenomenon, we utilized LanguageTool to extract the number of mistakes/misspellings per account.
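
A minimal sketch of this feature, assuming the language_tool_python wrapper around LanguageTool (the paper only states that LanguageTool itself was used):

    import language_tool_python  # assumed wrapper; the paper only names LanguageTool

    def mistakes_per_account(tweets, lang="ru-RU"):
        """Count grammar/spelling issues reported by LanguageTool over all tweets."""
        tool = language_tool_python.LanguageTool(lang)
        return sum(len(tool.check(tweet)) for tweet in tweets)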

Hashtags.

Hashtags are often used to group microblog messages and improve Twitter search. Personal accounts are characterized by extensive use of hashtags to express their thoughts and feelings, as compared to corporate accounts. We thus extracted the following hashtag-based features:

  • average number of unique hashtags per account;

  • average number of hashtags per tweet;

  • average length of hashtag per tweet.

Users’ Mentions.

Similar to hashtags, user mentions are widespread in social networks. However, we cannot expect personage accounts to use them often due to the lower number of actual social ties between them and individual Twitter users. To incorporate this aspect, we extracted the following user mention features:

  • average number of unique mentions per account;

  • average number of mentions per tweet;

  • average length of mention per tweet.
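
Both the hashtag and the mention features can be extracted with simple pattern matching; a minimal sketch is given below (the exact tokenization rules used in the paper are not specified):

    import re

    HASHTAG_RE = re.compile(r"#(\w+)")
    MENTION_RE = re.compile(r"@(\w+)")

    def tag_features(tweets, pattern):
        """Return (unique tags per account, avg tags per tweet, avg tag length)."""
        per_tweet = [pattern.findall(t) for t in tweets]
        all_tags = [tag for tags in per_tweet for tag in tags]
        n_tweets = max(len(tweets), 1)
        avg_len = sum(len(tag) for tag in all_tags) / max(len(all_tags), 1)
        return len(set(all_tags)), len(all_tags) / n_tweets, avg_len

    tweets = ["Great day at #work with @alice!", "Just #coffee"]  # toy example
    hashtag_feats = tag_features(tweets, HASHTAG_RE)
    mention_feats = tag_features(tweets, MENTION_RE)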

Tweet/Word Length.

Many abbreviations and acronyms (e.g., “gotcha” meaning “I got you”) are widespread among users of social networks, since they help fit more information into a short Twitter message. These abbreviations, however, are not popular among organizational Twitter accounts. Therefore, we extracted the following features representing text length:

  • average length of word per account;

  • average length of tweet per account.
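
A minimal sketch of the two length features:

    def length_features(tweets):
        """Average word length and average tweet length (in characters) for an account."""
        words = [w for t in tweets for w in t.split()]
        avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
        avg_tweet_len = sum(len(t) for t in tweets) / max(len(tweets), 1)
        return avg_word_len, avg_tweet_len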

Part of Speech (POS).

To reflect different styles of language use, we included features related to words’ POS. The following POS groups have been identified:

  • noun;

  • verb;

  • personal pronoun;

  • pronoun (others);

  • adjective;

  • adverb;

  • preposition, conjunction, particle;

  • adverb + adjective;

  • adverb + adverb.

For each group, we then calculated the following features:

  • average number of words from the group per account;

  • average number of words from the group per tweet;

  • average number of negative particles per account.
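
One possible way to obtain these counts is sketched below for English with NLTK’s Penn Treebank tagger (for Russian, a morphological tagger such as pymorphy2 would be required); the mapping from tags to the groups above is our assumption and covers only part of the groups:

    import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

    # Assumed mapping from Penn Treebank tags to some of the POS groups listed above.
    POS_GROUPS = {
        "noun": {"NN", "NNS", "NNP", "NNPS"},
        "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
        "personal_pronoun": {"PRP", "PRP$"},
        "adjective": {"JJ", "JJR", "JJS"},
        "adverb": {"RB", "RBR", "RBS"},
    }

    def pos_group_counts(tweets):
        """Count words of each POS group over all tweets of an account.

        Per-tweet averages are obtained by dividing by the number of tweets."""
        counts = {group: 0 for group in POS_GROUPS}
        for tweet in tweets:
            for _, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
                for group, tags in POS_GROUPS.items():
                    if tag in tags:
                        counts[group] += 1
        return counts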

Personal Words.

Accounts belonging to people or personages can be easily identified by the so-called personal words. Inspired by this fact, we extracted “average number of personal words per account” feature.
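
A minimal sketch of this feature; the list of personal words below is purely illustrative, as the paper does not enumerate the actual list:

    # Illustrative (assumed) list of personal words; the paper's actual list is not given.
    PERSONAL_WORDS = {"i", "me", "my", "mine", "myself", "я", "мне", "меня", "мой", "моя"}

    def personal_words_per_account(tweets):
        """Total number of personal words over all tweets of an account."""
        return sum(
            1
            for tweet in tweets
            for word in tweet.lower().split()
            if word.strip(".,!?") in PERSONAL_WORDS
        )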

Symbols.

Similarly to previous studies, for each symbol in Table 1 we calculated the following features:

Table 1. Symbols that are used to calculate features.

  • average number of signs per tweet;

  • average number of unique signs per tweet;

  • average number of tweets with a sign per account;

  • average number of occurrences of a given sign per tweet;

  • average number of tweets with signs per account;

  • average number of unique signs per account.

Emoticons.

Similar to the symbol features, for each group of emoticons in Table 2, we calculated the following emoticon features:

Table 2. Emoticon groups that are used to calculate features.

  • average number of emoticons per tweet;

  • average number of tweets with emoticon per account;

  • average number of occurrences of a given emoticon per tweet;

  • average number of unique emoticons per account.
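
The symbol and emoticon features follow the same counting pattern; a minimal sketch is given below, assuming the symbols of Table 1 and the emoticon groups of Table 2 are available as plain Python collections:

    def symbol_features(tweets, symbol):
        """Per-tweet and per-account counts for a single symbol from Table 1."""
        n_tweets = max(len(tweets), 1)
        per_tweet = [t.count(symbol) for t in tweets]
        avg_per_tweet = sum(per_tweet) / n_tweets
        share_tweets_with_symbol = sum(1 for c in per_tweet if c > 0) / n_tweets
        return avg_per_tweet, share_tweets_with_symbol

    def emoticon_features(tweets, emoticon_group):
        """Counts for one emoticon group from Table 2 (e.g. {":)", ":-)", "=)"})."""
        n_tweets = max(len(tweets), 1)
        per_tweet = [sum(t.count(e) for e in emoticon_group) for t in tweets]
        avg_per_tweet = sum(per_tweet) / n_tweets
        share_tweets_with_emoticon = sum(1 for c in per_tweet if c > 0) / n_tweets
        return avg_per_tweet, share_tweets_with_emoticon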

Vocabulary Uniqueness.

Organization accounts on Twitter are often created to be used for specific applications. For example, Yandex.Taxi is designed to support taxi services, while Yandex.Market is related to e-commerce service aggregation. Every specific usage domain reduces the diversity of words in an organization’s microblog account. Based on this assumption, we extracted the following vocabulary-uniqueness features:

  • average number of unique words per account;

  • average number of words not from a vocabulary per account.
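
A minimal sketch of the vocabulary-uniqueness features, where vocabulary is assumed to be a set of known dictionary words:

    def vocabulary_features(tweets, vocabulary):
        """Unique words and out-of-vocabulary words for an account."""
        words = {w.lower().strip(".,!?") for t in tweets for w in t.split()}
        unique_words = len(words)
        oov_words = len([w for w in words if w and w not in vocabulary])
        return unique_words, oov_words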

Hyperlinks.

Users often post URLs to third-party resources, such as events, pictures, etc. URL usage can be a good indicator of individual user accounts. Based on this assumption, we extracted the features below:

  • average number of links per account;

  • average number of tweets with links per account.
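
A minimal sketch of the hyperlink features, assuming a simple URL pattern (the exact pattern used in the paper is not specified):

    import re

    URL_RE = re.compile(r"https?://\S+")

    def link_features(tweets):
        """Total links per account and number of tweets containing at least one link."""
        per_tweet = [len(URL_RE.findall(t)) for t in tweets]
        return sum(per_tweet), sum(1 for c in per_tweet if c > 0)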

Twitter-Specific Features.

Organization accounts are often characterized by a large number of subscribers (followers) but a relatively small number of subscriptions (following). The same holds for popular personage accounts. It is also worth mentioning that corporate accounts are often verified, which is rarely the case for personal accounts, while personage accounts are almost never verified. We thus use the following features:

  • number of subscribers;

  • number of subscriptions;

  • if the account is verified;

  • average number of “favorite” tweets.
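
These features can be read directly from the profile metadata; a minimal sketch, assuming the field names of Twitter API v1.1 user and tweet objects (the paper does not specify the crawling interface):

    def profile_features(profile, tweets):
        """Profile-level features from Twitter API user/tweet objects (assumed field names)."""
        n_tweets = max(len(tweets), 1)
        return {
            "followers": profile["followers_count"],
            "following": profile["friends_count"],
            "verified": int(profile["verified"]),
            "avg_favorites": sum(t.get("favorite_count", 0) for t in tweets) / n_tweets,
        }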

Overall, we suggest 136 features for Twitter account type classification. It is worth mentioning that some of them (such as the usage of hashtags, hyperlinks, and personal words) have not been used before and thus constitute one of the contributions of this study.

4 Experiment Setup

4.1 Data Collection

Due to the lack of publicly available datasets on Twitter account type inference, we collected our own dataset. To do so, we developed a crawler for downloading the last n = 500 tweets of each specified user, where the list of account names was created manually.
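
The paper does not specify the crawling client; below is a rough sketch of such a crawler using the tweepy library for the Twitter REST API v1.1 (both the library choice and the pagination parameters are assumptions):

    import tweepy  # assumed client; `api = tweepy.API(auth)` is configured elsewhere

    def crawl_user(api, screen_name, n=500):
        """Download the last n tweets of a user via the Twitter REST API."""
        return [
            status.full_text
            for status in tweepy.Cursor(
                api.user_timeline,
                screen_name=screen_name,
                count=200,                # maximum page size of the v1.1 endpoint
                tweet_mode="extended",
            ).items(n)
        ]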

4.2 Utilized Machine Learning Methods

We employed the following commonly-used classification baselines implemented in the WEKA machine learning library: k-nearest neighbors, Naïve Bayes classifier, Support Vector Machines (SVM), Decision Trees (the C4.5 version), and Random Forest. These algorithms were applied to profiles represented by the POP-MAP features presented in Sect. 3.

We used several feature selection (FS) algorithms [27] to select only representative features:

  • dependency-based elimination, such as: CFS-BiS, CFS-GS, CFS-LS, CFS-RS, CFS-SBS, CFS-SFS, CFS-SWS, CFS-TS;

  • consistency-based elimination, such as: Cons-BiS, Cons-GS, Cons-LS, Cons-RS, Cons-SBS, Cons-SFS, Cons-SWS;

  • Significant algorithm, which is based on estimating feature “significance”;

  • ReliefF, which measures feature importance by comparing each object with its nearest neighbors from the same and from different classes.

In addition, we utilized the well-known dimensionality reduction algorithm PCA that is also implemented in WEKA.

We evaluate prediction performance using two well-adopted evaluation measures: accuracy and F-measure. Model evaluation is organized using 5-fold cross-validation.
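
The experiments themselves were run in WEKA; purely as an illustration of the evaluation protocol, a rough scikit-learn equivalent is sketched below. The CFS, consistency-based, and ReliefF selectors have no direct scikit-learn counterparts, so a mutual-information-based selector stands in for them here (an assumption, not the method used in the paper):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # X: accounts x 136 POP-MAP features, y: {person, organization, personage} labels.
    X, y = np.random.rand(100, 136), np.random.randint(0, 3, 100)  # placeholder data

    model = make_pipeline(
        SelectKBest(mutual_info_classif, k=40),   # stand-in for WEKA feature selection
        RandomForestClassifier(n_estimators=100, random_state=0),
    )

    accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()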

5 Experiments on Russian Text Corpora

We collected a sample consisting of 298 Russian personal accounts, 160 Russian organization accounts, and 151 Russian personage accounts using the tool and method described in the previous section.

5.1 Comparing Baselines

Since there are no existing solutions for the problem of microblog account type inference, we consider standard text classification techniques as our baselines:

  • Naïve Bayes (NB) is a simple Naïve Bayes classifier with minor preprocessing (all hyperlinks are removed and letters are changed to lowercase) [8].

  • Classifier with stemmer (Stemmer) is NB with Porter’s stemmer applied [21].

  • Classifier with emoticons (Emoticon) is the classifier from the work of Lin [18], which determines chat users’ age and gender based on emoticons in users’ posts. To implement this method, we identified 500 different emoticons.
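
A minimal sketch of the NB baseline with the stated preprocessing (hyperlink removal and lowercasing); the bag-of-words representation and the scikit-learn implementation are assumptions, since the paper does not detail them:

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def preprocess(account_tweets):
        """Concatenate an account's tweets, drop hyperlinks, lowercase."""
        text = " ".join(account_tweets)
        return re.sub(r"https?://\S+", "", text).lower()

    nb_baseline = make_pipeline(CountVectorizer(), MultinomialNB())
    # nb_baseline.fit([preprocess(t) for t in train_accounts], train_labels)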

The baseline results are presented in Table 3. As we can see, stemming expectedly improved NB, and the Stemmer baseline also outperformed Emoticon. This is possibly because organizations use less formal language on Twitter than we expected.

Table 3. Results of baselines for account classification for the Russian language.

5.2 Comparing Approaches Trained on POP-MAP Features

POP-MAP without Feature Selection.

We conducted experiments using the setup described in Sect. 4 on the collected dataset. The results are presented in Table 4. The best performance was shown by Random Forest, which is consistent with a previous study [12] and can be explained by its feature selection ability.

Table 4. Results for account classification for the Russian language without feature selection.

POP-MAP with Feature Selection.

To improve classification performance, we applied dimensionality reduction algorithms described in Sect. 4. First, we applied PCA. As we can see from Table 5, PCA did not improve the classification performance.

Table 5. Results for account classification for the Russian language with PCA.

Then we picked the best feature selection algorithm for each classifier with respect to the resulting performance. The evaluation results are presented in Table 6. As can be seen, feature selection improved the performance of all models. However, Random Forest kept its position as the best classifier, which can be explained by its additional built-in feature selection ability.

Table 6. Results for account classification for the Russian language with feature selection.

5.3 Results Summary

From Table 6, it can be seen that the best performance was achieved by the Random Forest classifier on the CFS-TS-preprocessed data. The contingency matrix presented in Table 7 shows that the resulting classifier makes a small number of misclassifications, while the most difficult task for it is to distinguish between personal and personage accounts. This can be explained by the similar nature of these two account types, which conforms well with a manual comparison of such accounts.

Table 7. Contingency table of the best classifier for the Russian language.

We used the mutual information (MI) measure to estimate feature importance. The most valuable features are the average number of personal words per account (0.679), the average number of personal pronouns per tweet (0.633), the average number of personal words per tweet (0.472), the average number of links per account (0.402), and the number of subscriptions (0.378). Among the other features with MI greater than 0.2, seven are POS features, one is the number of tweets with links per account, and two are tweet length features.
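
A sketch of how such an MI ranking can be produced; the use of scikit-learn’s mutual_info_classif is an assumption, as the paper does not name the implementation:

    from sklearn.feature_selection import mutual_info_classif

    def rank_features(X, y, feature_names):
        """Rank features by mutual information with the account type label."""
        scores = mutual_info_classif(X, y, random_state=0)
        return sorted(zip(feature_names, scores), key=lambda p: p[1], reverse=True)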

As we can see, the most important features are related to personality and references. We may expect a similar situation for the English language.

6 Experiments on English Text Corpora

6.1 Dataset

To perform evaluation on the English corpora, we collected a sample consisting of 281 English personal accounts, 130 English organization accounts, and 130 English personage accounts using the tool and method described in Sect. 4.

6.2 Results

In this setup, we tested only Random Forest, since it showed the best performance for the Russian language. The best result was achieved after applying the Cons-GS algorithm, which selected 44 features and yielded an accuracy of 0.894 and an F-measure of 0.879. The contingency table is presented in Table 8. The resulting classifier also makes only a small number of mistakes. As we can see, the classifier for the English corpora outperforms the best classifier for the Russian corpora.

Table 8. Contingency table of the best classifier for the English language.

The most valuable features with respect to MI are: the number of subscriptions (0.709), the average number of personal words per account (0.516), whether the account is verified (0.479), the average number of tweets with links per account (0.290), and the average number of unique signs per account (0.274). Among the other features with MI greater than 0.2, four are symbol features, one is the number of subscribers, one is the average number of hyperlinks per tweet, and one is the average length of tweets.

We can see that personal words are also a strong feature, besides the Twitter-specific features. However, POS-based features are not at the top, as they are for Russian. Instead, symbol-specific features are useful for English.

6.3 Results for Binary Classification

We also compared our results with the results reported in [23], where the authors classified microblog accounts only into personal and corporate types. To do so, we selected only personal and organization accounts from the initial datasets and ran the best classifiers built for English and Russian. The results of the comparison are presented in Table 9. As can be seen, the POP-MAP results on both the Russian and English corpora are similarly high and significantly surpass the behavior-based approach.

Table 9. Results of binary (personal vs. corporate) account classification.

7 Conclusion

In this paper, we addressed the problem of Twitter account classification. We described 136 features, which we then used in different classification models. We ran experiments on corpora of Russian and English tweets and achieved similarly high classification performance for both languages with the Random Forest model.

However, we discovered that there is a difference in text feature importance between the two languages, while the Twitter-specific features have the same importance. The only exception is the strong feature related to personal words, which is useful in both English and Russian.

The research is supported by the Government of the Russian Federation, Grant 08-08.