1 Introduction

Web scientists use social media as a rich source of information about users’ individuality, behavior, and preferences [9, 13, 15, 25]. It is used to recover user profiles [3, 10, 12] and make targeted recommendations [11, 19]. The availability of these personal user attributes allows them to compete with traditional sociologists, epidemiologists, and political experts in such tasks as voting outcome prediction [14, 24], disease outbreak prediction [7, 17], or group population visualization [1]. However, in most web science studies the representativeness of the data is extremely low due to a significant level of noise.

The noise in social media is often related to the fact that not all accounts represent a real human. For example, this can be caused by bots that mimic human behavior while being governed by an algorithm or by another human. Many works are devoted to detecting such accounts [4,5,6, 26]. At the same time, some microblog accounts may not represent a person but be related to something else: accounts of corporations (Adidas), banks (DBS Bank), museums (The State Hermitage Museum), animals (Grumpy Cat), or personages (such as Harry Potter). These accounts represent a certain subject that may or may not be equipped with the aforementioned personal user attributes (i.e., demographics). However, most of them are irrelevant to social studies.

Nevertheless, most existing social media analysis studies either do not perform irrelevant account filtering [11, 12], perform it manually [16, 22], or do not utilize openly available user-generated data [20, 23]. For example, Tavares et al. [23] presented a method to classify personal and corporate accounts, which solved the problem with 84.6% accuracy. However, the authors did not use user-generated content, which may result in sub-optimal performance due to the lack of data representativeness. At the same time, Oentaryo et al. [20] utilized contextual, social, and temporal features, which allowed a gradient boosting algorithm to achieve 91% account type classification accuracy. However, the employed data types are often not available for public use, which constrains the applicability of the proposed approach to real-world scenarios.

In our study, we therefore perform the task of microblog account type inference based on textual user-generated content only, which makes it applicable in real-world settings. We assume that textual data is sufficient for achieving high classification performance and train our proposed POP-MAP framework to perform “Person”-“Organization”-“Personage” Microblog Account Prediction.

2 On Microblog Account Typology

A microblog is a specific type of social media resource that allows its users to share short status updates with their subscribers. One of the most well-known microblogs is Twitter, where messages (statuses) are publicly accessible, in contrast to other big social networks such as Facebook, and the length of a message cannot exceed 140 symbols (280 since the end of 2017), which makes its posts standardized and rarely covering more than one topic [28].

According to Barone et al. [2], each Twitter account belongs to one of the following five types:

  1. Corporate Account, which is typically a company news feed, for example Facebook, Google, Yandex, or VKontakte.

  2. Corporate-led Persona Account, which is a corporate account that includes both personal and business sides. For example, the account of the online shop Zappos is, in fact, Tony Hsieh’s account.

  3. Strictly Personal Account, which is an account representing an individual microblog user.

  4. Business/Personal Hybrid Account, which is a mixture of the personal and professional account types, where most of the tweets contain information about the user, but a considerable number of tweets is also dedicated to the user’s professional interests. Accounts of famous people, such as Pavel Durov or Jimmy Wales, usually belong to this type.

  5. Personage Account, which is a personage-based account typically representing an animal, plant, or fictional hero.

In this paper, we adopt the three most popular account types from the above categorization: organization account, personal account, and personage account. The other two hybrid types are considered to be a part of the selected ones, so that all Corporate-led Persona Accounts are treated as organization accounts, while Business/Personal Hybrid Accounts are considered to be personal accounts.

3 Feature Extraction

Classification algorithms strongly depend on the features that describe objects. Thus, feature engineering is a key step in solving most data mining problems. In this section, we define all the features we use to describe a Twitter account.

Word Frequency.

Individual users typically use everyday vocabulary in their tweets, while organizations may adopt a domain-specific vocabulary that can be a good indicator of the organization account type. In accordance with this assumption, we use the following features:

  • average word frequency among all words in a tweet;

  • average word frequency among all words in all of a user’s tweets.

We utilized Sharov’s Frequency Dictionary and the Word Frequency Data resource to obtain the general usage frequency of Russian and English words, respectively.
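
A minimal sketch of how these two features could be computed, assuming the frequency dictionary has been loaded into a Python dict mapping a word to its corpus frequency (words absent from the dictionary are treated as having zero frequency):

    import re

    WORD_RE = re.compile(r"[A-Za-zА-Яа-яЁё']+")

    def avg_word_frequency(text, freq_dict):
        """Average corpus frequency over the words of a single tweet."""
        words = [w.lower() for w in WORD_RE.findall(text)]
        if not words:
            return 0.0
        return sum(freq_dict.get(w, 0.0) for w in words) / len(words)

    def account_word_frequency(tweets, freq_dict):
        """Average corpus frequency over all words in all tweets of an account."""
        words = [w.lower() for t in tweets for w in WORD_RE.findall(t)]
        if not words:
            return 0.0
        return sum(freq_dict.get(w, 0.0) for w in words) / len(words)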

Spelling Mistakes.

It is well known that individual user accounts tend to contain more grammatical mistakes and misspellings than properly maintained organizational accounts. Inspired by this phenomenon, we utilized LanguageTool to extract the number of mistakes/misspellings per account.
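
A minimal sketch of this feature, assuming the language_tool_python wrapper around LanguageTool (the paper only states that LanguageTool itself was used):

    import language_tool_python  # assumed wrapper; the paper only names LanguageTool

    def mistakes_per_account(tweets, lang="ru-RU"):
        """Count grammar/spelling issues reported by LanguageTool over all tweets."""
        tool = language_tool_python.LanguageTool(lang)
        return sum(len(tool.check(tweet)) for tweet in tweets)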

Hashtags.

Hashtags are often used to group microblog messages and improve Twitter search. Personal accounts are characterized by extensive use of hashtags to express their thoughts and feelings, as compared to corporate accounts. We thus extracted the following hashtag-based features:

  • average number of unique hashtags per account;

  • average number of hashtags per tweet;

  • average length of hashtag per tweet.

Users’ Mentions.

Similar to hashtags, user mentions are widespread in social networks. However, we cannot expect personage accounts to use them often due to the lower number of actual social ties between them and individual Twitter users. To incorporate this aspect, we extracted the following user mention features:

  • average number of unique mentions per account;

  • average number of mentions per tweet;

  • average length of mention per tweet.
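
Both the hashtag and the mention features can be extracted with simple pattern matching; a minimal sketch is given below (the exact tokenization rules used in the paper are not specified):

    import re

    HASHTAG_RE = re.compile(r"#(\w+)")
    MENTION_RE = re.compile(r"@(\w+)")

    def tag_features(tweets, pattern):
        """Return (unique tags per account, avg tags per tweet, avg tag length)."""
        per_tweet = [pattern.findall(t) for t in tweets]
        all_tags = [tag for tags in per_tweet for tag in tags]
        n_tweets = max(len(tweets), 1)
        avg_len = sum(len(tag) for tag in all_tags) / max(len(all_tags), 1)
        return len(set(all_tags)), len(all_tags) / n_tweets, avg_len

    tweets = ["Great day at #work with @alice!", "Just #coffee"]  # toy example
    hashtag_feats = tag_features(tweets, HASHTAG_RE)
    mention_feats = tag_features(tweets, MENTION_RE)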

Tweet/Word Length.

Many abbreviations and acronyms (e.g., “gotcha” meaning “I got you”) are widespread among users of social networks, since they help fit more information into a short Twitter message. These abbreviations, however, are not popular among organizational Twitter accounts. Therefore, we extracted the following features representing text length:

  • average length of word per account;

  • average length of tweet per account.
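
A minimal sketch of the two length features:

    def length_features(tweets):
        """Average word length and average tweet length (in characters) for an account."""
        words = [w for t in tweets for w in t.split()]
        avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
        avg_tweet_len = sum(len(t) for t in tweets) / max(len(tweets), 1)
        return avg_word_len, avg_tweet_len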

Part of Speech (POS).

To reflect different styles of language use, we included features related to words’ POS. The following POS groups have been identified:

  • noun;

  • verb;

  • personal pronoun;

  • pronoun (others);

  • adjective;

  • adverb;

  • preposition, conjunction, particle;

  • adverb + adjective;

  • adverb + adverb.

For each group, we then calculated the following features:

  • average number of words from the group per account;

  • average number of words from the group per tweet;

  • average number of negative particles per account.
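
One possible way to obtain these counts is sketched below for English with NLTK’s Penn Treebank tagger (for Russian, a morphological tagger such as pymorphy2 would be required); the mapping from tags to the groups above is our assumption and covers only part of the groups:

    import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

    # Assumed mapping from Penn Treebank tags to some of the POS groups listed above.
    POS_GROUPS = {
        "noun": {"NN", "NNS", "NNP", "NNPS"},
        "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
        "personal_pronoun": {"PRP", "PRP$"},
        "adjective": {"JJ", "JJR", "JJS"},
        "adverb": {"RB", "RBR", "RBS"},
    }

    def pos_group_counts(tweets):
        """Count words of each POS group over all tweets of an account.

        Per-tweet averages are obtained by dividing by the number of tweets."""
        counts = {group: 0 for group in POS_GROUPS}
        for tweet in tweets:
            for _, tag in nltk.pos_tag(nltk.word_tokenize(tweet)):
                for group, tags in POS_GROUPS.items():
                    if tag in tags:
                        counts[group] += 1
        return counts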

Personal Words.

Accounts belonging to people or personages can be easily identified by the so-called personal words. Inspired by this fact, we extracted “average number of personal words per account” feature.
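
A minimal sketch of this feature; the list of personal words below is purely illustrative, as the paper does not enumerate the actual list:

    # Illustrative (assumed) list of personal words; the paper's actual list is not given.
    PERSONAL_WORDS = {"i", "me", "my", "mine", "myself", "я", "мне", "меня", "мой", "моя"}

    def personal_words_per_account(tweets):
        """Total number of personal words over all tweets of an account."""
        return sum(
            1
            for tweet in tweets
            for word in tweet.lower().split()
            if word.strip(".,!?") in PERSONAL_WORDS
        )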

Symbols.

Similarly to previous studies, for each symbol in Table 1 we calculated the following features:

Table 1. Symbols that are used to calculate features.

  • average number of signs per tweet;

  • average number of unique signs per tweet;

  • average number of tweets with a sign per account;

  • average number of occurrences of a given sign per tweet;

  • average number of tweets with signs per account;

  • average number of unique signs per account.

Emoticons.

Similar to the symbol features, for each group of emoticons in Table 2, we calculated the following emoticon features:

Table 2. Emoticon groups that are used to calculate features.

  • average number of emoticons per tweet;

  • average number of tweets with emoticon per account;

  • average number of occurrences of a given emoticon per tweet;

  • average number of unique emoticons per account.
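
The symbol and emoticon features follow the same counting pattern; a minimal sketch is given below, assuming the symbols of Table 1 and the emoticon groups of Table 2 are available as plain Python collections:

    def symbol_features(tweets, symbol):
        """Per-tweet and per-account counts for a single symbol from Table 1."""
        n_tweets = max(len(tweets), 1)
        per_tweet = [t.count(symbol) for t in tweets]
        avg_per_tweet = sum(per_tweet) / n_tweets
        share_tweets_with_symbol = sum(1 for c in per_tweet if c > 0) / n_tweets
        return avg_per_tweet, share_tweets_with_symbol

    def emoticon_features(tweets, emoticon_group):
        """Counts for one emoticon group from Table 2 (e.g. {":)", ":-)", "=)"})."""
        n_tweets = max(len(tweets), 1)
        per_tweet = [sum(t.count(e) for e in emoticon_group) for t in tweets]
        avg_per_tweet = sum(per_tweet) / n_tweets
        share_tweets_with_emoticon = sum(1 for c in per_tweet if c > 0) / n_tweets
        return avg_per_tweet, share_tweets_with_emoticon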

Vocabulary Uniqueness.

Organization accounts on Twitter are often created to be used for specific applications. For example, Yandex.Taxi is designed to support taxi services, while Yandex.Market is related to e-commerce service aggregation. Every specific usage domain reduces the diversity of words in an organization’s microblog account. Based on this assumption, we extracted the following vocabulary-uniqueness features:

  • average number of unique words per account;

  • average number of words not from a vocabulary per account.
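
A minimal sketch of the vocabulary-uniqueness features, where vocabulary is assumed to be a set of known dictionary words:

    def vocabulary_features(tweets, vocabulary):
        """Unique words and out-of-vocabulary words for an account."""
        words = {w.lower().strip(".,!?") for t in tweets for w in t.split()}
        unique_words = len(words)
        oov_words = len([w for w in words if w and w not in vocabulary])
        return unique_words, oov_words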

Hyperlinks.

Users often post URLs to third-party resources, such as events, pictures, etc. URL usage can be a good indicator of individual user accounts. Based on this assumption, we extracted the features below:

  • average number of links per account;

  • average number of tweets with links per account.
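
A minimal sketch of the hyperlink features, assuming a simple URL pattern (the exact pattern used in the paper is not specified):

    import re

    URL_RE = re.compile(r"https?://\S+")

    def link_features(tweets):
        """Total links per account and number of tweets containing at least one link."""
        per_tweet = [len(URL_RE.findall(t)) for t in tweets]
        return sum(per_tweet), sum(1 for c in per_tweet if c > 0)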

Twitter-Specific Features.

Organization accounts are often characterized by a large number of subscribers (followers) but a relatively small number of subscriptions (following). The same holds for popular personage accounts. It is also worth mentioning that corporate accounts are often verified, which is rarely the case for personal accounts, while personage accounts are almost never verified. We thus use the following features:

  • number of subscribers;

  • number of subscriptions;

  • if the account is verified;

  • average number of “favorite” tweets.
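
These features can be read directly from the profile metadata; a minimal sketch, assuming the field names of Twitter API v1.1 user and tweet objects (the paper does not specify the crawling interface):

    def profile_features(profile, tweets):
        """Profile-level features from Twitter API user/tweet objects (assumed field names)."""
        n_tweets = max(len(tweets), 1)
        return {
            "followers": profile["followers_count"],
            "following": profile["friends_count"],
            "verified": int(profile["verified"]),
            "avg_favorites": sum(t.get("favorite_count", 0) for t in tweets) / n_tweets,
        }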

Overall, we suggest 136 features for Twitter account type classification. It is worth mentioning that some of them (such as the usage of hashtags, hyperlinks, and personal words) have not been used before and thus constitute one of the contributions of this study.

4 Experiment Setup

4.1 Data Collection

Due to the lack of publicly available datasets on Twitter account type inference, we collected our own dataset. To do so, we developed a crawler for downloading the last n = 500 tweets of each specified user, where the list of account names was created manually.
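
The paper does not specify the crawling client; below is a rough sketch of such a crawler using the tweepy library for the Twitter REST API v1.1 (both the library choice and the pagination parameters are assumptions):

    import tweepy  # assumed client; `api = tweepy.API(auth)` is configured elsewhere

    def crawl_user(api, screen_name, n=500):
        """Download the last n tweets of a user via the Twitter REST API."""
        return [
            status.full_text
            for status in tweepy.Cursor(
                api.user_timeline,
                screen_name=screen_name,
                count=200,                # maximum page size of the v1.1 endpoint
                tweet_mode="extended",
            ).items(n)
        ]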

4.2 Utilized Machine Learning Methods

We employed the following commonly-used classification baselines implemented in the WEKA machine learning library: k-nearest neighbors, Naïve Bayes classifier, Support Vector Machines (SVM), Decision Trees (the C4.5 version), and Random Forest. These algorithms were applied to profiles represented by the POP-MAP features presented in Sect. 3.

We used several feature selection (FS) algorithms [27] to select only representative features:

  • dependency-based elimination, such as: CFS-BiS, CFS-GS, CFS-LS, CFS-RS, CFS-SBS, CFS-SFS, CFS-SWS, CFS-TS;

  • consistency-based elimination, such as: Cons-BiS, Cons-GS, Cons-LS, Cons-RS, Cons-SBS, Cons-SFS, Cons-SWS;

  • Significant algorithm, which is based on estimating feature “significance”;

  • ReliefF, which measures feature importance by comparing each object with its nearest neighbors from the same and from different classes.

In addition, we utilized the well-known dimensionality reduction algorithm PCA that is also implemented in WEKA.

We evaluate prediction performance using two well-adopted evaluation measures: accuracy and F-measure. Model evaluation is organized using 5-fold cross-validation.
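
The experiments themselves were run in WEKA; purely as an illustration of the evaluation protocol, a rough scikit-learn equivalent is sketched below. The CFS, consistency-based, and ReliefF selectors have no direct scikit-learn counterparts, so a mutual-information-based selector stands in for them here (an assumption, not the method used in the paper):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # X: accounts x 136 POP-MAP features, y: {person, organization, personage} labels.
    X, y = np.random.rand(100, 136), np.random.randint(0, 3, 100)  # placeholder data

    model = make_pipeline(
        SelectKBest(mutual_info_classif, k=40),   # stand-in for WEKA feature selection
        RandomForestClassifier(n_estimators=100, random_state=0),
    )

    accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()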

5 Experiments on Russian Text Corpora

We collected a sample consisting of 298 Russian personal accounts, 160 Russian organization accounts, and 151 Russian personage accounts using the tool and method described in the previous section.

5.1 Comparing Baselines

Since there are no existing solutions for the problem of microblog account type inference, we consider standard text classification techniques as our baselines:

  • Naïve Bayes (NB) is a simple Naïve Bayes classifier with minor preprocessing (all hyperlinks are removed and letters are changed to lowercase) [8].

  • Classifier with stemmer (Stemmer) is NB with Porter’s stemmer applied [21].

  • Classifier with emoticons (Emoticon) is the classifier from the work of Lin [18], which determines chat users’ age and gender based on emoticons in users’ posts. To implement this method, we identified 500 different emoticons.
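
A minimal sketch of the NB baseline with the stated preprocessing (hyperlink removal and lowercasing); the bag-of-words representation and the scikit-learn implementation are assumptions, since the paper does not detail them:

    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def preprocess(account_tweets):
        """Concatenate an account's tweets, drop hyperlinks, lowercase."""
        text = " ".join(account_tweets)
        return re.sub(r"https?://\S+", "", text).lower()

    nb_baseline = make_pipeline(CountVectorizer(), MultinomialNB())
    # nb_baseline.fit([preprocess(t) for t in train_accounts], train_labels)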

The baseline results are presented in Table 3. As we can see, stemming expectedly improved NB, and the Stemmer baseline also outperformed Emoticon. This is possibly because organizations use less formal language on Twitter than we expected.

Table 3. Results of baselines for account classification for the Russian language.

5.2 Comparing Approaches Trained on POP-MAP Features

POP-MAP without Feature Selection.

We conducted experiments using the setup described in Sect. 4 on the collected dataset. The results are presented in Table 4. The best performance was shown by Random Forest, which is consistent with a previous study [12] and can be explained by its feature selection ability.

Table 4. Results for account classification for the Russian language without feature selection.

POP-MAP with Feature Selection.

To improve classification performance, we applied dimensionality reduction algorithms described in Sect. 4. First, we applied PCA. As we can see from Table 5, PCA did not improve the classification performance.

Table 5. Results for account classification for the Russian language with PCA.

Then we picked the best feature selection algorithm for each classifier with respect to the resulting performance. The evaluation results are presented in Table 6. As can be seen, feature selection improved the performance of all models. However, Random Forest kept its position as the best classifier, which can be explained by its additional built-in feature selection ability.

Table 6. Results for account classification for the Russian language with feature selection.

5.3 Results Summary

From Table 6, it can be seen that the best performance was achieved by the Random Forest classifier on the CFS-TS-preprocessed data. The contingency matrix presented in Table 7 shows that the resulting classifier makes a small number of misclassifications, while the most difficult task for it is to distinguish between personal and personage accounts. This can be explained by the similar nature of these two account types, which conforms well with a manual comparison of such accounts.

Table 7. Contingency table of the best classifier for the Russian language.

We used the mutual information (MI) measure to estimate feature importance. The most valuable features are the average number of personal words per account (0.679), the average number of personal pronouns per tweet (0.633), the average number of personal words per tweet (0.472), the average number of links per account (0.402), and the number of subscriptions (0.378). Among the other features with MI greater than 0.2, seven are POS features, one is the number of tweets with links per account, and two are tweet length features.
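
A sketch of how such an MI ranking can be produced; the use of scikit-learn’s mutual_info_classif is an assumption, as the paper does not name the implementation:

    from sklearn.feature_selection import mutual_info_classif

    def rank_features(X, y, feature_names):
        """Rank features by mutual information with the account type label."""
        scores = mutual_info_classif(X, y, random_state=0)
        return sorted(zip(feature_names, scores), key=lambda p: p[1], reverse=True)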

As we can see, the most important features are related to personality and references. We may expect a similar situation for the English language.

6 Experiments on English Text Corpora

6.1 Dataset

To perform evaluation on the English corpora, we collected a sample consisting of 281 English personal accounts, 130 English organization accounts, and 130 English personage accounts using the tool and method described in Sect. 4.

6.2 Results

In this setup, we tested only Random Forest, since it showed the best performance for the Russian language. The best result was achieved after applying the Cons-GS algorithm, which selected 44 features and yielded an accuracy of 0.894 and an F-measure of 0.879. The contingency table is presented in Table 8. The resulting classifier also makes only a small number of mistakes. As we can see, the classifier for the English corpora outperforms the best classifier for the Russian corpora.

Table 8. Contingency table of the best classifier for the English language.

The most valuable features with respect to MI are: the number of subscriptions (0.709), the average number of personal words per account (0.516), whether the account is verified (0.479), the average number of tweets with links per account (0.290), and the average number of unique signs per account (0.274). Among the other features with MI greater than 0.2, four are symbol features, one is the number of subscribers, one is the average number of hyperlinks per tweet, and one is the average length of tweets.

We can see that personal words are also a strong feature, besides the Twitter-specific features. However, POS-based features are not at the top, as they are for Russian. Instead, symbol-specific features are useful for English.

6.3 Results for Binary Classification

We also compared our results with the results reported in [23], where the authors classified microblog accounts only into personal and corporate types. To do so, we selected only personal and organization accounts from the initial datasets and ran the best classifiers built for English and Russian. The results of the comparison are presented in Table 9. As can be seen, the POP-MAP results on both the Russian and English corpora are similarly high and significantly surpass the behavior-based approach.

Table 9. Results of binary (personal vs. corporate) account classification.

7 Conclusion

In this paper, we addressed the problem of Twitter account classification. We described 136 features, which we then used in different classification models. We ran experiments on corpora of Russian and English tweets and achieved similarly high classification performance for both languages with the Random Forest model.

However, we discovered that there is a difference in text feature importance between the two languages, while the Twitter-specific features have the same importance. The only exception is the strong feature related to personal words, which is useful in both English and Russian.

The research is supported by the Government of the Russian Federation, Grant 08-08.