1 Introduction

The rapid growth of social media networks such as Facebook, Twitter and Instagram has given the opportunity to users to publish content, communicate and exchange information with others in a very fast way. Often, social media users hide their real identity such as their true names, genders and also their occupations in order to express their opinions freely. Some other users mask their real information for fraudulent and other erroneous acts. The huge amounts of unstructured texts and the easiness of posting content in social media networks make the task of identifying the user’s category (i.e., human, false profile, bot, etc.) difficult. Therefore, it has become very important to provide effective tools and methods to identify the true identity of users. An interesting field that has attracted the attention of several researchers is author profiling (AP). AP is a subtask of authorship analysis whose goal is to identify demographic attributes of authors such as age (Rangel et al. 2014), gender (Rangel et al. 2013) and personality traits (Pardo et al. 2015) by examining his/her written text. AP has a growing importance in several applications related to forensics purpose, security (Juola 2015), marketing (Pardo et al. 2015), psychology and terrorism prevention.

Twitter has become increasingly popular in the last two decades (Sendi et al. 2017). Twitter is a “microblogging” system that allows us to send and receive short and informal texts called tweets. It has a large number of users. In recent years, bots have become ubiquitous in Twitter (Chu et al. 2012; Davis et al. 2016; Subrahmanian et al. 2016). Twitter bots can be used for helpful purposes, such as broadcasting important content in different areas such as ideology, politics and marketing. They have been also used for different malicious purposes, such as spreading fake information, spamming, violating others’ privacy and distributing malware. In Bessi and Ferrara (2016), research shows that in 2016 US presidential Election, more than a one-fifth of tweets on Twitter came from bot accounts. Therefore, detection of social bots, especially Twitter bots, has become an important research domain across the globe for a variety of security purposes. As a result, it is important to develop AP systems capable of distinguishing bot profiles from human ones, as well as identify the gender of a human author. In this article, we present our approach for the PAN 2019 AP challenge (Pardo and Rosso 2019). The task here is to determine whether a tweet’s author is a bot or a human and also to classify the gender of human users as male or female for English and Spanish datasets. Further details about our method are presented in the following sections.

1.1 Goal and contributions

The study that we propose in this article introduces the problem of AP in social networks, especially in Twitter. This work aims to distinguish bots from humans and in case of a human, to predict the gender of authors in Twitter based on statistical features. The goal of this work is to establish which stylometry-based features help to best address this challenge. For this purpose, we used the corpus from the AP task of PAN 2019 (Pardo and Rosso 2019). In this context, various well-known machine learning algorithms have been used. The random forest technique was the best performing for this work. The main contributions of this paper are outlined below:

  • We propose a language-independent system for AP based on statistical features. Specifically, we provided empirical evidence regarding the pertinence of the style-based features for solving the posed task across Twitter;

  • We evaluate the performance of our approach using machine learning techniques;

  • We identify the most suitable classifier for AP application using statistical features;

  • We discuss three state-of-the-art methods proposed to address the same task. Additionally, we compare their performance against the results of our proposed approach.

1.2 Paper structure

The remainder of this article is structured as follows. We start by giving a brief overview of existing work on bots detection problem and gender identification tasks in Sect. 2. Section  3 presents the motivations for the proposed method. In Sect. 4, we introduce our approach in which we explain how we preprocessed the data and what types of features could be considered as significant indicators for both gender and bot detection, we also present different classifiers used in our implemented model and we detail the proposed algorithm used for the classification tasks. Section 5 presents the experimental setup. We precisely present the used dataset, the strategy of defining the hyperparameters for all machine learning algorithms used. The evaluation metrics employed are also described in this part, and the obtained experimental results are discussed. Section 6 is reserved to present some limits of our approach. Section 7 concludes the paper and suggests some future directions.

2 Related work

By analyzing users posts on social networks, several researchers were able to determine different users’ aspects such as gender, age and language. In recent years, the interest in the bot and gender profiling field of research has been increasing (Varol et al. 2017; Rangel et al. 2018), because numerous users tend to share false information, hate publications and fake news. In particular, bots disseminate often incorrect and false information. Therefore, their detection in social media is very much important. In Sect. 2.1, we provide a review of previous studies on the automatic bot detection task. In Sect. 2.2, other studies related to automatic gender classification task are presented. We mainly focus on the different features used in each study as well as the learning algorithms employed in the classification phase.

2.1 Automatic bot identification

Broadly speaking, several approaches have been proposed in the literature for finding different demographics of an author on social media. Content-based, emotion-based, topic-based and deep learning approaches are the main methods proposed by researchers in this field. In this section, we will provide an overview of relevant related works on bot detection. Random forest was the classification technique that has been proven to give the best performance for bot detection on Twitter (Lee et al. 2011; Varol et al. 2017; Singh et al. 2018).

In Fernquist et al. (2018), the authors applied a content-based approach that uses the random forest algorithm to recognize automatic behaviors and detect bots tweeting about the Swedish election. When training their model, they used 140 different features which were divided into two types: metadata and content-based features. The authors found that the most significant feature is produced by the calculation of the number of likes an account has given divided by the number of friends the account has. The second most significant feature they identified is the ratio between the number of followers and friends, followed by the time between retweets. This method achieved good performance. Chu et al. (2012) focus on the classification of human, bot and cyborg accounts on Twitter. They used different features such as those related to account properties, tweeting behavior and tweet content. For the classification phase, they applied the random forest algorithm and they obtained 98% accuracy for humans, 96% for bots and 91% for cyborgs. SentiBot (Dickerson et al. 2014) is a sentiment-aware architecture for identifying bots on Twitter. The authors used a combination of features including tweet syntax, semantics and user behaviors to distinguish human and social bots. To test this method, they employed a large dataset relating to the 2014 Indian election and trained it with several machine learning classifiers. They show that a number of sentiment related factors are keys to the identification of bots, significantly increasing the area under the ROC curve (from 0.65 to 0.73). BotOrNot (Davis et al. 2016) is a publicly available service via the website or via Python or RESTAPIs which aims to compute the botness of a particular user (i.e., the likelihood that an account is a bot). Authors in this paper used more than 1000 features which are grouped into six classes: network, user, friends, temporal, content and sentiment. BotOrNot’s classifier used random forest technique and achieved 86% accuracy in detecting bots and humans on Twitter. Oentaryo et al. (2016) profiled Twitter users and classified them into three broad categories: broadcast, consumption and spam bots. They developed a new systematic bot profiling framework using a rich set of numeric, categorical and series features as well as a set of classification models. Empirical studies showed that the diversities of timing patterns for posting activities constitute the key features to effectively identify the behavioral traits of different bot types. Based on the precision results, the authors showed that the random forest classifier outperforms the other machine learning algorithms tested with 84.32%. The authors of Alarifi et al. (2016) collected and manually labeled a dataset of Twitter accounts, including bots, human users and hybrids (i.e., tweets posted by both human and bots). In this work, random forest and Bayesian algorithms reached the best performance at both two-class (bot/human) and three-class classification (bot/human/hybrid). Recently, in Hall et al. (2018) the authors proposed a new strategy for bot detection in Wikidata using a set of comment-based features of user. The comments-based features help to examine the editing behavior of registered and non-registered users. For classification, the authors tested the random forest algorithm and a gradient boosting model and applied optimization by hyperparameter for both classifiers. They showed that this method yields a high level of fitness (ROC-AUC= 0.985). In the work of Cai et al. (2017), a behavior enhanced deep model (BeDM) for bot detection was proposed and applied to detect bots under two deep learning frameworks. This method regards user content as temporal text data instead of plain text and fuses content information and behavior information to capture the latent features. Using a honeypot method, the authors of this study collected a public Twitter-related dataset for their experiments using a convolution neural network (CNN) network. Their model was investigated by varying the number of filters, the filter width and the number of hidden units in the hidden layer. Experiments using tenfold cross-validation showed a result for precision, recall and F1-measure as 88.41%, 86.26% and 87.32%, respectively. In Wei and Nguyen (2019), the authors developed a recurrent neural networks (biLSTM) word embeddings-based model to classify human and spambot accounts on Twitter. This model requires no prior knowledge or handcrafted features. Experiments on the cresci-2017 dataset showed that their proposed approach achieved encouraging performance compared with existing state-of-the-art bot detection systems. They also confirmed the ability of using biLSTM with word embeddings in their model to detect phishing email, webpages or SMSs. Kudugunta and Ferrara (2018) proposed also a deep neural network based on contextual LSTMs architecture that exploits both content and metadata to detect bots on Twitter. There are two classification tasks in this study: account-level bot detection and tweet-level bot detection.

2.2 Automatic gender classification

Several studies have been carried out on the classification of users’ gender on social media. In previous studies, many researchers have explored different methods such as content-based, stylometry-based, emotion-based and deep learning models to address this issue. In this section, we will provide an overview of these relevant related works. This can provide a basic understanding on the state-of-the-art of gender classification methods and approaches. In Isbister et al. (2017), the authors proposed a content-based approach to classify the gender of a blog author on five different languages: English, French, Spanish, Swedish and Russian. They used features from Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al. 2001). Experimental results showed that by using support vector machine with features from LIWC, the authors could obtain an accuracy between 73 and 79% depending on the language. In Daneshvar and Inkpen (2018b), authors presented a new content-based approach for gender identification in Twitter. They proposed a linear support vector machine classifier, with different types of word and character n-grams as features. This model was the best-performing model in textual classification, with the accuracy of 82.21%, 82% and 80.9% on the English, Spanish and Arabic datasets, respectively. Fatima et al. (2017) investigated content-based features (word and character n-grams) and 64 stylometry-based features (11 lexical word-based, 47 lexical character-based and 6 vocabulary measures) for the identification of gender and age traits on multilingual corpora. For the classification, several machine learning techniques were used. For gender prediction, best results were obtained using random forest and naive Bayes classifiers. Patra et al. (2018) described a multi-modal author profiling systems using Twitter datasets in three different languages: Arabic, English and Spanish. This system aimed to identify the author’s gender. An existing image captioning system was used to extract information present in images. As features, the authors extracted mainly latent semantic analysis, word embeddings and stylistic features from tweets as well as captions. Among three languages, the proposed model for English outperformed the other two languages with an accuracy of 77.37% using support vector machine. Rangel and Rosso (2016), Rangel and Rosso investigated the impact of emotions on gender and age identification. The authors proposed a graph-based approach to model the way people use the language and the emotions when writing. The graph was built with the different parts of speech of user’s texts and enriches it with semantic information with the topics they speak about, the type of verbs they use and the emotions they express. This approach that employs both style-based and EmoGraph-based features allowed to identifying gender with an acceptable accuracy using support vector machine classifier. In Safara et al. (2020), Safara et al. used character-based features, syntax-based features, word-based features and structure-based features to detect the gender of an email author. An artificial neural network (ANN) was employed as a classifier, and the whale optimization algorithm (WOA) was used to find optimal weights and biases for improving the accuracy of the ANN classification. Their proposed model achieved an accuracy of 98%. In Garibay et al. (2015), the authors described their methodology for age and gender prediction in Twitter. The authors presented each tweet with a bag of words in a vector space. TF-IDF measure was used to assign a value to each word in a vector. The proposed system presented some failures with the classification of the gender class which affected its performance. To justify this failure, the authors believed that the training of models was over-fitted by the number of estimators in both classification and regression random forest models used. In Takahashi et al. (2018), authors described a neural network model for the gender identification on Twitter named “Text Image Fusion Neural Network (TIFNN).” This method consists of extracting information from written messages and images shared by users. In order to leverage the synergy of the texts and images, the proposed model computes the relationship between them using the direct product. TIFNN achieved accuracies of 84–90% for the different languages used. Basil et al. (2019) tackled the issue of age profiling using convolutional networks. They used datasets collected from Facebook and Twitter. This dataset has been tested the deep convolution neural network algorithm (DCNN) for the classification of age group as a two class such as teenager and adult age group. DCNN had the best performance, reaching a precision of 89% in the validation tests. Another interesting work relying on deep learning models was proposed in Mac Kim et al. (2017) where the authors address gender, age and user type (“individual,” “organisation” and “other”) profiling on Twitter data. Authors tackled all tasks as a graph vertex classification task using GRNNs based. For this purpose, they employed recursive neural networks and they achieved good performance. In Mabrouk et al. (2020), the authors proposed a new ontology-based profile categorization approach. In order to train a gender predictor systems, experiments using the gender classification dataset 2018 confirm the performance and the effectiveness of this method.

To conclude, several researchers have developed various methods to solve the bots detection problem and the gender identification task. Random forest was the classification technique that has been proven to give the best performance for bot detection on social networks, especially on Twitter. In the next section, we present our motivation for the proposed approach for bot and gender profiling in PAN 2019 (Pardo and Rosso 2019). The main objective of our system is to show how the writing style employed by users in online social media helps us to profile them. More precisely, how can writing style of user differentiate between bots and humans, as well as identify her/his gender.

3 Motivations for proposed approach

The proposed approach solves the problem of the classification of Twitter messages in order to detect bot profiles from human ones. Then, in case of a human, identify her/his gender (man or woman). The strong motivations behind our own work are illustrated in the following points: (1) Bots have become ubiquitous in online social media networks, especially in Twitter during the last 15 years (Dickerson et al. 2014; Davis et al. 2016; Subrahmanian et al. 2016). Therefore, detecting Twitter bots has become an important research domain across the globe for a variety of security purposes. (2) Classification of Twitter messages is especially a challenging problem compared to the classification of long texts such as scientific articles and newspapers (Naouar et al. 2017) that has been the focus of research effort during the last two decades. Indeed, their informal style makes classification hard because there is very little writing style information in short texts (Omri 2004b, a). Therefore, the majority of researchers focus on the content of tweets (Mehrotra et al. 2013; Najib et al. 2015; Nieuwenhuis and Wilkens 2018; López-Monroy et al. 2020) more than using the writing style of users to classify Twitter messages. Therefore, we present our system in order to determine which kinds of stylistics information have the greatest impact to solve that issue. We used a set of new statistical features based on the writing style of the authors. (3) Motivated by the positive results of statistical approaches (Fkih and Omri 2013) and traditional machine learning models for various classification problems (Sendi et al. 2019; Fkih and Omri 2020; Boukhari and Omri 2020), we present in this paper our proposed method-based machine learning. Random forest is one of the most popular supervised machine learning techniques and shows relatively superior generalization performance for many classification tasks in recent years (Surendran et al. 2014; Garibay et al. 2015; Maitra et al. 2016; Kyebambe et al. 2017).

4 Proposed approach for Author Profiling

4.1 General concept of the proposed approach

In this section, we describe our proposed model used to solve the task of bot and gender profiling in PAN 2019 (Pardo and Rosso 2019). This method is designed to identify two types of classes: bot and gender. Bot and gender classification are considered as separated problems. We determinate firstly whether an author is a human or a bot, and then, we predict whether an human is a male or a female.

Figure 1 shows the general process of this study, which is composed of three major parts. The first part is data preprocessing. All operations in this step were done using the Natural Language Toolkit (NLTK). The second part is related to features extraction, which is the major part of our study. We focus on features based on the writing style of authors to solve the present problem. The features are divided into four categories: character-based features, word-based features, syntax-based features and Twitter features. These features have been adopted in previous text classification works and proved effective for online text classification (Flekova et al. 2016; Ashraf et al. 2016; Guimaraes et al. 2017; Safara et al. 2020). The final part of this paper focused on Twitter users classification. This part is composed of two main phases: bots detection and gender classification, i.e., distinguishing bots from humans and in case of a human, identifying his/her gender (male or female). In this part, the prepared features were considered to be an input for different supervised machine learning techniques. We found that the random forest classifier which is frequently used for online text classification outperforms all other algorithms in terms of performance. We note that we used the same concept for bot and gender classification for each respective language. More details about our proposed approach are presented in Sect. 4.2.

Fig. 1
figure 1

Overall process of the proposed approach

4.2 Detailed procedure

4.2.1 Preprocessing and text analysis

To answer the second research question, we describe in this section the different steps we did in the preprocessing phase. The preprocessing of the data is an important step, and its goal is to clean data and to make it ready for applying machine learning classifiers to it. The first step we did in this phase is the concatenation of each user’s tweets in order to have only one document for a single author. Then, we perform some transforming tasks of the tweets (tokenization and lemmatization). In addition, we remove digits, we remove also stopwords (such as “the,” “and” and “for” in English and “es,” “por” and “el” in Spanish) from tweets because they don’t provide any useful information for the classification task. There is a default list of stopwords in the NLTK library in several languages, especially English and Spanish. So, in this work, we delete the stopwords provided by NLTK library. It is important to mention that we did not remove punctuation marks, emojis and other special symbols, because these lexical structures are helpful in bot and gender detection and will be used in the present work.

All operations in the preprocessing step were done using NLTK library, a suite of program modules written in Python.

4.2.2 Features extraction

Features extraction and selection from the data are a very critical process for AP task and had a great impact on the experimental results. Previous works on the AP task explore different classes of features for bot and gender detection. To the best of our knowledge and according to the literature, stylometry features are very helpful in distinguishing between tweets written by different categories of people (bots, humans) (Flekova et al. 2016), because online social media users may have different writing styles. They may share texts by using uppercase or lowercase letters, repeating characters or words in one sentence, employing some special symbols and emojis and so on. For example, an important difference between a human profile and a bot profile, that we noticed from the two datasets, is the usage of hashtags (#) and mentions of users (@). We think that bots use to insert hashtags in their texts as a way to increase their reach, and they try to mention multiple users to call their attention. In addition, bots retweet content of others human users to ameliorate their Twitter accounts. Also, they share a high amount of links in their publications compared with human, and this is a typical behavior of spam bots. Considering the above different factors aforementioned, we assume that the statistics of these special tokens in user-generated texts may differentiate bot profiles from human ones, as well as predict the gender of human users. Our proposed approach is purely statistical and based only on tweets from the Twitter users. It accepts input from any document written in English or Spanish, i.e., features used can be applied on any language for bot and gender detection. In our proposed method, we used the same set of features for the bot detection and the gender prediction tasks for the two different languages.

The features used in our study are divided into four categories: character-based features, word-based features, syntax-based features and Twitter features. Each category has its own features as explained in the followings (for details, see Table 1):

  • Character-based features: number of characters, number of lower cases, number of upper cases.

  • Word-based features: number of words, number of short words , number of long words, number of capitalized words, number of words that only once occur, number of words occurring twice.

  • Syntax-based features: number of special characters, number of punctuation marks.

  • Twitter features: number of retweets, number of mentions, number of tweets mentioning the word bot, number of URLs, number of emojis, number of hashtags.

Table 1 List of the features used in the development of our proposed system for PAN 2019 bot and gender profiling task

Tables 2 and 3 show some examples of bots and human tweets extracted from the English and the Spanish datasets, respectively, before the preprocessing step.

Table 2 Some tweets extracted from the English dataset
Table 3 Some tweets extracted from the Spanish dataset

For the two languages, we confirmed that bots employed links and hashtags in their posts and shared content of others profiles (retweets) more than humans. To emphasize on the different features we employed in our study, Fig. 2 shows an example of English tweet, written by human, before and after preprocessing.

Fig. 2
figure 2

Illustration of text post-processing and preprocessing

This is an English tweet written by a male user, containing 40 characters (lower cases), 6 words, 3 long words; punctuation-count is 2, emoji-count is 3; there are no hapax legomena, no hapax dislegomena, no using of the word “bot,” no capitalized words, no links, no hashtags, no short words, no upper cases, no retweets, no mentions and no special characters.

4.2.3 Proposed algorithm

Motivated by the positive results of traditional machine learning models for AP challenge (mentioned in Sect. 2), we explore in this work how these strategies would provide good results in the task at hand, i.e., in bot and gender author profiling task. To this end, our proposed approach relies on applying supervised machine learning algorithm: “Random Forest.” We chose to train our model with this algorithm because of its best performance in recent years for AP problems and other various domains (Akar and Güngör 2012; Chu et al. 2012; Dong et al. 2013; Inuwa-Dutse et al. 2018; Yang et al. 2019; Li et al. 2020).

Random forest (Breiman 2001) is a supervised learning technique which is widely used for classification problems. In this technique, bootstrapping strategy is employed to partition the set of features into multiple training subsets. Then, the algorithm trains a model (decision tree) on each subset in the training data and gets a prediction result from each decision tree. Then, it performs a vote for each predicted result. Finally, it selects the prediction result with the most votes as the final prediction. Figure 3 shows the general workflow of random forest algorithm. The pseudo-code of our proposed model is shown in Algorithm 1, including the preprocessing phase, the features extraction phase, the step sequence of the random forest algorithm as well as the evaluation metrics which are defined. The random forest algorithm works as follows: For each tree in the forest, we select a bootstrap sample from T where \(T^{(i)}\) denotes the ith bootstrap. Then, at each node of the tree, instead of examining all possible feature splits, we randomly select some subset of the features \(m \subseteq M\), where M is the set of features. The node then splits on the best feature in m rather than M. In practice, m is so much smaller than M. Here, we define the training set T by:

$$\begin{aligned} T=\left\{ ({x_1}, {y_1}),\dots , ({x_n}, {y_n}) \right\} \end{aligned}$$
(1)

To obtain the best model, it is necessary to adjust the parameters of the algorithm. For random forest, the most important setting, is: n_estimators is the number of trees in the forest. A higher number of trees give better performance but increase the complexity of prediction in terms of time (i.e., it makes the code slower). It should choose as high value as our system can handle because this makes predictions stronger and more stable. Empirically and after some experimental steps, we found that a good setup for both gender and bot classification was: \(n\_{estimators}=100\).

figure a
Fig. 3
figure 3

Workflow of random forest algorithm

It is important to note that, in this study, we rely only on our best performing model (i.e., random forest (RF)). However, we find it important for our next research in this area to train other different models with the same inputs data. So, we examined other machine learning techniques such as support vector machine (SVM), logistic regression (LR) and k-nearest neighbors (KNN). A CNN architecture was also tested in this paper. For both languages and for the two tasks (bot detection and gender identification), the best found configuration of all used classifiers is presented in Sect. 5.2.

5 Experimental study and results analysis

In this section, we present the datasets used to approach the task in question (Sect. 5.1), the strategy of defining the hyperparameters for all machine learning algorithms used (Sect. 5.2), the different parameters used for the performance evaluation (Sect. 5.3). Finally, the experimental results are discussed (Sect 5.4).

5.1 Dataset description

We used the PAN-AP-19 dataset (Pardo and Rosso 2019) to train our proposed system. It consists of tweets from different Twitter users in the form of XML files; each one corresponds to an author and contains 100 unprocessed tweets. This dataset was made available for English and Spanish languages. The English training corpus contains 4.120 author profiles, whereas Spanish training corpus contains 3.000 author profiles. The dataset is perfectly balanced between bots and humans in both languages. The data were previously split in train and dev for both English and Spanish languages, train for the training and dev for the test or validation phase. Table 4 shows more statistical details about the datasets used.

Table 4 Distribution of data in the PAN19-author-profiling-training corpus for bot and gender profiling task

5.2 Hyperparameters tuning of the algorithms used

The hyperparameters configuration is an important step, because a bad choice of setting can influence our results in terms of performance. At first, the hyperparameter tuning was done by hand. This gives poor findings. So, we opted to explore the use of the grid parameter search. After some experimental steps using the grid parameter search, to find the best hyperparameter configuration for all tested classifiers, we define each parameter as follows:

  • For RF classifier: n_estimators=100,

  • For LR classifier: C=1e2 and fit_intercept=False,

  • For SVM classifier with linear kernel: C=1.0,

  • For SVM classifier with RBF kernel: C=1.0 and gamma=0.1,

  • For KNN classifier: N_neighbors=500,

  • For CNN classifier: Convolutional filter size=64, Convolutional kernel size=4, MaxPooling: pooling size=4, Dropout rate=0.5, Dense layer=256 units, Layers Activation function= ReLu, Optimizer: Adam, Learning Rate: Adaptive (\( 0.001 \longrightarrow 0.00001 \)), Loss function: Binary crossentropy, Batch Size=32.

5.3 Evaluation metrics

Evaluation is carried out using four standard measures including accuracy, precision, recall and the F1-score. We can note that the greater the precision, the accuracy, the recall and the F1-score are, the better the model performance. The different parameters are defined as follows:

Accuracy is the ratio of number of correct predictions to the total number of input samples, i.e.,

$$\begin{aligned} {\text {Accuracy}} = \frac{{\text {TN}}+{\text {TP}}}{{\text {TN}}+{\text {TP}}+{\text {FP}}+{\text {FN}}} \end{aligned}$$
(2)

Precision is the ratio of correct positive instances among the total of the positive instances, i.e.,

$$\begin{aligned} {\text {Precision}} = \frac{{\text {TP}}}{{\text {TP}}+{\text {FP}}} \end{aligned}$$
(3)

Recall is the fraction of correct positive instances over the total of all relevant samples and is computed as:

$$\begin{aligned} {\text {Recall}} = \frac{{\text {TP}}}{{\text {TP+FN}}} \end{aligned}$$
(4)

F1-score is approximately the harmonic mean between precision and recall measures, i.e.,

$$\begin{aligned} {\text {F1-score}} = \frac{2*{\text {Precision}}*{\text {Recall}}}{{\text {Precision}}+{\text {Recall}}} \end{aligned}$$
(5)

where TP, TN, FP and FN are true positive rate, true negative rate, false positive rate and false negative rate, respectively. Tables 5 and 6, called confusion matrix, summarize all these measures for bots detection and gender identification, respectively.

Table 5 Confusion matrix for bots detection
Table 6 Confusion matrix for gender classification

5.4 Results and discussion

Based on the proposed features, the set of binary classifiers were trained using the English and the Spanish datasets and were applied for bots and gender detection. For each dataset, we performed a tenfold cross-validation on all classifiers. Indeed, cross-validation technique has been shown empirically to yield encouraging results (Daneshvar and Inkpen 2018a). In the subsections below, we first present the performance of our proposal compared to other traditional machine learning algorithms. Second, we compare our approach with other works using the same datasets.

All of our experiments are conducted on an Acer Aspire E5-573 laptop with 2.4 GHz Core i5 5th generation CPU and 16 GB of RAM. For the software environment, we use Windows 10, 64 bit, and a recent version of python environment (Python 3.6).

5.4.1 Comparison against other approaches using the same features

To compare and evaluate our proposed method, data mining techniques such as SVM, LR, KNN and CNN were examined on the same datasets using the same set of features.

After running the random forest algorithm as well as the other supervised machine learning classifiers, Tables 7, 8, 9 and 10 and Figs. 4, 5, 6, 7, 8, 9, 10 and 11 summarize our findings. Indeed, our proposed approach succeeded in finding solution for bots and gender detection in English and Spanish languages using statistical features with the training datasets. As can be seen in results tables, the classification performance of the RF technique (reported in bold) exceeds all other methods results, for both English and Spanish languages, in terms of accuracy, precision, recall and F1-score. Using this supervised classifier, for English language, accuracies of 92.45% and 90.36% are obtained for bot and gender classification tasks, respectively. In terms of recall, bot detection achieved 92.09% and gender classification attained 90.62%. 92.76% and 90% of precision are shown for bot and gender classification tasks, respectively. F1-score of 92.43% was achieved for the bot classification task and F1-score of 90.31% for the gender detection task. On the other hand, for Spanish language, accuracies of 89.68 % and 88.88% are obtained for bot and gender classification tasks, respectively. In terms of recall, bot detection achieved 89.69% and gender classification reached 89.26%. 89.32% and 88.67% of precision are shown for bot and gender detection, respectively. F1-score of 89.5% was achieved for bot detection task and F1-score of 88.69% for gender classification. SVM and LR classifiers also achieved acceptable performance for both English and Spanish datasets (73.82–89.41%). With regard to the CNN algorithm, it also achieved encouraging performance especially for bot identification on the English language (with accuracy of 89.70%). Bad results are obtained using the KNN algorithm.

Table 7 Accuracy results on bot and gender detection
Table 8 Precision results on bot and gender detection
Table 9 Recall results on bot and gender detection
Table 10 F1-score results on bot and gender detection
Fig. 4
figure 4

Comparison of the performances of all models applied in terms of accuracy using English dataset

Fig. 5
figure 5

Comparison of the performances of all models applied in terms of precision using English dataset

Fig. 6
figure 6

Comparison of the performances of all models applied in terms of recall using English dataset

Fig. 7
figure 7

Comparison of the performances of all models applied in terms of F1-score using English dataset

Fig. 8
figure 8

Comparison of the performances of all models applied in terms of accuracy using Spanish dataset

Fig. 9
figure 9

Comparison of the performances of all models applied in terms of precision using Spanish dataset

Fig. 10
figure 10

Comparison of the performances of all models applied in terms of recall using Spanish dataset

Fig. 11
figure 11

Comparison of the performances of all models applied in terms of F1-score using Spanish dataset

Graphic comparison shows also that all supervised machine learning algorithms used obtain higher scores when applied for the bot classification than when the same algorithms are applied for gender classification. This difference can be justified by the small training data sizes for gender classification. Also, we can note that our proposed features are more suitable for distinguishing bots from humans than for gender detection. Therefore, the gender classification task is more difficult than the bots detection task. This is likely to happen because bots are likely to insert URLs into their shared tweets and mention others profiles, and humans share texts without more usage of URLs and mention users. Consequently, it makes it easier for the classifiers to distinguish bots from human. Regarding the language of the used datasets, it can be noted that results for English data are higher compared to Spanish, even though same features are extracted for both languages. We can justify this by the complexity of the Spanish data; indeed, Spanish tweets in the datasets of the PAN 2019 bot and gender profiling task (Pardo and Rosso 2019) contain multilingual texts. Therefore, bot and gender detection with Spanish data is more difficult than with English data.

To sum up, according to our results, it makes it easier, for the classifiers applied in our study, to distinguish human from bot in English language.

5.4.2 Comparison against other approaches using the same random forest technique

To assess the efficiency of our approach and demonstrate how does our contribution surpass the existing ones, three state-of-the-art methods proposed to solve the identified task (i.e., the PAN 2019 bot and gender profiling task), that use the same RF technique, were compared with our method.

The baseline systems for the bots and gender profiling task used in our comparative study are presented by: (1) Goubin et al. (2019), (2) Puertas et al. (2019) and (3) Ashraf et al. (2019). This comparison is shown in Table 11 which reports accuracy results. Accuracy was the metric used in these different works, and since we do not have access to the results of precision and recall of other participants, F1-measure could not be calculated. In Goubin et al. (2019), the authors used RF classifier with different features such as number of URL, number of hashtags, number of emojis and number of pronouns in order to predict whether the users were bots or humans. Then, for the gender classification task, they used TF-IDF model based on 1- and 2-word n-grams as features and show that the best performance was reached by the SVM algorithm. In Puertas et al. (2019), for automatic bots identification, the authors used words, number of hashtags, mentions, URLs and emojis as features. For the gender identification subtask, they included the average, kurtosis and asymmetry of the counts of hashtags, mentions, URLs and emojis. They also analyzed the lexical diversity comparing the words used in one tweet to the words used in the rest of the tweets. After some testing on different classifiers, in the case of the English language RF obtained better performance for bots and gender. And for the Spanish language, RF had better accuracy for bots, while LR had better accuracy for genre. Finally, in Ashraf et al. (2019), Ashraf et al. used 27 stylometry features such as number of spaces, number of words and length of tweet for both bots and gender profiling tasks. They tested a range of classifiers including LR, RF, LinearSVC, BernoulliNB, MultinomialNB and SVC. The best results are obtained using RF classifier for both English and Spanish languages. Table 11 summarizes the performance results of all these methods.

Table 11 Comparison of classification results in terms of accuracy of our method with other works provided by Goubin et al. (2019), Puertas et al. (2019) and Ashraf et al. (2019)

As shown in Table 11, the results obtained with our proposed approach are better than the baselines provided by the task organizers mentioned above. Using the same RF algorithm, for the English dataset, our method worked strongly better than others baselines proposed by challenge organizers. Indeed, for bot detection, we achieve an accuracy of 92.45%. Accuracies of 90.34%, 88.07% and 92.27% were achieved, respectively, for the works of Goubin et al. (2019), Puertas et al. (2019) and Ashraf et al. (2019). On the other hand, for the gender classification task, again our approach outperformed all baselines in terms of accuracy (90.36%). Bad result was obtained in the work of Ashraf et al. (2019) with 75.83% of accuracy. For the Spanish dataset, none of the previous works have outperformed the results obtained by our approach using the same RF classifier. For bot classification, challenge organizers reached not very good accuracies (80.61–88.39%). Also, for gender prediction subtask, accuracies obtained by the same organizers were between 69.44% and 79.17%. To sum up, our method outperforms all the baselines and in particular seems more effective especially on English language.

An interesting observation needed to be noted is that all systems previously identified presented by challenge participants (including ours) had higher scores for bot detection than for gender identification. Therefore, we concluded that gender profiling turned out to be much more challenging than bot recognition. In that case, we can explain that this difference in accuracy is justified by the availability of a small English data size compared with the large training corpus used for the bot classification task. It is also worth noting that RF model tends to have a more stable improvements in performance than other baselines when it tested on our well-selected stylistic features. In conclusion, Table 11 confirms the robustness and the efficiency of our method compared to those of the models studied in the literature on the same problem.

6 Limits of the proposed approach

Our random forest-based proposed approach is simple and easy technique for bot and gender identification in Twitter, but not without some limitations. We see that our method presented some failures with the gender classification in Spanish language which affected its performance. The use of the same concepts in both bot and gender classification task can cause problem and affect the performance results. Maybe it was better to use different features for the two different tasks in order to achieve encouraging results to solve this problem. This strategy could be used in future work. Regarding the proposed classifier (i.e., random forest), one of the problems was the computing time. The random forest model requires much more time to train because of its large number of trees generated. The volume of the training data was another important problem in this challenge. Classification problems require a huge dataset to get a good prediction result. Other large Twitter datasets could be examined as well in order to show the effectiveness of the proposed method.

7 Conclusion and future works

In this paper, we present our statistical approach for the PAN 2019 bot and gender profiling task (Pardo and Rosso 2019). Our method consists of two steps. Firstly, we predict whether Twitter accounts are bot or human. Then, we identify the gender of a Twitter human user (i.e., male or female). For this purpose, we used the same features for bot and gender detection (17 stylometry-based features). One of the most significant challenges in this work is the preprocessing of the data. Despite the trivial features used to implement our system, we obtained promising results in both tasks. This demonstrates the importance of the preprocessing phase. For the classification, it was necessary to evaluate different tools with different parameters. We benchmarked our approach with various machine learning algorithms including random forests, support vector machines, logistic regression and k-nearest neighbors. The final experimental results show that our proposed stylometric features work well on such a kind of short and free style messages as Twitter posts. Random forest classifier was the most relevant techniques for both bot and gender detection. In addition, the results of the experiments confirm the robustness and the good performance of our method compared to those of the main models studied in the literature.

There are still issues to explore in this task, and our future work will therefore be structured around four directions. The first direction is to integrate other features in addition of statistical features; for example, we will try to model and extract new deep semantic features from tweets and to implement a hyperparameter optimization to tune the random forest-based proposed model. The second direction consists in using more powerful techniques for features selection such us Boruta algorithm in order to improve model performances. As a third direction, we plan to conduct more tests on other different standard data collections in order to definitively confirm the performance and robustness of our proposed model. The fourth direction is to test our approach with other models based on deep learning techniques in order to give academics and practitioners more amplification on how to deal with the problem of bot and gender profiling.