1 Introduction

The present economic and social environment is characterized by a series of unexpected events having a major impact on the lives of people across the world [1, 2]. In this context, social media has become a meeting ground where people connect in real time, sharing ideas and information regarding the events that alter their daily lives [3, 4]. At the same time, social media has become an ideal data source that can be explored by both researchers and policy makers when trying to better understand the issues, fears and information needs of society. In the process of addressing these issues, knowing the audience is an important step for devising adequate policy [5]. Among the demographic characteristics of the users posting in social media, gender plays an important role since events can affect women and men differently [6, 7].

The specific problem we aim to solve with our contribution is the lack of information regarding the underlying demographics of text data sampled from social media sources. This lack of insight has been widely cited as an intrinsic issue with computational social science research [8,9,10]. An estimation of the actual underlying demographics would be valuable, as it would allow an evaluation of how representative the sample is. Additionally, it would allow a very granular approach to computational social science that could generate deeper insights into the many factors that correlate with certain opinions, such as opposition to vaccination [9, 11]. The development of a comprehensive demographic classification methodology, together with the availability of appropriate training data, would push the field towards becoming a valuable complement to traditional methods such as surveys, with the advantage of being able to leverage a vastly superior number of data points.

As a result, the aim of our contribution is to propose an improved method for estimating the gender distribution of a sample of online texts gathered from the microblogging platform Twitter using computational tools. We use publicly available datasets to train a series of classifiers for two sub-problems: the identification of a given name as opposed to a surname or other English word, and the identification of the gender of that given name. We obtain the best results using random forest (RF) classifiers for both sub-problems. We compare our approach to a baseline inspired by the PAN18 Author Profiling task [12] using the text component of the provided dataset. We validate our approach on domain data gathered by Banda et al. [13] during the COVID-19 pandemic, obtaining a gender distribution that matches the true estimated distribution of Twitter users [14]. Additionally, we extract and compare the top n-grams for each gender to analyze whether there are meaningful differences between genders in the discourse related to the COVID-19 pandemic. We release the composite dataset used for the given name identification sub-problem for further research use.

The paper is structured as follows: Sect. 2 provides a brief literature review which supports the need for the current study, while Sect. 3 describes the data and methods used in the current approach. Section 4 analyzes the performance of the gender identification approach, with a focus on the results obtained on the selected COVID-19 dataset. The paper ends with concluding remarks and further research directions.

2 Related Work

Natural language processing (NLP) uses computational tools for operating on inputs in natural language. The field has evolved tremendously during the past few years, seeing the introduction of the Transformer neural network architecture [15] and the development of large transfer learning models such as BERT [16].

At the same time, there has been increased interest from fields associated with the social sciences in using these computational tools to analyze various aspects related to public opinion [1, 5, 9, 17]. The challenge with these approaches is that in most studies demographic information is missing, with the analysis being conducted on the entire dataset without considering differences that may exist in terms of gender, age, or ethnicity [18]. Thus, it is difficult to connect any findings back to the social context in which the social media discourses under study arose in the first place.

The NLP task that is concerned with extracting such information from text data is known as author profiling (AP) [19]. It aims to identify details about the user such as gender, age, native language, etc. [19,20,21]. An important subtask of AP is gender identification. This can be defined formally as the task of finding the tuple ⟨a, g⟩ given a sample of text x_i, where a is the author and g is the gender, g ∈ {female, male}. We have identified two main approaches to gender identification: intrinsic gender identification, where x_i is one document out of a corpus X of annotated documents ⟨x_i, g_i⟩, and metadata-based gender identification, where the document x_i is a piece of information concerning the author, such as their name, occupation, place of employment, or preferred pronouns.

Approaches to gender identification can be grouped from a technical standpoint into dictionary-based [22], classical machine learning [19], and deep learning [21]. A comprehensive review that compares the results achieved on the PAN18 Author Profiling dataset by the approaches described in 23 papers focusing on gender detection is included in [12]. Out of these, the best accuracy (82.21%) has been achieved by Daneshvar and Inkpen [23], where the authors used a Support Vector Machine classifier with a combination of different n-grams as features.

The gender identification method we propose is a metadata-based one, as by using the Twitter API it is possible to retrieve the public name field of any tweet’s author. Inferring a person’s gender from their name is possible because most European names are inherently gendered. It is important to note that this is not applicable to all languages and cultural contexts; for instance, not all Mandarin Chinese names can be assigned a gender [24]. Thus, great care must be taken to avoid using the name-based approach when dealing with non-European contexts and languages. In such cases, domain knowledge should be used to determine how well the approach fits local naming customs.

3 Data and Methods

3.1 Domain and Training Data

The domain data on which we aimed to validate our approach is the Large-Scale COVID-19 Twitter Chatter Dataset made available by Banda et al. [13]. This dataset contains 1.2 billion tweets related to COVID-19, collected between January 2020 and June 2021, presented in the form of a list of tweet IDs that can be used to retrieve each individual tweet from the Twitter API [13].

Due to issues of scale, we further reduced the number of tweets by restricting the timeframe to the period between January 2021 and March 2021. We retrieved the name of the user who posted each tweet for gender identification. Our curated domain data contained 44,248,682 tweets from 6,999,706 distinct users. One difficulty with using the user’s name to identify their gender is that on Twitter the name field is free text; besides allowing the user to simply adopt a pseudonym, it can also contain non-name tokens such as titles, job-related information, political affiliation, or preferred pronouns. Nevertheless, after cleaning the data and removing special characters, we observed that many of the most common unigrams ranked by term frequency appear to be personal names (see Table 1), suggesting that many users prefer to state actual human names rather than other signifiers. At the same time, the incidence of given names decreases when bigrams and trigrams are considered, with email addresses (“gmail com”), pandemic-related messages (“wear mask”), and political messages (“black lives matter”, “president elect”) becoming more common.

Table 1. Top-10 n-grams ranked by term frequency found in the name field.
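For illustration, a term-frequency ranking like the one in Table 1 can be computed with a short script. The following is a minimal sketch rather than the exact code used for the paper; the cleanup regex and the example name fields are our own assumptions:

```python
from collections import Counter
import re

def top_ngrams(names, n=1, k=10):
    """Rank word n-grams from a list of display names by term frequency."""
    counts = Counter()
    for name in names:
        # Lowercase and keep only alphabetic tokens, mirroring the cleanup step
        tokens = re.findall(r"[a-z]+", name.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)

# Hypothetical name fields for demonstration
names = ["Dr. John Smith", "maria | she/her", "Wear A Mask", "John Doe"]
print(top_ngrams(names, n=1))  # unigrams
print(top_ngrams(names, n=2))  # bigrams
```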

The data thus appears to contain the name information, simply requiring special pre-processing. Pre-trained named-entity recognition (NER) models can be used to identify given name – surname tuples, but as the name order used on Twitter may be variable or interspersed with tokens such as “dr” or “mr”, these models might not generalize well to arbitrary data and might skew the resulting distribution in ways we cannot easily explain, account for, or compensate for. Thus, we propose a novel machine learning approach to given name identification, in which we evaluate each token individually and classify it as a given name or not.

Because, to the best of our knowledge, this exact technique has not been applied in the literature, we know of no publicly available dataset relevant to this task. Nevertheless, the n-gram analysis from Table 1 suggests the presence of at least four categories of signifiers in the field: given names (“david”, “john”, “michael”), surnames (“singh”, “de la x”), other English words (“dr”, “name”, “cannot”), and non-English words (“hu tao haver”, “sb ikalawang yugto”). As such, a dataset containing these classes of tokens can be constructed from other publicly available datasets and annotated automatically.

For this purpose, we merged three separate datasets. For given names, we used the Gender by Name Data Set available in the University of California Irvine Machine Learning Repository, containing 147,270 personal names and the associated biological sex of the persons bearing those names, dating from between 1880 and 2019 and gathered from the US, the UK, Canada, and Australia [25]. For surnames, we used the Wiktionary Names Appendix Scraped Surnames Dataset, containing 45,136 surnames of persons across the world, gathered from Wiktionary. Finally, for arbitrary English words, we used the 1/3 million most frequent words corpus [26], containing the top 333,331 words used in the English language ranked by term frequency. Because certain tokens were present in more than one dataset, we removed all duplicates. The merged dataset, consisting of 476,089 tokens, can be accessed at: https://github.com/erkovacs/measuring-gender-a-machine-learning-approach-social-media-demographics-author-profiling-data. For the gender identification problem, we used the Gender by Name Data Set on its own.
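A possible construction of the composite dataset is sketched below; file names and column labels are hypothetical stand-ins for the three sources described above, and the duplicate-handling policy shown is one reading of the deduplication step:

```python
import pandas as pd

# Hypothetical file and column names; the actual sources are cited in the text.
given = pd.read_csv("name_gender_dataset.csv")     # UCI Gender by Name
surnames = pd.read_csv("wiktionary_surnames.csv")  # Wiktionary surnames
words = pd.read_csv("unigram_freq.csv")            # top English words

tokens = pd.concat([
    pd.DataFrame({"token": given["Name"].str.lower(), "label": "given_name"}),
    pd.DataFrame({"token": surnames["surname"].str.lower(), "label": "surname"}),
    pd.DataFrame({"token": words["word"].str.lower(), "label": "word"}),
], ignore_index=True)

# Tokens appearing in more than one source carry ambiguous labels; one
# conservative policy (an assumption on our part) is to drop them entirely.
tokens = tokens.drop_duplicates(subset="token", keep=False)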

3.2 Preprocessing and Tokenization

We substituted all special characters in the data with their English phonetic transcriptions using the software package unidecode. In addition, we substituted all punctuation with the character “*” and lowercased all the tokens in the dataset. We applied this same preprocessing to tokens in all categories, and for both sub-problems.
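A minimal sketch of this preprocessing, assuming punctuation is defined as any non-alphanumeric, non-whitespace character:

```python
import re
from unidecode import unidecode

def preprocess_token(token: str) -> str:
    """Transliterate to ASCII, map punctuation to '*', and lowercase."""
    token = unidecode(token)                # e.g. "José" -> "Jose"
    token = re.sub(r"[^\w\s]", "*", token)  # punctuation -> '*'
    return token.lower()

print(preprocess_token("José-María"))  # 'jose*maria'
```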

Fig. 1. Data representation steps with a toy feature set consisting of bigrams and trigrams.

For feature representation, we used a character-level tokenization scheme based on the one proposed by Malmasi and Dras [27]. This tokenization scheme can capture sub-word structures that encode gender information at the level of names, using a compact alphabet composed of unigram, bigram, and trigram features, with the addition of the special token “*” mentioned above and “$”, which marks the beginning or end of a string [27]. After building this feature set and transforming the data, we used the most representative 2048 features to build a document-term matrix for each token, as shown in Fig. 1.
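Under the assumption that “most representative” means most frequent, this feature extraction can be approximated with scikit-learn’s character n-gram vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

def add_boundaries(token: str) -> str:
    # '$' marks the beginning and end of the string, as in Malmasi and Dras
    return f"${token}$"

vectorizer = CountVectorizer(
    analyzer="char",     # character-level tokenization
    ngram_range=(1, 3),  # unigram, bigram, and trigram features
    max_features=2048,   # keep the 2048 most frequent features
)
X = vectorizer.fit_transform(add_boundaries(t) for t in ["john", "maria", "dr*"])
```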

3.3 Ensemble Classifier

The final ensemble pipeline we propose consists of two classifiers and decision points (Fig. 2). The first is the given name identification step, which takes a list of tokens for each author and classifies each token as a given name or a surname/other word. Binary classification is sufficient here because we are only interested in whether the token is a given name or not. The tokens classified as given names are kept in the list; all other tokens are removed.

Fig. 2. Illustration of the ensemble method steps and decision points.

Users with zero tokens identified are labelled as “anonymous” and are not evaluated for gender in the next step. For the remaining users, each given name identified in the previous step is classified as female or male, and the predictions are averaged over the number of tokens. For simplicity, the decision threshold was set at 0.5.
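Putting the two stages together, the per-user decision logic can be sketched as follows (the classifier and vectorizer objects are as in Sect. 3.2; the label encoding and the class ordering of predict_proba are assumptions):

```python
def classify_user(name_tokens, name_clf, gender_clf, vectorizer, threshold=0.5):
    """Two-stage pipeline: keep given-name tokens, then average gender scores."""
    if not name_tokens:
        return "anonymous"
    X = vectorizer.transform(f"${t}$" for t in name_tokens)
    labels = name_clf.predict(X)  # stage 1: given name or not
    kept = [t for t, lbl in zip(name_tokens, labels) if lbl == 1]  # 1 = given name
    if not kept:
        return "anonymous"  # no given name found for this user
    Xg = vectorizer.transform(f"${t}$" for t in kept)
    # Stage 2: average the per-token probability of the 'female' class
    # (assuming it occupies column 1 of predict_proba's output).
    p_female = gender_clf.predict_proba(Xg)[:, 1].mean()
    return "female" if p_female >= threshold else "male"
```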

3.4 Benchmarking

In order to compare our metadata-based approach to an intrinsic gender identification approach, we have decided to use the English text component of the PAN18 Author Profiling dataset [12], containing 4,900 users annotated as female or male, with 100 tweets from each user. We chose this dataset because of its proximity to our own application and because it was also collected from Twitter. The original task envisioned multimodal author profiling, with the data also including several pictures for each user, but we considered that this machine vision element does not fit the scope of our work; as such, we limited ourselves to the text component for a fair comparison.

We fine-tuned a BERT [16] classifier on the full English part of the dataset for 10 epochs, with a maximum sentence length of 16 tokens (the average sentence length being 20.45 tokens) and a learning rate of 55e–6, chosen by running several training cycles with learning rates we have experimented with in the past [9]. To match our approach from Sect. 3.3, we merged the train and test data made available by the organizers and performed 5-fold cross-validation. The model obtained an F1 score of 79.40% and an accuracy of 79.38%, an above-average performance considering the results reported by Rangel et al. [12]. We then removed all duplicate tweets from the domain data, leaving 12,432,935 unique tweets, and used the model trained on the PAN18 AP dataset to predict the gender of the author of each of these.
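A sketch of the fine-tuning setup, using the Hugging Face transformers library and assuming the bert-base-uncased checkpoint (the specific checkpoint is not stated above):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # labels: female / male

def encode(batch):
    # Maximum sentence length as reported in the text
    return tokenizer(batch["text"], truncation=True, max_length=16,
                     padding="max_length")

args = TrainingArguments(output_dir="pan18-gender", num_train_epochs=10,
                         learning_rate=55e-6)
# train_ds / eval_ds: one fold of the merged PAN18 data, tokenized with encode()
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```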

4 Results

4.1 Classifier Evaluation

We have trained multiple classifiers, using both classical machine learning and deep learning, for each of the two sub-problems: random forest (RF), support vector machine (SVM), multinomial naïve Bayes (NB), a feedforward neural network (FFN), a recurrent neural network (RNN), and a long short-term memory network (LSTM). We compared the classifiers using the F1 score (Eq. 3), computed as the harmonic mean of recall (Eq. 1) and precision (Eq. 2). All values given are mean values obtained over 5-fold cross-validation.

$$recall = \frac{TP}{TP + FN}$$
(1)
$$precision = \frac{TP}{TP + FP}$$
(2)
$$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$
(3)

For the given name identification sub-problem, we obtained the best results using the RF classifier with n-gram counts as features (see Table 2). Despite experimenting with different architectures and tuning hyperparameters, the deep learning models were not able to surpass the performance of some of the classical machine learning algorithms. This highlights the continued relevance of these models, especially when well-fitting feature sets can be found. It is also worth mentioning that they are much faster to train than their deep learning counterparts and have far more modest hardware requirements.

Table 2. Classifiers performance in the case of the given name identification sub-problem.
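As an indication of the training setup, a random forest with this feature representation can be cross-validated as follows (the tree count and other hyperparameters are assumptions, as they are not reported above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: 2048-feature document-term matrix from Sect. 3.2
# y: binary labels (1 = given name, 0 = surname/other word)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring="f1")
print(f"Mean F1 over 5 folds: {scores.mean():.4f}")
```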

For the gender identification sub-problem, the RF classifier with n-gram count features again obtained the best results (see Table 3). The deep learning models underperform here as well. It is likely that the selected feature set [27] is not well-suited to these models.

Table 3. Performance of the classifiers in the case of the gender identification sub-problem.

In comparison to a purely dictionary-based approach, we expect this classifier to capture sub-word structures common to given names, allowing it to generalize better to new data. To test this hypothesis, we gathered a list of 70 fictional character names from the online computer game World of Warcraft. This game takes place in a medieval fantasy setting, and as such most characters have invented names that reflect their in-game culture and ethnicity. Nevertheless, most of these names contain the same sub-word structures as real-life names, allowing the classifier to achieve 84.29% accuracy. A sample of the predictions, both accurate and inaccurate, can be seen in Table 4.

Table 4. Examples of predictions on fantasy names.

In both cases it should be noted that all classifiers had good performance, lending credence to the hypotheses that human given names are sufficiently morphologically distinct from other English words to be easily learnable by the classifiers, and that gender information is encoded at the level of the form of given names.

4.2 Discussion

By applying the best-performing classifier to the COVID-19 domain data described in Sect. 3.1, we obtain the following predicted distribution of users by gender: 30.76% are anonymous, 33.71% are classified as female, and 35.53% as male (Fig. 3).

Fig. 3. The distribution of users by gender.

When excluding anonymous users, the predicted distribution (48.69% female, 51.31% male) is close to the empirically observed distribution of Twitter users (43.60% female, 56.40% male) [14]. The distribution obtained by the benchmark model, 49.58% female and 50.42% male, is similarly close. Note, however, that the benchmark model cannot identify anonymous users at all.

It is noteworthy that, among the anonymous users, Black Lives Matter-related and COVID-19-related messages feature prominently in the name field (see Table 5).

Table 5. Top 10 n-grams by term frequency from the name field of users marked as anonymous.

At the same time, the n-gram analysis performed on the predictions of both our pipeline and the benchmark model reveals that, among users predicted as female, our approach yields predominantly female names in the top 15 n-grams (with the exception of “mike”), whereas the benchmark model’s top n-grams include many male names (see Table 6). The same phenomenon, albeit less pronounced, can be seen for the users predicted as male. In the case of the benchmark classifier, this issue may be caused by the fact that its predictions take only the text into account; many users who are of a given gender but maintain an anonymous Twitter presence are therefore included, producing spurious correlations. It is also possible that the model simply did not generalize well from the PAN18 dataset to the domain data. Limitations such as these show that the two approaches can be used either independently or as complements, depending on the aims of the research and the available data.

Table 6. Top 15 n-grams by term frequency from the name field of users.

Furthermore, an analysis of the top-20 n-grams has been performed on the tweets whose authors were classified as male or female. In the case of unigrams and bigrams, the top-20 n-grams are highly specific to the topic of the dataset, namely the COVID-19 pandemic (e.g., “covid”, “vaccine”, “coronavirus”, “pandemic”, “death”, “covid vaccine”, “covid pandemic”, “covid vaccination”, “wear mask”, “get covid”), with the same n-grams being present in the tweets written by both female and male authors.

Fig. 4. Top 20 trigrams by TF-IDF score from the text of the tweets.

A significant difference between tweets written by female and male authors becomes visible in the top-20 trigrams. While the first nine trigrams are common to the discourse of both genders, as they pertain to general COVID-19 topics (e.g., “get covid vaccine”, “new covid case”, “tested positive covid”), differences can be noted for the rest. Female authors focused their discourse on encouraging the signing of a petition (e.g., “petition via ukchange”, “petition via change”) and expressed concern regarding the safety of children (e.g., “prioritise teacher school”, “teacher school childcare”, “school childcare staff”, “childcare staff covid”, “staff covid vaccination”), while the discourse of male authors revolves around relief funds (e.g., “covid relief bill”) and the vaccination process (e.g., “pfizer covid vaccine”, “covid vaccine via”, “covid vaccine rollout”, “first covid vaccine”, “covid vaccine dose”, “rollout covid vaccine”). The common trigrams are depicted in grey in Fig. 4, while those specific to female and male authors are shown in red and blue, respectively.

On the other hand, if the same n-grams analysis is performed on the tweets for which the gender of the authors has been determined using the benchmark classifier, no significant differences can be distinguished.
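The trigram comparison can be reproduced in outline as follows; ranking trigrams by their summed TF-IDF score over each gender’s tweets is our assumption about the aggregation step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_trigrams(texts, k=20):
    """Top-k word trigrams ranked by summed TF-IDF score over a set of tweets."""
    vec = TfidfVectorizer(ngram_range=(3, 3), stop_words="english")
    X = vec.fit_transform(texts)
    scores = X.sum(axis=0).A1  # aggregate score for each trigram
    terms = vec.get_feature_names_out()
    return sorted(zip(terms, scores), key=lambda p: -p[1])[:k]

# female_top = top_trigrams(tweets_by_female_authors)
# male_top = top_trigrams(tweets_by_male_authors)
```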

5 Conclusion

Correctly identifying differences in gender discourse can be of the utmost importance in shaping the right information campaigns. The present approach is most relevant as a complementary analysis tool in situations where more details are available about the authors, such as the text of their tweets or their profile photos. In such cases, it can be incorporated as a component of a multimodal gender detection approach that also considers traditional or stylistic text features extracted from the tweets, as well as the results of profile photo analysis.

One limitation of the study is that, by necessity, it does not consider the full complexity of gender within society. Our approach is also unable to detect instances of gender deception, or the use of pseudonyms that conform to the conventions of human given names. Neither does it distinguish between anonymous users and organizational users, which state the name of a company, product, or institution in the name field. Finally, our reference benchmark does not reach its full potential because we omitted the machine vision element; it is possible that its performance could be improved by incorporating image-based data (though the authors report mixed results from such attempts [12]). These issues can be addressed in the future by extending or modifying our approach.