1 Introduction

Social networks have had a profound impact on how we humans communicate. They were originally envisioned as a means to reach out to others and to support the spread of ideas, experiences, and opinions. From this premise, very popular platforms such as Facebook, Twitter, Reddit, and many others emerged. Unfortunately, these same platforms can also be exploited to express intolerance, hateful comments, aggressiveness, and harassment. Hate speech, for instance, has become a problem affecting the interactions among online groups (Burnap & Williams, 2015), since the intolerance and aggressiveness of certain users have a negative impact on the experience of other peers or even entire online communities.

As the volume of online interactions grows minute by minute,Footnote 1 the need for automated abusive language monitoring mechanisms becomes more evident (Nobata et al., 2016). To support novel research strategies addressing this need, challenges and shared tasks have recently been promoted within the Natural Language Processing (NLP) community (Sanguinetti et al., 2020; Kumar et al., 2018; Basile et al., 2019; Aragón et al., 2020; Fersini et al., 2018), and new resources for different platforms and languages have been created to extend the scope of existing studies. For example, Jiang et al. (2022) present a lexicon and dataset in Chinese for sexism detection, together with an exploratory analysis of the dataset's characteristics that validates its quality and shows how sexism is manifested in the Chinese language. Caselli et al. (2021b) present a Dutch abusive language corpus, a new dataset of tweets manually annotated for abusive language. Similarly, Plaza del Arco et al. (2021b) present a new corpus in Spanish for offensive language identification, describing its building process, its novelties, and some preliminary experiments. Pronoza et al. (2021) present a new ethnicity-targeted hate speech detection task in Russian and show that ethnicity-targeted hate speech is more effectively addressed with their proposed three-class approach. Furthermore, Amjad et al. (2021) introduce a collection of tweets in Urdu to assess classification methods for threatening language detection, distinguishing between threats aimed at individuals and at groups. Finally, Vidgen et al. (2021) introduce a contextual abuse dataset in English, in which labels are annotated in the context of the conversation thread, rationales are provided, and an expert-driven group-adjudication process is used to ensure high-quality annotations. Organizers of such events and developers of these resources provide real examples of texts showing reprehensible attitudes on social media platforms, in the form of hostile, hateful, or aggressive expressions.

Traditional ways to process social media posts include extracting patterns from their content and style, that is, paying full attention to the explicit text being shared. This rests on a questionable assumption, namely that the message alone is all you need to understand its real meaning, and it ignores one important aspect that we humans regularly master: the context. Accordingly, the hypothesis of this study is that exploiting the context improves the classification performance of learning models. By context we refer in particular to the post’s metadata, such as Retweet count, Reply status, and Favorite count, among other variables, but also to the author’s metadata, such as Default profile, Friends count, and VerifiedFootnote 2. In the end, we evaluate how the inclusion of up to 14 context variables enhances classification performance.

To test this hypothesis we extended seven existing Twitter benchmark datasets that did not originally provide metadata information, making this work, to the best of our knowledge, the largest study to date in terms of both the number of metadata variables and the number of benchmark datasets evaluated. To assess whether the findings are due to specific strengths of particular learning pipelines, we consider classical (bag of words & SVM classifier), modern (GloVe & GRU), and state-of-the-art (BERT & linear layer) text classification models.

After this analysis, we observe that results are consistent across all seven datasets, with the addition of context yielding an improvement of up to 6% in classification performance. Beyond this, an interesting finding is the generalization of this pipeline, since it spots hostile, offensive, aggressive, and hateful text alike.

The contributions of this study can be summarized as follows:

  1. The creation of a new resource for the study of abusive language in social media, made up of seven Twitter datasets that were expanded by retrieving metadata from tweets available online. This new compendium of extended datasets could foster new analyses of the role of context information in the detection of this kind of unwanted behavior.

  2. An analysis and experimental evaluation of up to fourteen context variables and three text processing models, which, together with the seven datasets, makes this the most exhaustive study to date on the impact of metadata on the detection of abusive language on Twitter.

The remainder of this paper is organized as follows. In Section 2 we revisit relevant literature to highlight where this study stands within the body of knowledge. Section 3 presents the construction process of the 7 context-enriched datasets, providing sufficient detail for future works willing to use this corpus. Section 4 presents the experimental setup. In Section 5 we present the results that validate this work’s hypothesis, while in Section 6 we discuss the results and present statistical and error analyses. Finally, in Section 7 we conclude this work with some remarks.

2 Related work

Most works on detecting abusive language have modeled it as a text categorization problem (Schmidt & Wiegand, 2017; Fortuna & Nunes, 2018); that is, posts, comments, or documents are assigned to one or more predefined categories based solely on their content. We organized this review around how relevant studies have represented the explicit message being shared; at the end, we also cover some studies that have attempted to exploit the context of the message or of its author.

The detection of abusive language has considered a great variety of features. Initial attempts used hand-crafted features such as bag-of-words representations, as well as syntactically and semantically motivated features. For example, Burnap and Williams (2015) experimented with different configurations of n-gram approaches, finding that word unigrams and bigrams can capture derogatory terms that are useful for detecting hate speech. Similarly, Chen et al. (2012) showed that an approach including criteria such as the writing style of users, relationships between offensive words and user identifiers, and cyberbullying language patterns outperformed traditional learning strategies. Moving a step forward, Nobata et al. (2016) fused various text features to identify abusive language: linguistic, syntactic, n-gram, and distributional-semantic features. In the end, all these features made their proposal robust, with better performance than state-of-the-art approaches at the time. Davidson et al. (2017) computed additional features such as sentiment scores and Part-of-Speech tag n-grams to represent information about the syntactic structure of the texts. That work also presents an initial exploration of the role of some social media metadata tokens, such as hashtags, retweets, and URLs, although the authors did not elaborate on the effects of adding these attributes to their feature set. Interestingly, their findings suggest that lexical methods can be an effective way to identify offensive terms but are inaccurate at identifying hate speech.

With the purpose of improving the generalization of classifiers, some recent works have explored the use of deep learning models to learn abusive language patterns without the need for explicit feature engineering. For example, Gambäck and Sikdar (2017) proposed a Convolutional Neural Network (CNN) that exploited word embeddings and one-hot character n-grams. That study outperformed a Logistic Regression model, suggesting some advantage of using deep models. Zhang et al. (2018) added a Gated Recurrent Unit (GRU) layer to a CNN model, benefiting from the feature extraction of the network while capturing order information. This architecture reported new state-of-the-art results in 6 out of 7 tested hate speech collections. In Mozafari et al. (2020), an approach based on deep contextualized word representations for hate speech detection improved the baseline model by adding a CNN as a supervised fine-tuning strategy. Their classifier outperformed the baseline scores reported by Waseem and Hovy (2016) and Davidson et al. (2017). More recently, transfer learning approaches considering pre-trained models such as ELMo, GPT-2, and BERT have also been successfully applied and adapted to the detection of abusive language (Liu et al., 2019; Nikolov & Radivchev, 2019; Caselli et al., 2021a). Furthermore, to address the detection of hate speech in languages other than English, a number of works have presented studies and comparisons of the effectiveness of BERT-based and traditional machine learning classifiers (Plaza del Arco et al., 2021a; Sharma et al., 2022; Pamungkas et al., 2021). These pre-trained models have also been applied in other related tasks; for example, Gomez et al. (2020) presented a multi-modal architecture to provide text messages expressing hate speech with visual context, Nelatoori and Kommanti (2022) incorporated bidirectional embeddings into a multi-task learning approach to distinguish online toxic messages, and Pandey and Singh (2022) described a stacked arrangement of BERT and LSTMs to detect sarcastic statements.

Regarding the use of context information, the survey by Schmidt and Wiegand (2017) states that meta-information about the background of the user can be especially predictive, and Schulz et al. (2020) likewise emphasize how a user’s public expression is shaped by their audience. This is because a user who is known to write abusive messages may do so again, while a user who is not known to write such messages is unlikely to do so in the future. Nevertheless, the survey also refers to some works where using other kinds of metadata from the post (reply count, geographical origin, etc.) led to contradictory results. Following this idea, Dadvar et al. (2013) used the number of profane words in a user’s post history as a feature to detect further abusive messages. The authors of the present work reported preliminary results in Casavantes et al. (2019, 2020), suggesting the plausibility of an approach that exploits metadata to detect aggressiveness in users’ posts. In another study, Chatzakou et al. (2017) built their feature set by extracting properties of the content of the messages, traits of the users, and their use of the social network. When the authors evaluated feature importance through information gain, they found that the user- and network-based attributes were the most relevant, contributing to highly accurate discrimination between neutral, aggressive, and cyberbullying users. Ribeiro et al. (2018) also characterized hateful comments, taking into consideration network- and activity-based attributes. Their results suggest that using GraphSAGE (Hamilton et al., 2017), a model aimed at learning on graphs, with network- and activity-based features along with GloVe embeddings (Pennington et al., 2014) improves the scores of the prediction task while also decreasing the standard deviation of 5 out of 6 quality measures.

As a final comment, we wish to remark that the latest results suggest the advantages of using deep learning strategies to learn optimized representations from posts, as well as the incipient efforts to include metadata information. To give a better perspective of where the present study stands with respect to this literature, we offer Table 1.

Table 1 Summary of related work that takes into account some form of metadata

3 Original and extended Twitter collections

We selected 7 recent datasets consisting of online posts originally gathered from the Twitter platform (Waseem and Hovy, 2016; Davidson et al., 2017; Álvarez-Carmona et al., 2018; Aragón et al., 2020; Basile et al., 2019; Mandl et al., 2019). These datasets were either published as individual studies or presented in international challenges and shared tasks (Vidgen & Derczynski, 2021; Poletto et al., 2021). The availability of the data and the diversity of abusive content within the scope of our research were the deciding factors in the selection of these datasets. Furthermore, we had the opportunity to participate in the shared tasks of three of the seven collections, so we benefited from prior familiarity with some of these resources. In Table 2 we provide the URLs to these resources. Next, we give a brief description of each collection.

Table 2 List of abusive language collections

Waseem and Benevolent Sexism datasets

The Waseem dataset consists of 16K tweets annotated for hate speech and collected over the course of 2 months (Waseem & Hovy, 2016). Three labels were considered: sexist, racist, and neither; however, since the dataset is distributed as tweet IDs and labels, and tweets of the racist class were almost entirely unavailable at retrieval timeFootnote 3 (only 17 out of 1972 samples at the time of assessment), we decided to discard the racist subset. The manual annotation by the creators of the corpus was reviewed with the help of an external annotator working on gender studies, with a reported inter-annotator agreement of κ = 0.84.

HatebaseTwitter dataset

Davidson et al. (2017) conducted a study in which the Twitter API was used to search for tweets containing keywords from a hate speech lexicon (Hatebase, 2021), resulting in a sample of tweets from 33,458 Twitter users. They extracted the timeline for each user and employed crowdsourcing from CrowdFlower to label a sample of over 24k tweets into three categories: those containing hate speech, only offensive language, and those with neither. They reported an intercoder agreement of κ = 0.92.

MEX-A3T aggressive detection track dataset

This dataset contains more than 7K tweets with hashtags related to politics, sexism, homophobia, and discrimination in Mexican Spanish (Álvarez-Carmona et al., 2018). The MEX-A3T team built two different datasets, one used in the first and second editions of the shared task (2018 and 2019) and another used in 2020 for the third edition (Aragón et al., 2020). For both collections, a set of vulgar words was used as seeds for extracting the tweets, and each tweet was then labeled as aggressive or non-aggressive. The annotation guidelines provide specific criteria to distinguish aggressive tweets from offensive and profane ones, based on the linguistic characteristics and intent of the message. The inter-annotator agreement was not reported.

HatEval Subtask A datasets

SemEval-2019 Task 5 - Subtask A consisted of a Hate Speech Detection task against Immigrants and Women in English and Spanish (Basile et al., 2019). To collect the tweets, the authors monitored potential victims of hate accounts, downloaded the history of identified haters, and filtered Twitter streams with keywords. The annotation task was performed by contributors from the crowd-sourcing platform Figure Eight, and by two expert annotators with previous experience in the subject. The English dataset consists of 10k tweets, whereas the Spanish collection contains 5k tweets. The inter-annotator agreement was reported as 0.83 and 0.89 for the English and Spanish datasets, respectively.

HASOC English Subtask A dataset

HASOC Subtask A required systems to classify tweets into two classes: Hate and Offensive (HOF) and Non-Hate and Offensive. The creators of this collection identified topics for which many hate posts could be expected, then data was sampled from Twitter and partially from Facebook using different hashtags and keywords (Mandl et al., 2019). The annotation process was carried out using an online system to label the tweets, reaching an inter-annotator agreement of κ = 0.89 in the English set. After the subtask results were published, the HASOC team released the complete corpus with class labels.

3.1 Context in the form of metadata

To enhance the selected datasets with metadata, we followed the protocol below.

  1. Each corpus was loaded with the proper encoding to preserve the intended format of the text messages.

  2. For every text sample, a query was made using the Twitter API to search for that instance, setting “end_time” as a search parameter for every dataset (e.g., if a dataset was released in September 2018, we could only retrieve tweets issued up to that date for that specific collection).

  3. We assessed the similarity between the original tweet and each query result, comparing string lengths and character placement (see the sketch after this list).

  4. We retrieved the information of the tweet with the highest similarity score.
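The following is a minimal sketch of steps 2-4, under two assumptions that go beyond what is described above: that the full-archive search of the Twitter API is accessed through the tweepy client, and that difflib’s sequence-matching ratio stands in for the length and character-placement comparison, whose exact formulation we do not reproduce here. The credential and parameter values are illustrative only.

```python
import difflib

import tweepy

# Hypothetical credential; full-archive search requires elevated access.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")


def best_match(original_text: str, release_date: str):
    """Search the archive up to the dataset's release date and return the
    (similarity, tweet) pair for the most similar candidate."""
    response = client.search_all_tweets(
        query=original_text[:128],            # query built from the text sample
        end_time=release_date,                # e.g. "2018-09-30T23:59:59Z"
        tweet_fields=["created_at", "public_metrics"],
        expansions=["author_id"],
        user_fields=["created_at", "public_metrics", "verified"],
        max_results=10,
    )
    candidates = response.data or []
    scored = [
        (difflib.SequenceMatcher(None, original_text, t.text).ratio(), t)
        for t in candidates
    ]
    return max(scored, key=lambda pair: pair[0], default=(0.0, None))
```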

3.1.1 Tweets’ and users’ metadata

Each tweet is represented by an object with a list of fundamental properties. Table 3 displays the tweet attributes, data types, and descriptions from (Twitter, 2021a) that were considered in our experiments. Note that we used the “Date of creation” attribute to derive a new feature: the hour of the day at which a tweet was posted (an integer ranging from 0 to 23, referred to from now on as “Hour”).

Table 3 Metadata of tweet object

Similarly to tweets, the Twitter API associates an object with each user, which indicates several of their properties (Twitter, 2021b). Table 4 displays the user attributes included in our experiments, with their respective types and descriptions. In a similar fashion to the “Hour” feature, we used “Created_at” to compute the age of the account in days, as sketched below.
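As an illustration, the two derived features can be computed as follows. This is a small sketch that assumes the raw “created_at” timestamps have already been parsed into datetime objects; the reference point for the account age (here, the tweet’s posting date) is an assumption for illustration.

```python
from datetime import datetime, timezone


def tweet_hour(tweet_created_at: datetime) -> int:
    """Hour of the day (0-23) at which the tweet was posted ("Hour")."""
    return tweet_created_at.hour


def account_age_days(user_created_at: datetime, reference: datetime) -> int:
    """Age of the author's account in days at the reference date."""
    return (reference - user_created_at).days


# Example with hypothetical timestamps.
posted = datetime(2018, 9, 12, 17, 45, tzinfo=timezone.utc)
registered = datetime(2012, 3, 1, tzinfo=timezone.utc)
print(tweet_hour(posted))                    # 17
print(account_age_days(registered, posted))  # 2386
```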

Table 4 Metadata of user object

To wrap up this section, Table 5 shows, for all datasets, how they ended up after the inclusion of the metadata information.

Table 5 Distribution of tweets for each dataset

4 Experimental settings

4.1 Data preprocessing

We followed standard procedures for the preprocessing of the posts reported for this task, such as the exclusion of non-alphanumeric characters and lowercasing.

Initially, there were three types of metadata: boolean, date, and integer. The boolean features were turned into integers (1/0), and the dates of tweet and account creation were converted to integer values in the form of hours and days, respectively. We then transformed all the metadata using QuantileTransformer (Pedregosa et al., 2011), scaling each feature individually to map its original values onto a uniform distribution, as sketched below.
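A minimal sketch of this transformation with scikit-learn is shown next; the toy metadata matrix is purely illustrative, and in practice the transformer would be fitted on the training portion of each fold only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical metadata matrix: one row per tweet, one column per context variable
# (booleans already cast to 0/1, dates already converted to "Hour" and age in days).
metadata = np.array([
    [3, 120, 1, 0, 15234,  842, 17, 2386],
    [0,   4, 0, 1,   310,   25,  9, 1101],
    [8, 950, 1, 0, 88012, 5021, 22, 3410],
], dtype=float)

# Each feature is mapped independently onto a uniform distribution in [0, 1].
qt = QuantileTransformer(output_distribution="uniform", n_quantiles=3)
metadata_scaled = qt.fit_transform(metadata)
```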

4.2 Experimental design

Figure 1 depicts the pipeline for the construction of the final vector representation that integrates the information from the texts and the metadata.

Fig. 1
figure 1

Diagram of pipeline followed to test the effect of adding metadata vectors to the text

In recent years, several statistical models have shown robust performance on NLP tasks. To cover the main families of approaches, we decided to use three different classification models: one based on a Support Vector Machine, one on a deep GRU network, and one on a transformer-based (BERT) approach.

  • Classical - Bag of words (BoW) with tf-idf weights. This representation is paired with an SVM classifier, one of the most powerful and versatile traditional machine-learning models. In the experiments, we considered word unigrams, as well as a combination of unigrams, bigrams, and trigrams for the representation, and used a linear kernel, C = 1, L2 normalization, and class weighting to account for class imbalance.

  • Deep RNN - Gated Recurrent Unit (GRU). This approach uses GloVe embedding vectors to obtain the representation. The vector of each word is fed sequentially into the recurrent network. The text is represented by the RNN’s hidden state, which then passes through an attention layer and a linear layer to perform the classification. GRUs are a simplified variant of LSTM cells; these networks are specialized to work on sequences as inputs, producing an output at each step and feeding it back to themselves as a form of memory of previous time steps (Géron, 2017). Our network was configured with 100 neurons, the ADAM optimizer, and 300-dimensional GloVe embeddings.

  • Bidirectional Encoder Representations from Transformers (BERT). For this model, we used BERT to represent the texts and a linear layer for their classification. BERT representations jointly condition on left and right context through a deep bidirectional Transformer (Devlin et al., 2019). For the experiments, we fine-tuned the BERT model on the training set; the text is represented by the [CLS] vector, followed by a dense layer for the classification (a sketch of this setup is given after this list).
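Below is a sketch (not the authors’ exact implementation) of how the BERT-based classifier can be combined with the metadata vector: the [CLS] representation of a fine-tuned encoder is concatenated with the already scaled metadata and passed through a single linear layer. The model name, the number of metadata features, and the toy inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class BertWithMetadata(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_metadata=14, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden + n_metadata, n_classes)

    def forward(self, input_ids, attention_mask, metadata):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]             # [CLS] text representation
        features = torch.cat([cls, metadata], dim=1)  # text vector + metadata
        return self.classifier(features)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["an example tweet", "another example"],
                  padding=True, truncation=True, return_tensors="pt")
metadata = torch.rand(2, 14)  # placeholder for the scaled metadata vectors
logits = BertWithMetadata()(batch["input_ids"], batch["attention_mask"], metadata)
```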

Referring again to Fig. 1, note that to evaluate the hypothesis we followed two different classification pipelines for each dataset. The baseline “Text” pipeline uses a feature set built only from the text of the tweets as input to the classifiers; that is, it is the common approach that any classifier would follow for this task. Our proposal (Fig. 1) follows a slightly different configuration, adding the metadata features, which are concatenated at the end of each tweet’s text vector, as sketched below for the classical model.
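The following sketch illustrates both pipelines for the classical model: the baseline uses only the tf-idf text vectors, while the proposed configuration appends the quantile-scaled metadata columns to them before training the SVM. The toy texts, labels, and metadata values are placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import LinearSVC

texts = ["some aggressive tweet", "a friendly message", "another hateful post"]
labels = [1, 0, 1]
metadata = np.array([[3, 120, 1], [0, 4, 0], [8, 950, 1]], dtype=float)  # toy context

vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True)
X_text = vectorizer.fit_transform(texts)

# Baseline "Text" pipeline: only the tf-idf vectors.
svm_text = LinearSVC(C=1.0, class_weight="balanced").fit(X_text, labels)

# Proposed pipeline: append the quantile-scaled metadata columns to the text vector.
X_md = QuantileTransformer(n_quantiles=3).fit_transform(metadata)
X_combined = hstack([X_text, csr_matrix(X_md)])
svm_combined = LinearSVC(C=1.0, class_weight="balanced").fit(X_combined, labels)
```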

4.3 Evaluation

For the experiments, we used the expanded collections and ran 10-fold cross-validation, splitting the data in each fold into 80% for training, 10% for validation, and 10% for testing (see the sketch below). We collected values for the standard text classification measures: Accuracy (ACC), Precision, Recall, and F1-score; excluding accuracy, we report macro averages for the multiclass classification tasks and the values over the abusive class for the binary classification tasks. To test for statistical significance, we applied a Bayesian Wilcoxon signed-rank test to the F1-scores (Benavoli et al., 2014; Benavoli et al., 2017).
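A sketch of this protocol is given next, under the assumption that stratified folds are used and that each fold’s 90% training portion is further split so that the overall proportions are roughly 80/10/10; `X` is assumed to support index arrays (NumPy or sparse matrices), and `build_and_fit` is a hypothetical helper that trains one of the three models.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split


def cross_validate(X, y, build_and_fit, n_splits=10, seed=0):
    """Stratified 10-fold CV; each fold's training part is split again so the
    overall proportions are roughly 80/10/10 (train/validation/test)."""
    y = np.asarray(y)
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        tr_idx, val_idx = train_test_split(
            train_idx, test_size=1 / 9, stratify=y[train_idx], random_state=seed)
        model = build_and_fit(X[tr_idx], y[tr_idx], X[val_idx], y[val_idx])
        preds = model.predict(X[test_idx])
        # Macro average for multiclass tasks; for binary tasks the F1 over the
        # abusive class would be used instead.
        scores.append(f1_score(y[test_idx], preds, average="macro"))
    return scores
```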

5 Results

Table 6 presents the complete results of the experimentation using the original pipeline and the proposed pipeline that includes text+metadata. For clarity, we only include results from the baseline pipeline (text only) and the best result achieved by adding metadata (considering only User metadata, only Post metadata, or the combination of both). Results are very consistent: in all pairwise comparisons, using text+metadata improved classification performance across all datasets. Moreover, with BERT, which has become an important player in the NLP arena, we observe a significant increase in performance for all datasets. Interestingly, in some cases precision increases more than recall when considering metadata, suggesting that context from the tweet directly helps in the identification of non-abusive texts. We would like to note that we also computed results using only metadata, but they were generally poor, on average 20 points lower than using the context information together with the text.

Table 6 Results for the comparison of classifying tweets with and without metadata (+MD)

A more specific question concerns the influence of adding only tweets’ metadata (MD), only users’ MD, or both to the text representation. This evaluation is shown in Fig. 2, where for each ML strategy the three options are evaluated in the Precision vs. Recall space. We observe that adding metadata to the SVM and GRU classifiers increases precision scores while keeping recall scores almost constant with respect to the text-only model. Meanwhile, for BERT, although the improvements are more modest, they occur for both precision and recall. We can appreciate that the models obtained similar proportions of true positives while predicting as abusive a comparable number of tweets that were not (false positives). The BERT model outperformed GRU and SVM in correctly predicting the greatest number of abusive tweets.

Fig. 2
figure 2

Average performance of Text and Text+Metadata approaches. The labels MDTweet and MDUser correspond to the approaches using only metadata from tweets and users, respectively

6 Analysis of results

6.1 Statistical significance analysis

To evaluate the significance of including metadata in the classification pipeline we applied a Bayesian Wilcoxon signed-rank test. This test is a nonparametric Bayesian version of the Wilcoxon signed-rank test built on the Dirichlet process, and it is recommended for directly comparing ML classifiers (Benavoli et al., 2014; Benavoli et al., 2017). Given the observed data, the test computes the posterior probability of the null and alternative hypotheses, providing a straightforward probability of one method being better than the other (when comparing two treatments), thus avoiding the abstract interpretation of frequentist tests.
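As a hedged illustration, such a comparison can be run with the baycomp package, which provides implementations of the Bayesian tests of Benavoli et al.; the F1 values below are placeholders (one per dataset), not the results reported in Table 7, and the rope width is an illustrative choice.

```python
import numpy as np
import baycomp

# Placeholder per-dataset F1 scores, not the values behind Table 7.
f1_text    = np.array([0.71, 0.64, 0.78, 0.69, 0.73, 0.80, 0.75])  # method A
f1_text_md = np.array([0.74, 0.66, 0.80, 0.72, 0.76, 0.82, 0.77])  # method B

# Posterior probabilities that A wins, that both are practically equivalent
# (rope, region of practical equivalence), and that B wins.
p_a, p_rope, p_b = baycomp.two_on_multiple(f1_text, f1_text_md, rope=0.01)
print(p_a, p_rope, p_b)
```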

For this analysis, method A corresponds to the pipeline that relies only on the tweets’ text, while method B is the proposed strategy that combines text and metadata (MD). We present the results of this statistical analysis over the F1 scores in Table 7, where the symbol “>” represents “better than”. We observe that for 2 out of 3 treatments, there is a very high probability (> 0.98) that using text+MD offers better results than using text only. For the SVM, the conclusion is that using metadata (text+MD) is practically equivalent to not using it. One possible explanation is that the amount of metadata added is relatively small in comparison to the size of the original BoW representation, resulting in little impact on the relative position of the support vectors, and thus on the definition of the decision hyperplane; as a result, the classification results remain essentially the same. The deep learning models, on the other hand, appear to generalize this type of information better.

Table 7 Bayesian signed-rank test results for each classifier

For a more compelling and visual interpretation of the results of this analysis we present Fig. 3, where each point represents a statistical comparison between both treatments and each vertex of the triangle is associated with a possible outcome of the comparison, for the a) SVM, b) GRU, and c) BERT strategies. When comparing two algorithms A and B over a specific dataset, the Bayesian Wilcoxon signed-rank test gives the likelihood of three different scenarios: A outperforms B; A and B perform similarly (referred to as the rope, or region of practical equivalence); and B outperforms A. To help visualize this analysis, in Fig. 3 we map 150,000 Monte Carlo samples into barycentric coordinates as proposed by Benavoli et al. (2014), where each vertex of the triangle is associated with one Bayesian test scenario. For example, using the data provided in Table 7, the Bayesian test concluded that for 147,741 out of 150,000 samples, combining text+metadata is advantageous over using text exclusively when predictions are made with BERT.

Fig. 3
figure 3

Visualization of the Bayesian Test for a) LinearSVM, b) GRU, c) BERT classifiers. Method A relies only on the tweets’ text, while Method B uses both text and metadata

6.2 What is the contribution of each metadata feature to the final result?

To measure the dependency between metadata features and labels, we calculated their Mutual Information (MI) scoresFootnote 4. We did this for each feature in each of the seven collections, and then report their average value. The higher this value, the greater the dependency between the given feature and the labelsFootnote 5, and therefore the greater the relevance of the feature for predicting the category of the tweets.
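A minimal sketch of this computation with scikit-learn’s mutual_info_classif is shown below; the feature names, matrix, and labels are placeholders standing in for one of the seven collections, with the per-collection scores then averaged as described above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Placeholder metadata matrix and labels standing in for one collection.
feature_names = ["[U] Statuses count", "[U] Favorites count", "[U] Listed count",
                 "[T] Retweet count", "[T] Hour"]
rng = np.random.default_rng(0)
metadata = rng.random((500, len(feature_names)))   # scaled metadata values
labels = rng.integers(0, 2, size=500)              # abusive / non-abusive labels

mi = mutual_info_classif(metadata, labels, discrete_features=False, random_state=0)
for name, score in sorted(zip(feature_names, mi), key=lambda pair: -pair[1]):
    print(f"{name:22s} {score:.4f}")
```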

Figure 4 shows the average MI score obtained for each metadata feature, where the length of each polar bar corresponds to its score; that is, the bars furthest from the center correspond to the features most associated with the class labels, while those closest to it correspond to features whose values are independent of the labels. Accordingly, we observe that most User-based metadata obtain higher values than Tweet-based features. This suggests the relevance of profile-based features to the classification task at hand, with the most effective being the following user metadata: [Status count], [Favorites count], and [Listed count].

Fig. 4
figure 4

Plot of average Mutual Information between metadata and class labels over all datasets. [U] means the feature comes from the User object; [T] means it comes from the Tweet object. Verified and Created_at were not taken into consideration since these features were not present in all collections; Quote was excluded for obtaining the lowest score among all metadata attributes

6.3 Corrections and new errors when considering metadata

To shed some light on how adding metadata corrects some cases, we present Table 8. From these examples, we elaborate on how metadata could be influencing the final decision of the classifier.

  • The first tweet includes some trigger words, but their context is not clear enough. However, observing that the user who wrote it still has the default profile, which is usually interpreted as a sign of low network engagement, the classifier changed its decision and labeled it as a hateful message.

  • Despite the fact that the second tweet contains profanity in reference to a song’s title, the user has a large number of followers and status updates, which are unusual characteristics for haters in the corresponding dataset.

  • The user who posted the third tweet, containing a white supremacist message, has a small number of friends and followers. In addition, the account is more than six years old yet still uses a default profile. All this extra information led the classifier to modify its decision and mark the tweet as hate speech.

  • Although the fourth tweet contains trigger words such as “gay” and “queer”, the user actually uses them not as insults but to describe and praise an actor. The positiveness of the message is paired with moderately high statuses and friends counts.

Table 8 Corrections by BERT model with metadata

On the other hand, there are cases where adding metadata leads to misclassification. Table 9 presents examples that were correctly classified in the absence of metadata and misclassified when the context was included. Some aspects worth considering to understand what might have caused these new errors are:

  • The user that posted the first tweet, which attacks immigrants, has high follower and listed counts, qualities that, at least for that specific collection, the classifier learned to associate with messages devoid of hate speech.

  • The second tweet employs a trigger word in a distasteful joke; however, the user who wrote it has sizable follower, friend, and favorites counts, does not use a default profile, and has made a lot of status updates, all of which are common characteristics of users who do not usually post offensive messages.

  • The third tweet shows an example of informal language labeled as “Neither HS/offensive”. When considering the metadata, in particular that the tweet was not retweeted or marked as a favorite, the classifier changed its decision from “non-offensive” to “offensive”.

Table 9 New wrong predictions by BERT model with metadata

6.4 Discussion: theoretical and practical implications of our research

Several social networks have decided to ban specific forms of speech. The European Commission, for example, has set a number of commitments to counteract the spread of hate speech in collaboration with businesses such as Facebook and TwitterFootnote 6. In this regard, research on automatic methods for detecting abusive language is important, because a manual and complete assessment of content is clearly unfeasible.

We are aware of the conflict that exists between the route of free expression and the path of content censorship to reduce hate speech (Apple, 2022; DiLeo, 2017). Our stance is that machine learning technology can help to tag social media content and allow users themselves to choose whether to view or block that content. In other words, a classification system may warn users but not restrict them.

Previous works have mainly addressed the identification of abusive communication by exploiting either hand-crafted or learned features from the explicit text in the posts. However, we believe that the consideration of metadata presents an opportunity to improve current detection strategies, as we have shown that these could benefit from information about the social interactions that take place alongside text exchanges. Nonetheless, it is important to acknowledge that including authors’ metadata could raise ethical and fairness concerns, and risks related to racial or gender bias. To avoid this, we need to pay special attention to the type of metadata that is considered. In this sense, an interesting finding of this study is that, when used as a complete set, posts’ metadata is more informative than users’ metadata, thus opening the possibility of avoiding some of these risks.

This study takes advantage of the general usefulness of metadata in social media; as Poletto et al. (2021) note, Twitter is currently the most exploited source of textual data for building collections of abusive language. With this work, we expect to draw more attention to the usefulness, and fair treatment, of metadata by making resources available to continue studying the undesirable phenomenon of abusiveness in social networks.

7 Conclusions

As access to the internet becomes easier for all kinds of purposes, it is important to develop effective methods to moderate and keep an eye on abusive content. In this study, we explored whether the inclusion of metadata features extracted from tweets and their authors offers better results for spotting abusive content than considering only the explicit text of the tweets. To this end, we extended seven Twitter benchmark datasets by including context in the form of metadata features such as retweet count, favorite count, and reply status, among other variables. The results and their statistical analysis strongly suggest that a pipeline considering text and metadata obtains a clear advantage over a traditional approach that uses only text. To reduce model bias we considered three text representation schemes: Bag of Words, GloVe embeddings, and BERT contextualized vectors. Analysis of the results also indicates that if only one metadata feature is to be used, it should be one extracted from the user account, while if they are used as a complete set, then Tweet-based features should be preferred. As future work, we want to explore the addition of followers’ information as metadata that could further improve the representation of the users.