
1 Introduction

There are many open questions surrounding the issue of hate speech. For one, it is strongly debated whether hate speech should be prosecuted, or whether free speech protections should extend to it [2, 11, 16, 24]. Another debated question concerns the best counter-measure to apply: whether it should be suppression (through legal measures, or banning/blocklists), or methods that tackle the root of the problem, namely counter-speech and education [4]. These arguments, however, are fruitless without the ability to detect hate speech en masse. And while manual detection may seem like a simple (albeit hardly scalable) solution, the burden placed on human moderators [15], as well as the sheer amount of data generated online, justify the need for an automatic solution for detecting hateful and offensive content.

1.1 Related Work

The ubiquity of fast, reliable Internet access, which enables the sharing of information and opinions at an unprecedented rate, paired with the opportunity for anonymity [50], has been responsible for the increase in the spread of offensive and hateful content in recent years. For this reason, the detection of hate speech has been examined by many researchers [21, 48]. These efforts date back to the late nineties and Microsoft Research, with the proposal of a rule-based system named Smokey [36]. This has been followed by many similar proposals for rule-based [29], template-based [27], or keyword-based systems [14, 21].

In the meantime, many researchers have tackled this task using classical machine learning methods. After applying the Bag-of-Words (BoW) method for feature extraction, Kwok and Wang [19] used a Naïve Bayes classifier for the detection of racism against black people on Twitter. Grevy et al. [13] used Support Vector Machines (SVMs) on BoW features for the classification of racist texts. However, since the BoW approach was shown to lead to high false positive rates [6], others used more sophisticated feature extraction methods to obtain input for the classical machine learning methods (such as SVM, Naïve Bayes, and Logistic Regression [5, 6, 39, 40]) deployed for the detection of hateful content.
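
To make the classical approach concrete, the following is a minimal sketch (in Python, with scikit-learn) of an SVM trained on BoW features; the two-tweet corpus and its labels are placeholders for illustration, not data from any of the cited studies.

    # Sketch: SVM on Bag-of-Words features (placeholder corpus and labels)
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    tweets = ["you are all horrible people", "what a lovely day today"]  # placeholder corpus
    labels = [1, 0]  # 1 = hateful/offensive, 0 = not

    model = make_pipeline(CountVectorizer(), LinearSVC())
    model.fit(tweets, labels)
    print(model.predict(["another horrible tweet"]))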

One milestone in hate speech detection came when deep learning gained traction in Natural Language Processing (NLP) after its success in pattern recognition and computer vision [44], propelling the field forward [31]. The introduction of embeddings [26] played an important role in this process. For one, embeddings provided useful features to the same classical machine learning algorithms for hate speech detection [25, 45], leading to significantly better results than those attained with the BoW approach (both in terms of memory complexity and classification scores [9]). Other deep learning approaches were also popular for the task, including Recurrent Neural Networks [1, 7, 10, 38], Convolutional Neural Networks [1, 12, 32, 51], and methods that combined the two [17, 41, 49].
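
As an illustration of the embedding-based variant, the sketch below averages pretrained word vectors into tweet-level features for a classical classifier; the embedding file name ("embeddings.bin"), the use of gensim, and the choice of Logistic Regression are assumptions made purely for illustration.

    # Sketch: averaged pretrained word embeddings as classifier features
    # (the embedding file is an assumed, locally available word2vec model)
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.linear_model import LogisticRegression

    vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

    def embed(tweet):
        # Mean of the known word vectors; zero vector if no word is known
        known = [vectors[w] for w in tweet.split() if w in vectors]
        return np.mean(known, axis=0) if known else np.zeros(vectors.vector_size)

    tweets = ["an offensive example", "a harmless example"]  # placeholder data
    labels = [1, 0]
    clf = LogisticRegression().fit(np.stack([embed(t) for t in tweets]), labels)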

The introduction of transformers was another milestone, in particular the substantial improvement in text classification performance brought by BERT [37]. What is more, transformer models have proved highly successful in hate speech detection competitions (most of the top ten teams used a transformer in a recent challenge [46]). Ensembles of transformers have also proved successful in hate speech detection [28, 30], so much so that such a solution recently attained the best performance (i.e. the best average performance over several sub-tasks) in a challenge with more than fifty participants [35]. For this reason, we also decided to use an ensemble of transformer models.

1.2 Contribution

Here, we apply a 5-fold ensemble training method using the RoBERTa model, which enables us to attain state-of-the-art performance on the HASOC benchmark. Moreover, we propose an additional fine-tuning step that significantly increases the performance of the models trained on the individual folds.

2 Experimental Materials

In this section we discuss the benchmark task to be tackled. Due to the relevance of the problem, many competitions have been dedicated to finding a solution [3, 18, 23, 42, 46]. For this study, we chose the challenge posed by the HASOC [22] data (in particular, the English-language data) as the target of our methods. Here, 6712 tweets (the training and test sets containing 5852 and 860 tweets, respectively) were annotated with the following categories:

  • NOT: tweets not considered to contain hateful or offensive content

  • HOF: tweets considered to be hateful, offensive, or profane

In the training set, approximately \(39\%\) of all instances (2261 tweets) belong to the second category, while for the test set this ratio is \(28\%\) (240 tweets). Some example tweets from the training set are listed in Table 1. As can be seen, in some cases it is not quite clear why one tweet was labelled as hateful while others were not (#fucktrump). Other examples with debatable labelling are listed in the system description papers from the original HASOC competition [33]. This may explain why Ross et al. suggested treating hate speech detection as a regression task, as opposed to a classification task [34].

Table 1. Example tweets from the HASOC dataset [22].

2.1 OffensEval

In line with the above suggestion, the training data published by Zampieri et al. for the 2020 OffensEval competition [47] consist not of class labels but of scores, as can be seen in Table 2 below. Here, for reasons of time efficiency, we used only the first one million of the 9,089,140 tweets available in the training set.

Table 2. Example tweets from the OffensEval corpus [47].

3 Experimental Methods

In this section we discuss the processing pipeline we used for the classification of HASOC tweets. This includes the text preprocessing steps taken, a short description of the machine learning models used, as well as the training method applied to said models.

3.1 Text Preprocessing

Text from social media sites, and in particular Twitter, often lacks proper grammar and punctuation, and contains many paralinguistic elements (e.g. URLs, emoticons, emojis, hashtags). To alleviate potential problems caused by this variability, tweets were put through a preprocessing step before being fed to our model. First, consecutive white space characters were replaced by a single instance, while extra white space characters were added between words and punctuation marks. Then, @-mentions and links were replaced by the strings @USER and URL, respectively. Furthermore, as our initial analysis did not find a significant correlation between emojis and hatefulness scores on the more than nine million tweets of the OffensEval dataset, all emojis and emoticons were removed. Hashtag characters (but not the hashtags themselves) were also removed in the process. Lastly, tweets were tokenized into words.
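
A minimal sketch of this preprocessing pipeline is given below (in Python); the exact regular expressions and the emoji ranges shown are simplifications for illustration, not the exact implementation used in our experiments.

    # Sketch of the preprocessing steps described above (simplified regexes;
    # the emoji ranges are approximate, not the full set that was removed)
    import re

    def preprocess(tweet):
        tweet = re.sub(r"@\w+", "@USER", tweet)        # replace @-mentions
        tweet = re.sub(r"https?://\S+", "URL", tweet)  # replace links
        tweet = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", tweet)  # drop emojis/emoticons
        tweet = tweet.replace("#", "")                 # drop the hashtag character, keep the tag text
        tweet = re.sub(r"([.,!?;:])", r" \1 ", tweet)  # pad punctuation with spaces
        tweet = re.sub(r"\s+", " ", tweet).strip()     # collapse consecutive whitespace
        return tweet.split()                           # tokenize into words

    print(preprocess("@someone   check this out!!! https://t.co/xyz #Example"))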

3.2 RoBERTa

For tweet classification and regression, in our study we used a variant of BERT [8], namely RoBERTa [20], from the SimpleTransformers library [43] (for a detailed description of transformers in general, as well as BERT and RoBERTa in particular, please see the sources cited in this paper). We did so encouraged by the text classification performance of BERT [37], as well as by our preliminary experiments with RoBERTa. When training said model, we followed [43] in selecting values for our meta-parameters, with the exception of the learning rate, for which we used a value of \(10^{-5}\).
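
For illustration, a minimal sketch of how such a model can be set up with the SimpleTransformers library is shown below; apart from the learning rate, all meta-parameters are left at the library defaults, and the two-row training DataFrame is a placeholder.

    # Sketch: fine-tuning RoBERTa with SimpleTransformers (placeholder data;
    # meta-parameters other than the learning rate use the library defaults)
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    train_df = pd.DataFrame({
        "text": ["a hateful example tweet", "a harmless example tweet"],  # placeholder
        "labels": [1, 0],
    })

    model = ClassificationModel(
        "roberta", "roberta-base",
        args={"learning_rate": 1e-5},  # the one meta-parameter we changed
        use_cuda=False,                # set to True when a GPU is available
    )
    model.train_model(train_df)
    predictions, raw_outputs = model.predict(["a new tweet to score"])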

3.3 5-Fold Ensemble Training

In our experiments we used the following training scheme. First, we split the HASOC training set into five equal parts, each consisting of 1170 tweets (\(Dev_1\), \(Dev_2\), \(Dev_3\), \(Dev_4\), \(Dev_5\)). This partitioning was carried out in such a way that the ratio of the two classes was the same in each subset as in the whole set. Then, for each development set, we created a training set using the remaining tweets from the original training set (\(Train_1\), \(Train_2\), \(Train_3\), \(Train_4\), \(Train_5\)). After creating the five folds in this manner, we used each fold to train a separate RoBERTa model. The final model was then defined as the ensemble of the five individual models, whose predictions were obtained by averaging the predicted scores of the individual models.
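
The sketch below illustrates this scheme using scikit-learn's StratifiedKFold (which preserves the class ratio in each fold); a trivial classifier stands in for the per-fold RoBERTa models, and the corpus is synthetic.

    # Sketch of the 5-fold ensemble: stratified splits keep the NOT/HOF
    # ratio equal across folds; the final score is the mean over the models
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import make_pipeline

    texts = np.array(["tweet %d" % i for i in range(20)])  # placeholder corpus
    labels = np.array([0, 1] * 10)                         # 0 = NOT, 1 = HOF

    models = []
    for train_idx, _dev_idx in StratifiedKFold(n_splits=5).split(texts, labels):
        # Train_i = the remaining four fifths; Dev_i = the held-out fifth
        clf = make_pipeline(CountVectorizer(), LogisticRegression())
        models.append(clf.fit(texts[train_idx], labels[train_idx]))

    test = ["a tweet to classify"]
    avg_scores = np.mean([m.predict_proba(test)[:, 1] for m in models], axis=0)
    print(avg_scores > 0.5)  # ensemble decision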

Here, we examined two different ensembles. For the first (\(HASOC_{only}\)), we used a pretrained RoBERTa model [43] and fine-tuned it on the different folds, creating five different versions of the model. Then, we averaged the predictions of these models for the final classification. To examine how further fine-tuning would affect the results, for the second ensemble we first fine-tuned the RoBERTa model using one million tweets from the OffensEval competition for training, and ten thousand tweets for validation. Then, we further fine-tuned the resulting model on the HASOC folds in the manner described above. However, since the first fine-tuning resulted in a regression model, when further fine-tuning these models we first replaced the NOT and HOF labels with values of 0 and 1, respectively. In this case, the predicted scores were rescaled to the 0–1 interval by min-max normalization before classification (\(HASOC_{OffensEval}\)).
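
The rescaling step can be sketched as follows; the averaged scores and the 0.5 decision threshold are illustrative assumptions, not values from our experiments.

    # Sketch: min-max rescaling of averaged regression scores to [0, 1]
    import numpy as np

    avg_scores = np.array([-0.83, 0.12, 1.74, 0.40])  # placeholder averaged outputs

    rescaled = (avg_scores - avg_scores.min()) / (avg_scores.max() - avg_scores.min())
    predicted = (rescaled >= 0.5).astype(int)  # 0 = NOT, 1 = HOF; threshold illustrative
    print(rescaled, predicted)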

Table 3. \(F_1\)-scores of different models on the test set of the HASOC benchmark. For each model, and each \(F_1\)-score, the best result is emphasized in bold.

4 Results and Discussion

We evaluated the resulting ensembles on the HASOC test set. The results of these experiments are listed in Table 3. As can be seen, as a result of further fine-tuning using OffensEval data, the performance of the individual models increased significantly (applying the paired t-test, we find the difference significant at \(p<0.05\) for both the macro and the weighted \(F_1\)-score). The difference in the performance of the ensembles, however, is much less marked. A possible explanation is that the five models may be more similar to each other in the \(HASOC_{OffensEval}\) case (given that here the original model went through more fine-tuning with the same data). Furthermore, while in the case of the \(HASOC_{only}\) model the ensemble attains better \(F_1\)-scores using both metrics, this is not the case with the \(HASOC_{OffensEval}\) model, where the best performance is attained by the model trained on the 5th fold. Regardless, both ensemble methods outperform the winner of the HASOC competition [38] in both \(F_1\)-score measures (the winning team achieved scores of 0.7882 and 0.8395 in terms of macro and weighted \(F_1\)-score, respectively).
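
For reference, the significance test mentioned above can be carried out as in the sketch below; the per-fold \(F_1\)-scores shown are placeholders, not the values from Table 3.

    # Sketch: paired t-test over per-fold F1-scores (placeholder values)
    from scipy import stats

    f1_hasoc_only = [0.78, 0.79, 0.77, 0.80, 0.78]        # hypothetical per-fold scores
    f1_hasoc_offenseval = [0.81, 0.82, 0.80, 0.83, 0.81]  # hypothetical per-fold scores

    t_stat, p_value = stats.ttest_rel(f1_hasoc_only, f1_hasoc_offenseval)
    print(p_value < 0.05)  # is the difference significant at the 5% level?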

5 Conclusions and Future Work

In this study we described a simple ensemble of transformers for the task of hate speech detection. Results on the HASOC challenge showed that this ensemble is capable of attaining state-of-the-art performance. Moreover, we managed to improve the attained results through additional pre-training using in-domain data. In the future we plan to modify our pre-training approach so that the models responsible for the different folds are first pre-trained using different portions of the OffensEval data. Furthermore, we intend to extend our experiments to other hate speech datasets and challenges, as well as to other transformer models. Lastly, we also intend to examine the explainability of the resulting models.