
1 Introduction

Online social media has enabled the dissemination of information at a faster rate than ever before [22, 23]. This has also allowed bad actors to exploit these platforms for nefarious purposes such as spreading propaganda, fake news, and hate speech. Hate speech is defined as a “direct and serious attack on any protected category of people based on their race, ethnicity, national origin, religion, sex, gender, sexual orientation, disability or disease” [11]. Representative examples of hate speech are provided in Table 1.

Hate speech is increasingly becoming a concerning issue in several countries. Crimes related to hate speech have been on the rise in recent times, some of them leading to severe incidents such as the genocide of the Rohingya community in Myanmar, the anti-Muslim mob violence in Sri Lanka, and the Pittsburgh shooting. Frequent and repetitive exposure to hate speech has been shown to desensitize individuals to this form of speech, leading to lower evaluations of the victims and greater distancing, and thereby increasing outgroup prejudice [35]. Public expressions of hate speech have also been shown to contribute to the devaluation of minority group members [18], the exclusion of minorities from society [26], and the discriminatory distribution of public resources [12].

While research in hate speech detection has been growing rapidly, one of the current issues is that the majority of the datasets are available only in English. Thus, hate speech in other languages is often not detected properly, which could be detrimental. While a few datasets [3, 27] are available in other languages, we observe that they are relatively small in size.

Table 1. Examples of hate speech.

In this paper, we perform the first large-scale analysis of multilingual hate speech by analyzing the performance of deep learning models on 16 datasets from 9 different languages. We consider two different scenarios and discuss the classifier performance. In the first scenario (monolingual setting), we train and test on the same language. We observe that in the low-resource case, models using LASER embeddings with logistic regression perform the best, whereas in the high-resource case, BERT-based models perform much better. We also observe that simple techniques such as translating to English and using BERT achieve competitive results in several languages. In the second scenario (multilingual setting), we use training data from all the other languages and test on one target language. Here, we observe that including data from other languages is quite effective, especially when there is almost no training data available for the target language (the zero-shot setting). Finally, from the summary of the results that we obtain, we construct a catalogue indicating which model is effective for a particular language depending on the extent of the data available. We believe that this catalogue is one of the most important contributions of our work, which can be readily referred to by future researchers working to advance the state-of-the-art in multilingual hate speech detection.

The rest of the paper is structured as follows. Section 2 presents the related literature on hate speech classification. In Sect. 3, we present the datasets used for the analysis. Section 4 provides details about the models and experimental settings. In Sect. 5, we note the key results of our experiments. In Sect. 6, we discuss the results and provide an error analysis.

2 Related Works

Hate speech lies in a complex nexus with freedom of expression, individual, group and minority rights, as well as concepts of dignity, liberty and equality [16]. Computational approaches to tackle hate speech have recently gained a lot of interest. Earlier efforts to build hate speech classifiers used simple methods such as dictionary lookup [19] and bag-of-words [6]. Fortuna et al. [13] conducted a comprehensive survey on this subject.

With the availability of larger datasets, researchers started using more complex models to improve classifier performance. These include deep learning [37] and graph embedding techniques [30] to detect hate speech in social media posts. Zhang et al. [37] used a deep neural network combining convolutional and gated recurrent layers to improve the results on 6 out of 7 datasets used. In this paper, we use the same CNN-GRU model in one of our experimental settings (the monolingual scenario).

Research into the multilingual aspect of hate speech is relatively new. Datasets for languages such as Arabic and French [27], Indonesian [21], Italian [33], Polish [29], Portuguese [14], and Spanish [3] have been made available for research. To the best of our knowledge, very few works have tried to utilize these datasets to build multilingual classifiers. Huang et al. [20] used Twitter hate speech corpora from five languages and annotated them with demographic information. Using this new dataset, they studied the demographic bias in hate speech classification. Corazza et al. [8] used three datasets from three languages (English, Italian, and German) to study multilingual hate speech. The authors used models such as SVM and Bi-LSTM to build hate speech detection models. Our work differs from these existing works in that we perform the experiments on a much larger set of languages (9) using more datasets (16). Our work tries to utilize the existing hate speech resources to develop models that could generalize to hate speech detection in other languages.

3 Dataset Description

We looked into the datasets available for hate speech and found 16 publiclyFootnote 1 available sources in 9 different languagesFootnote 2. One of the immediate issues we observed was the mixing of several types of categories (offensive, profanity, abusive, insult, etc.). Although these categories are related to hate speech, they should not be considered the same [9]. For this reason, we only use two labels, hate speech and normal, and discard the other labels. Next, we explain the datasets in the different languages. The overall dataset statistics are noted in Table 2.

Arabic: We found two Arabic datasets that were built for hate speech detection.

  • Mulki et al. [25]: A Twitter datasetFootnote 3 for hate speech and abusive language. For our task, we ignored the abusive class and only considered the hate and normal classes.

  • Ousidhoum et al. [27]: A Twitter datasetFootnote 4 with multi-label annotations. We have only considered those datapoints which have either hate speech or normal in the annotation label.

Table 2. Dataset details

English: The majority of the hate speech datasets are available in the English language. We select six such publicly available datasets.

  • Davidson et al. [9] provided a three-class Twitter datasetFootnote 5, the classes being hate speech, abusive speech, and normal. We have only considered the hate speech and normal classes for our task.

  • Gibert et al. [17] provided a hate speech datasetFootnote 6 consisting of sentences from StormfrontFootnote 7, a white supremacist forum. Each sentence is tagged as either hate or normal.

  • Waseem et al. [36] provided a Twitter datasetFootnote 8 annotated into three classes: sexism, racism, and neither. We considered the tweets tagged as sexism or racism as hate speech and those tagged as neither as normal.

  • Basile et al. [3] provided a multilingual Twitter datasetFootnote 9 for hate speech against immigrants and women. Each post is tagged as either hate speech or normal.

  • Ousidhoum et al. [27] provided a Twitter dataset (See Footnote 6) with multi-label annotations. We have only considered those datapoints which have either hate speech or normal in the annotation label.

  • Founta et al. [15] provided a large datasetFootnote 10 of 100K annotations divided into four classes: hate speech, abusive, spam, and normal. For our task, we have only considered the datapoints marked as either hate or normal, and ignored the other classes.

German: We select two datasets available in the German language.

  • Ross et al. [32] provided a German hate speech datasetFootnote 11 for the refugee crisis. Each tweet is tagged as hate speech or normal.

  • Bretschneider et al. [5] provided a Facebook hate speech datasetFootnote 12 against foreigners and refugees.

Indonesian: We found two datasets for the Indonesian language.

  • Ibrohim et al. [21] provided an Indonesian multi-label hate speech and abusive language datasetFootnote 13. We only consider the hate speech label for our task; the other labels are ignored.

  • Alfina et al. [1] provided an Indonesian hate speech datasetFootnote 14. Each post is tagged as hateful or normal.

Italian: We found two datasets for the Italian language.

  • Sanguinetti et al. [33] provided an Italian hate speech datasetFootnote 15 against the minorities in Italy.

  • Bosco et al. [4] provided a hate speech datasetFootnote 16 collected from Twitter and Facebook.

Polish: We found only one dataset for the Polish language.

  • Ptaszynski et al. [29] provided a cyberbullying datasetFootnote 17 for the Polish language. We have only considered the hate speech and normal classes for our task.

Portuguese: We found one dataset for the Portuguese language.

  • Fortuna et al. [14] developed a hierarchical hate speech datasetFootnote 18 for the Portuguese language. For our task, we have used the binary class of hate speech or normal.

Spanish: We found two datasets for the Spanish language.

  • Basile et al. [3] provided a multilingual hate speech dataset (See Footnote 11) against immigrants and women.

  • Pereira et al. [28] provided a hate speech datasetFootnote 19 for the Spanish language.

French: We found one dataset for the French language.

  • Ousidhoum et al. [27] provided a Twitter dataset (See Footnote 6) with multi-label annotations. We have only considered those data points which have either hate speech or normal in the annotation label.

4 Experiments

For each language, we combine all the datasets and perform a stratified train/validation/test split in the ratio 70%/10%/20%. For all the experiments, we use the same train/val/test splits; thus, the results are comparable across different models and settings. We report the macro F1-score to measure classifier performance. Whenever we select a subset of the dataset for an experiment, we repeat the subset selection with 5 different random sets and report the average performance. This helps reduce the performance variation across different sets. In our experiments, the subsets are stratified samples of size 16, 32, 64, 128, and 256.
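A minimal sketch of this splitting protocol is shown below, assuming a pandas DataFrame with text and label columns (the column names and helper functions are illustrative, not the exact ones used in our code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_splits(df, seed=42):
    """70%/10%/20% stratified train/validation/test split on the `label` column."""
    train, rest = train_test_split(df, test_size=0.30,
                                   stratify=df["label"], random_state=seed)
    val, test = train_test_split(rest, test_size=2 / 3,
                                 stratify=rest["label"], random_state=seed)
    return train, val, test

def low_resource_subsets(train, sizes=(16, 32, 64, 128, 256), n_runs=5):
    """Yield stratified subsets of the training set, 5 random runs per size."""
    for size in sizes:
        for run in range(n_runs):
            yield size, run, (train.groupby("label", group_keys=False)
                              .apply(lambda g: g.sample(frac=size / len(train),
                                                        random_state=run)))
```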

4.1 Embeddings

In order to train models in the multilingual setting, we need multilingual word/sentence embeddings. We use LASER embeddings for sentences and MUSE embeddings for words.

Laser embeddings: LASERFootnote 20 denotes Language-Agnostic SEntence Representations [2]. Given an input sentence, LASER provides a sentence embedding, which is obtained by applying a max-pooling operation over the output of a BiLSTM encoder. The system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages.
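As an illustration, sentence embeddings can be obtained through the laserembeddings Python package; our choice of wrapper here is an assumption, and any interface to the pretrained LASER encoder would serve equally well:

```python
from laserembeddings import Laser  # pretrained models must be downloaded once:
                                   # python -m laserembeddings download-models

laser = Laser()

sentences = ["I hate you", "Je te déteste"]
# Returns one 1024-dimensional vector per sentence; `lang` selects the tokenizer.
embeddings = laser.embed_sentences(sentences, lang=["en", "fr"])
print(embeddings.shape)  # (2, 1024)
```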

Muse embeddings: MUSEFootnote 21 denotes Multilingual Unsupervised and Supervised Embeddings. Given an input word, MUSE gives as output the corresponding word embedding [7]. MUSE builds a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
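The aligned MUSE vectors are distributed as fastText-style .vec text files; below is a small loader sketch (the file name in the comment and the word limit are illustrative):

```python
import numpy as np

def load_muse_vectors(path, max_words=200_000):
    """Load aligned MUSE word vectors from a fastText-style .vec file."""
    vectors = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        next(f)  # header line: "<vocab size> <dimension>"
        for i, line in enumerate(f):
            if i >= max_words:
                break
            word, rest = line.rstrip().split(" ", 1)
            vectors[word] = np.array(rest.split(" "), dtype=np.float32)
    return vectors

# e.g. wiki.multi.en.vec and wiki.multi.es.vec live in a shared embedding space,
# so a classifier trained over one language's vectors can be applied to the other.
```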

4.2 Models

CNN-GRU (Zhang et al. [37]): This model first maps each word in a sentence to a 300-dimensional vector using the pretrained Google News Corpus embeddings [24]. It also pads/clips the sentences to a maximum of 100 words. The resulting \(300 \times 100\) matrix is passed through a dropout layer and then a 1-D convolution layer with 100 filters. A max-pooling layer further reduces the output to a \(25 \times 100\) feature matrix. This is passed through a GRU layer, whose \(25 \times 100\) output is globally max-pooled to give a \(1 \times 100\) vector, which is finally passed through a softmax layer to obtain the prediction.
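A minimal Keras sketch of this architecture follows; the kernel size, pooling size, dropout rate, and vocabulary size are our assumptions, chosen so that the intermediate dimensions match the description above:

```python
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM, VOCAB_SIZE = 100, 300, 50_000  # vocabulary size is illustrative

model = models.Sequential([
    # In practice the embedding matrix is initialized from the Google News vectors.
    layers.Embedding(VOCAB_SIZE, EMB_DIM, input_length=MAX_LEN),
    layers.Dropout(0.2),
    layers.Conv1D(filters=100, kernel_size=4, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=4),        # 100 x 100 -> 25 x 100
    layers.GRU(100, return_sequences=True),  # 25 x 100 hidden states
    layers.GlobalMaxPooling1D(),             # 1 x 100
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```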

BERT: BERT [10] stands for Bidirectional Encoder Representations from Transformers, pretrained on English-language data. It is a stack of transformer encoder layers with multiple “heads”, i.e., fully connected neural networks augmented with a self-attention mechanism. For every input token in a sequence, each head computes key, value, and query vectors, which are used to create a weighted representation. The outputs of all heads in the same layer are combined and run through a fully connected layer. Each layer is wrapped with a skip connection and followed by layer normalization. In our model, we set the maximum token length to 128 for faster processing of the queryFootnote 22.

mBERT: Multilingual BERT (mBERTFootnote 23) is a version of BERT that was trained on Wikipedia in 104 languages. Languages with a lot of data were sub-sampled and others were super-sampled, and the model was pretrained using the same method as BERT. mBERT generalizes across some scripts and can retrieve parallel sentences. Although mBERT is simply trained on a multilingual corpus with no language IDs, it encodes language identities. We used mBERT to train hate speech detection models in different languages, once again limiting the sentence representation to a maximum of 128 tokens.
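Fine-tuning BERT or mBERT for our binary task can be sketched with the HuggingFace transformers library; the snippet below shows a single illustrative gradient step on toy data, not our actual training loop:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

NAME = "bert-base-multilingual-cased"  # or "bert-base-uncased" for Translation + BERT
tokenizer = BertTokenizer.from_pretrained(NAME)
model = BertForSequenceClassification.from_pretrained(NAME, num_labels=2)

texts, labels = ["example hateful post", "example normal post"], [1, 0]  # toy data
batch = tokenizer(texts, padding="max_length", truncation=True,
                  max_length=128, return_tensors="pt")  # 128-token limit as above

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=torch.tensor(labels)).loss
loss.backward()
optimizer.step()
```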

Translation: One simple way to utilize datasets in different languages is to rely on translation. Simple translation techniques have been shown to give good results in tasks such as sentiment analysis [34]. We use Google TranslateFootnote 24 to convert all the datasets in different languages to English, since translation to English from other languages typically has fewer errors than the other way round.
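For illustration only, such a preprocessing step could be scripted with the unofficial googletrans client; the paper does not prescribe a particular client, and any interface to Google Translate would serve the same purpose:

```python
from googletrans import Translator  # unofficial Google Translate client

translator = Translator()

def to_english(texts, src="auto"):
    """Translate a list of posts to English before feeding them to BERT."""
    return [translator.translate(t, src=src, dest="en").text for t in texts]
```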

For our experiments, we use the following four models:

  1. MUSE + CNN-GRU: For the given input sentence, we first obtain the corresponding MUSE embeddings, which are then passed as input to the CNN-GRU model.

  2. Translation + BERT: The input sentence is first translated to English and then provided as input to the BERT model.

  3. LASER + LR: For the given input sentence, we first obtain the corresponding LASER embeddings, which are then passed as input to a Logistic Regression (LR) model (a minimal sketch follows this list).

  4. mBERT: The input sentence is directly fed to the mBERT model.
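As a concrete illustration of the third configuration, a minimal LASER + LR pipeline could look as follows (again assuming the laserembeddings wrapper and scikit-learn; the helper name is ours):

```python
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

laser = Laser()

def laser_lr(train_texts, train_labels, test_texts, test_labels, lang):
    """Train logistic regression on LASER sentence embeddings and report macro F1."""
    X_train = laser.embed_sentences(train_texts, lang=lang)
    X_test = laser.embed_sentences(test_texts, lang=lang)
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return f1_score(test_labels, clf.predict(X_test), average="macro")
```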

4.3 Hyperparameter Optimization

We use the validation set performance to select the best set of hyperparameters for the test set. The hyperparameters used in our experiments are as follows: batch size: 16; learning rate: \(2 \times 10^{-5}\), \(3 \times 10^{-5}\), \(5 \times 10^{-5}\); and epochs: 1, 2, 3, 4, 5.
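The resulting grid search can be sketched as follows; train_and_evaluate is a hypothetical placeholder standing in for fine-tuning a model with the given configuration and returning its validation macro F1:

```python
from itertools import product

def train_and_evaluate(lr, epochs, batch_size):
    """Placeholder: fine-tune the model with this configuration and
    return the macro F1-score on the validation set."""
    return 0.0  # replace with the actual training routine

best_config, best_f1 = None, -1.0
for lr, n_epochs in product([2e-5, 3e-5, 5e-5], [1, 2, 3, 4, 5]):
    val_f1 = train_and_evaluate(lr=lr, epochs=n_epochs, batch_size=16)
    if val_f1 > best_f1:
        best_config, best_f1 = (lr, n_epochs), val_f1
```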

5 Results

5.1 Monolingual Scenario

In this setting, we use data from the same language for training, validation, and testing. This scenario commonly occurs in the real world, where a monolingual dataset is used to build a classifier for a specific language.

Observations: Table 3 reports the results of the monolingual scenario. As expected, we observe that with increasing training data, the classifier performance increases as well. However, the relative performance seems to vary depending on the language and the model. We make several observations. First, LASER + LR performs the best in low-resource settings (16, 32, 64, 128, 256) for all the languages. Second, we observe that MUSE + CNN-GRU performs the worst in almost all the cases. Third, Translation + BERT achieves competitive performance for some of the languages such as German, Polish, Portuguese, and Spanish. Overall, we observe that there is no ‘one single recipe’ for all languages; however, Translation + BERT seems to be an excellent compromise. We believe that improved translations for some languages can further improve the performance of this model.

Although LASER + LR seems to do well in the low-resource setting, when enough data is available, we observe that the BERT-based models, Translation + BERT (English, German, Polish, and French) and mBERT (Arabic, Indonesian, Italian, and Spanish), do much better. What is more interesting is that although BERT-based models are known to be successful when a larger number of datapoints is available, even with 256 datapoints some of these models come very close to LASER + LR; for instance, Translation + BERT (Spanish, French) and mBERT (Arabic, Indonesian, Italian).

Table 3. Monolingual scenario: the training, validation and testing data are from the same language. Here, Full D represents the full training data. The bold figures represent the best scores and the underlined figures the second best.
Table 4. Multilingual scenario: the training data is from all the languages except one, and the validation and testing data are from the remaining language. The bold figures represent the best scores.

5.2 Multilingual Scenario

In this setting, we use the datasets from all the languages except one \((N-1)\) for training, and the validation and test sets of the remaining language. This scenario represents the case where one wishes to employ existing hate speech datasets to build a classifier for a new language. We consider LASER + LR and mBERT, which are the most relevant models for this analysis. In the LASER + LR model, we take the LASER embeddings from the \((N-1)\) languages and add to this the target language data points in incremental steps of 16, 32, 64, 128, and 256. The logistic regression model is trained on the combined data, and we test it on the held-out test set of the target language.
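A sketch of this protocol is shown below, assuming per-language lists of texts, labels, and language codes (the data structures and helper name are illustrative; for simplicity the sketch takes the first n_target points, whereas our experiments use stratified samples):

```python
import numpy as np
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

laser = Laser()

def multilingual_laser_lr(source_langs, target, n_target=0):
    """source_langs: list of (texts, labels, lang_code) for the N-1 languages.
    target: dict with train/test texts, labels and its lang code.
    n_target: number of target-language training points to add (0 = zero shot)."""
    X = np.vstack([laser.embed_sentences(t, lang=c) for t, _, c in source_langs])
    y = np.concatenate([np.asarray(l) for _, l, _ in source_langs])
    if n_target > 0:
        X = np.vstack([X, laser.embed_sentences(target["train_texts"][:n_target],
                                                lang=target["lang"])])
        y = np.concatenate([y, np.asarray(target["train_labels"][:n_target])])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    preds = clf.predict(laser.embed_sentences(target["test_texts"],
                                              lang=target["lang"]))
    return f1_score(target["test_labels"], preds, average="macro")
```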

To use mBERT in the multilingual setting, we adopt a two-step fine-tuning method. For a language L, we use the datasets of the \(N-1\) languages (all except the \(L^\text{th}\) language) to train the mBERT model. On this trained mBERT model, we perform a second stage of fine-tuning using the training data of the target language in incremental steps of 16, 32, 64, 128, and 256. The model is then evaluated on the test set of the \(L^\text{th}\) language.

We also test the models for zero-shot performance. In this case, the model is not provided any data of the target language. That is, the model is trained on the \((N-1)\) languages and directly tested on the \(N^\text{th}\) language test set. This corresponds to the case in which we would like to directly deploy a hate speech classifier for a language that does not have any training data.

Observations: Table 4 reports the results of the multilingual scenario. Similar to the monolingual scenario, we observe that with increasing training data, the classifier performance increases in general. This is especially true in the low-resource settings of target languages such as English, Indonesian, Italian, Polish, and Portuguese.

In the case of zero-shot evaluation, we observe that mBERT performs better than LASER + LR in three languages (Arabic, German, and French). LASER + LR performs better on the remaining six languages, with the results for Italian and Portuguese being particularly good. In the case of Portuguese, zero-shot LASER + LR (without any Portuguese training data) obtains an F-score of 0.6567, close to the best result of 0.6941 (using the full Portuguese training data).

For languages such as Arabic, German, and French, mBERT performs better than LASER + LR in almost all the cases (low resource and Full D). LASER + LR, on the other hand, performs well for the Portuguese language in all the cases. For the remaining five languages, we observe that LASER + LR performs better in low-resource settings, but on using the full training data of the target language, mBERT performs better.

5.3 Possible Recipes Across Languages

As we have used the same test set for both scenarios, we can easily compare the results to assess which is better. Using the results from the monolingual and multilingual scenarios, we can decide the best kind of model to use based on the availability of data. The possible recipes are presented as a catalogue in Table 5. Overall, we observe that the LASER + LR model works better in low-resource settings while BERT-based models work well in high-resource settings. This possibly indicates that BERT-based models, in general, work well when more data is available, thus allowing for more accurate fine-tuning. We believe that this catalogue is one of the most important contributions of our work, which can be readily referred to by future researchers working to advance the state-of-the-art in multilingual hate speech detection.

Table 5. The table describes the best model to use in the low and high resource scenarios. In general, LASER + LR performs well in low-resource settings and BERT-based models are better in high-resource settings.

6 Discussion and Error Analysis

6.1 Interpretability

In order to compare the interpretability of mBERT and LASER + LR, we use LIME [31] to calculate the average importance given to words by a particular model. We compute the top 5 most predictive words and their attention for each sentence in the test set. The total score for each word is calculated by summing up all the attentions for each of the sentences where the word occurs in the top 5 LIME features. The average predictive score for each word is calculated by dividing this total score by the occurrence count of each word. In Table 6 we note the top 5 words having the highest attention scores and compare them qualitatively across models.
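This aggregation can be sketched with the lime package as follows; the predict_proba function wrapping the trained classifier, and the reading of the occurrence count as the number of sentences in which a word appears among the top 5 features, are our assumptions:

```python
from collections import defaultdict
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["normal", "hate"])

def average_word_importance(test_texts, predict_proba):
    """Sum the LIME weights of the top 5 features per sentence, then average
    each word's total weight over the sentences where it appears in the top 5."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text in test_texts:
        exp = explainer.explain_instance(text, predict_proba, num_features=5)
        for word, weight in exp.as_list():
            totals[word] += weight
            counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}
```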

Table 6. Interpretations of the model outcomes.

While comparing the models’ interpretability in Table 6, we see that LASER + LR focuses more on the hateful keywords compared to mBERT, i.e., words like ‘pigs’. mBERT seems to search for some context around the hate keywords, as shown in Table 7. Models dependent on keywords can be useful in a highly toxic environment such as GABFootnote 25, since most of the derogatory keywords typically occur very close to, or at least simultaneously with, the hate target, e.g., the first case in Table 1. On sites which are less toxic, like Twitter, complex methods that attend to the context, like mBERT, might be more helpful, e.g., the third case in Table 1.

Table 7. Examples showing the word with the highest predictive score for both mBERT and LASER + LR.
Table 8. Various types of errors (E) for the models (M): mBERT and LASER + LR. The ground truth (GT) and prediction (P) consist of 0 (Non-Hate)/1 (Hate) labels.

6.2 Error Analysis

In order to delve further into the models, we conduct an error analysisFootnote 26 on both the mBERT and LASER + LR models using a sample of wrongly classified posts from the test set. We analyze the common errors and categorize them into the following four types:

  1. Wrong classification due to annotation dilemma (AD): These error cases occur for ambiguous instances where, in our view, the model predicts correctly but the annotators have labelled the post incorrectly.

  2. Wrong classification due to confounding factors (CF): These error cases occur when the model predictions rely on irrelevant features such as the normalized form of mentions (@user) and links (URL) in the text.

  3. Wrong classification due to hidden context (HC): These error cases occur when the model fails to capture the context of the post.

  4. Wrong classification due to abusive words (AW): These error cases are caused by the over-dependence of the model on abusive words.

Table 8 shows the errors of the mBERT and LASER + LR models. For mBERT, the first example has no specific indication of being hate speech and is considered an error on the part of the annotators. In the second example, the author of the post actually wants the reader not to use the abusive terms, i.e., sl*t and wh*re (found using LIME), but the model picks them up as indicators of hate speech. The third example mentions the term “parasite” as a derogatory remark toward refugees, and the model does not capture this.

For the LASER + LR model, the first example is an error on the part of the annotators. In the second case, the model captures the word “USER” (found using LIME), a confounding factor which affects the model’s prediction. In the third case, the author says (s)he will leave before homosexuality gets normalized, which shows his/her hatred toward the LGBT community, but the model is unable to capture this. In the last case, the model predicts hate speech based on the word “retarded” (found using LIME), which should not be the case.

7 Conclusion

In this paper, we perform the first large-scale analysis of multilingual hate speech. Using 16 datasets from 9 languages, we use deep learning models to develop classifiers for multilingual hate speech classification. We perform many experiments under various conditions (low and high resource, monolingual and multilingual settings) for a variety of languages. Overall, we see that for low-resource settings, LASER + LR is more effective, while for high-resource settings, BERT-based models are more effective. We finally suggest a catalogue which we believe will be beneficial for future research in multilingual hate speech detection. Our code (https://github.com/punyajoy/DE-LIMIT) and models (https://huggingface.co/Hate-speech-CNERG) are available online for other researchers to use.