1 Introduction

With the development of social media, harmful content has ever more opportunities to spread. Social networks and forums allow users to express their opinions freely and anonymously, which is an undoubted advantage and achievement of social media. Nevertheless, this freedom also gives many opportunities to harm the psychological health and mental state of people (MacAvaney et al. 2019). Due to the wide use of social media and the considerable number of texts contained therein, natural language processing tools play a critical role in reducing the spread of harmful speech. Researchers distinguish several types of harmful content (Mandl et al. 2019). For example, hate speech ascribes negative attributes to individuals because they are members of a particular group, while offensive speech contains degrading and dehumanizing language, insults an individual, threatens with violent utterances, and so on. This variety makes the implementation of harmful content detection algorithms challenging.

There is a large volume of published studies on harmful content detection (Alrehili 2019; Schmidt and Wiegand 2019; Yin and Zubiaga 2021). Most studies indicate the importance of text preprocessing for the performance of classifiers in this task. As in other natural language processing tasks (Symeonidis et al. 2018; Kadhim 2018), ineffective text preprocessing during hate and offensive speech detection confuses machine learning algorithms and degrades their performance.

In this work, we perform a large-scale evaluation of text preprocessing techniques on Twitter datasets for hate and offensive speech detection. To our knowledge, we present the most comprehensive investigation so far of tweet preprocessing for this task. We separately evaluate 26 common techniques, which we divide into eight types. We perform our experiments on four harmful content detection benchmarks. We use six approaches to text classification: three of them are traditional (logistic regression, random forest, and linear support vector classifier), while the others are based on deep learning paradigms (convolutional neural networks, bidirectional encoder representations from transformers (BERT), and RoBERTa). In particular, we are the first to carry out an extensive comparison of text preprocessing techniques for Twitter texts using two transformer-based models. Since transformers are currently widely used for this task and demonstrate state-of-the-art results on many benchmarks, these results are important for social network researchers and machine learning specialists. Our results show that some preprocessing techniques can increase the model's performance, while others decrease the scores. We also demonstrate that the efficiency of text preprocessing depends on the selected approach to text classification and the characteristics of the dataset. Thus, selecting a classification model and formatting the dataset are crucial steps for hate and offensive speech detection. We identify effective techniques separately for each dataset and model. Then we experiment with two ways to combine techniques.

The main contributions of the paper can be summarized as follows: (a) numerous preprocessing techniques were evaluated and analyzed in terms of their effectiveness on several Twitter datasets and machine learning models; (b) two strategies for combining techniques were investigated; (c) it was shown that the choice of preprocessing techniques affects the classification performance, while the characteristics of the dataset are also very important and often play a key role in the effectiveness of preprocessing. The paper is organized as follows. Section 2 presents a brief exploration of related work. Section 3 describes the considered text preprocessing techniques, utilized datasets, models, and evaluation metrics. Section 4 reports and discusses the results for the separate use of techniques and for technique combinations. Section 5 describes the limitations of the study. Section 6 concludes the paper.

2 Related work

The impact of text preprocessing on the task of hate and offensive speech detection has been widely discussed. Previous studies investigated the impact of different preprocessing techniques and attempted to draw general conclusions about their contribution to text classification results.

To date, several competitions in hate and offensive speech detection have been held as part of major workshops on natural language processing. The organizers of these competitions presented overviews that generalize the results obtained by participants, including issues related to the utilized text preprocessing techniques. For instance, during the Semeval-2019 shared task on multilingual detection of hate speech against immigrants and women in Twitter (Basile et al. 2019), most of the submitted systems adopted traditional preprocessing techniques, such as tokenization, lowercasing, and the removal of stopwords, URLs, and punctuation. Some participants investigated Twitter-driven preprocessing procedures such as splitting hashtags into separate words, converting slang into correct English, and converting emoji into words. In particular, the authors of Montejo-Ráez (2019) converted all mentions to a common tag and tokenized hashtags. In Ameer et al. (2019), the texts were stemmed and cleaned of stopwords. The authors of Garain and Basu (2019) removed links, mentions, and redundant spaces.

During the Semeval-2020, the shared task on multilingual offensive language identification in social media was conducted (Zampieri et al. 2020). Most teams performed some kind of preprocessing or text normalization. The most common preprocessing techniques were converting emoji to plain text, segmenting hashtags, expanding abbreviations, replacing profane words, correcting errors, lowercasing, stemming, and/or lemmatizing. Other techniques included the removal of users' mentions, URLs, hashtags, emoji, special characters, and/or stopwords. The winner of Task A (Offensive Language Detection) (Wiedemann et al. 2020) used a RoBERTa-based model (Liu et al. 2019). The winning solution of Task B (Categorization of Offensive Language) and Task C (Offensive Language Target Identification) (Wang et al. 2020) represented a multilingual method using pretrained language models: ERNIE (Zhang et al. 2019) and XLM-R (Conneau et al. 2020). Neither winning team specified any preprocessing techniques in its paper.

The shared tasks on hate and offensive speech detection for English were also conducted as part of the HASOC competitions in 2019–2021 (Mandl et al. 2019, 2020; Modha et al. 2021). The winner of HASOC2019 (Wang et al. 2019) proposed an LSTM-based approach (Hochreiter and Schmidhuber 1997) and used the following preprocessing scheme: the words were retained for hashtags, username mentions were tokenized, all contractions were split into two tokens, and emoji were replaced with the corresponding words using emotion lexicons. In 2020, the first-place solution (Mishra et al. 2020) was based on an LSTM that used GloVe embeddings (Pennington et al. 2014) as input. Initially, the texts were converted into lowercase; then all punctuation marks were removed from the texts. In 2021, the winner of the task did not submit a system description paper. The second- and third-place solutions (Bölücü and Canbay 2021; Glazkova et al. 2021) were based on a graph convolutional network (GCN) (Wang et al. 2020; Liao et al. 2021) and Twitter-RoBERTa (Barbieri et al. 2020), respectively. In the first case, the authors used tokenization of hashtags, removed repeated characters, and preprocessed emphasis and censored words. In the second case, all users' mentions were replaced with a special placeholder, and the URLs were removed.

Other recent shared tasks related to hate and offensive speech detection in English posts include shared tasks on toxic span detection (Pavlopoulos et al. 2021), identification of hate and offensive speech in code-mixed postings (Modha et al. 2022), online sexism detection (Kirk et al. 2023), and identification of hate speech in multimodal content (Thapa et al. 2023). These shared tasks are summarized in Table 1.

Table 1 Summary of recent shared tasks aimed at hate and offensive speech detection in English

In addition to the listed studies, some authors compared text preprocessing techniques for the task of hate and offensive speech detection. A large-scale study comparing text preprocessing techniques for hate speech detection on Twitter was presented in Naseem et al. (2021). The authors compared twelve preprocessing techniques for both traditional and deep learning classifiers. Deep learning approaches included convolutional neural networks, LSTM, and BiLSTM. The authors recommended a combination of preprocessing techniques based on experiments on three datasets. The best-performing techniques were lemmatization and lowercasing of words, while the worst-performing techniques were removing punctuation, URLs, users' mentions, and hashtag symbols. Results varied with different learning algorithms, which confirmed that choosing a suitable learning algorithm is a considerable factor in text classification performance. The authors stressed the importance of investigating various combinations of preprocessing techniques and their interactions. However, in that study, some similar techniques were evaluated together; for example, the removal of URLs, hashtags, and mentions was performed simultaneously.

Since harmful content detection and sentiment analysis are closely related tasks (Zhou et al. 2021; Plaza-Del-Arco et al. 2021), we also reviewed research evaluating the effectiveness of text preprocessing techniques for sentiment analysis of tweets. In Angiani et al. (2016), the authors compared several preprocessing techniques (stemming, removal of stopwords, processing of emoticons) utilizing the naive Bayes classifier. They achieved an improvement over the baseline result through the use of stemming. A comparison of 16 preprocessing techniques across four traditional machine learning algorithms on two datasets was presented in Symeonidis et al. (2018). Lemmatization, removing numbers, and replacing contractions improved the performance of classifiers, while other techniques did not. The authors of Alam and Yao (2019) showed that the accuracy of some traditional machine learning algorithms can be significantly improved after applying several preprocessing steps: removing emoticons and stopwords, and stemming. In Ramachandran and Parvathi (2019), the authors showed a positive effect of stopword removal on the performance of the naive Bayes classifier. A summary of the listed studies is provided in Table 2. Research on the effectiveness of preprocessing techniques for sentiment analysis is not limited to Twitter texts. A number of studies investigated features for sentiment analysis of spam reviews (Saeed et al. 2018, 2020, 2021, 2022), messages from StockTwits (Renault 2020), news (Štrimaitis et al. 2021; Dogru et al. 2021; Oliveira and Merschmann 2021), etc.

Table 2 Summary of studies performing an evaluation of preprocessing techniques for hate speech detection and sentiment analysis for Twitter datasets

Text preprocessing is an important step in creating classification models. Currently, a large number of studies on hate and offensive speech detection have been performed. In addition, several works were devoted to the comparison of text preprocessing techniques for this task. Existing research on tweet preprocessing techniques for hate and offensive speech detection mostly evaluates the effectiveness of technique combinations or a limited set of technique types. This study aims to overcome this research gap. We perform a large-scale comparison of separate preprocessing techniques on several hate and offensive speech detection benchmarks. In addition, we are the first to conduct such a large-scale analysis for two transformer-based models. We try combinations of individually effective techniques, as well as experiment with two ways of generalizing techniques into combinations.

3 Methods

3.1 Preprocessing techniques

In this study, 26 commonly used preprocessing techniques were evaluated. All considered techniques were divided into the following types:

  • basic techniques, such as converting to lowercase, lemmatizing, stemming, and removing special characters;

  • handling of digits (removing, tokenizing, and converting to words);

  • handling of URLs (removing and tokenizing);

  • handling of mentions (removing and tokenizing);

  • handling of emoji and emoticons (removing, tokenizing, and converting to textual description);

  • handling of hashtags (removing, tokenizing, and segmenting into separate words);

  • lexical transformations, i.e., removing stopwords, decontraction, replacing acronyms, and tokenizing profane lexicon;

  • corrections, including spelling correction and removing repeated letters.

Sections 3.1.1 through 3.1.8 contain detailed descriptions of each considered type. For better presentation, the correspondence between the preprocessing techniques utilized in this work and their sequential numbers is listed in Table 3. To implement the preprocessing techniques, the following libraries were used: tweet-preprocessor, NLTK (Bird 2006), num2words, emoji, pyspellchecker, and better_profanity. Stemming was performed using the Snowball stemmer (Porter 2001).

Table 3 Preprocessing techniques

3.1.1 Basic techniques

The most common techniques used for preprocessing all types of texts are translating to lowercase (1), normalization (lemmatization (2) or stemming (3)), and removing special characters (4). These techniques were used to train baseline models. In Sect. 4.1, we experiment with the successive exclusion of basic techniques from the baseline.

3.1.2 Handling of digits

Common approaches to preprocessing digits in text are removing digits (5), tokenizing digits (6), and converting them to words (7). Tokenization means replacing digits with special tokens. For example, consider the following original tweet:

[Figure omitted: an example tweet shown before and after removing, tokenizing, and converting digits.]
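A minimal sketch of these three transformations, assuming the num2words package cited above; the $NUMBER$ token is our illustrative choice, not a value fixed by the paper:

```python
import re
from num2words import num2words

def remove_digits(text):
    # Technique 5: drop every digit sequence.
    return re.sub(r"\d+", "", text)

def tokenize_digits(text, token="$NUMBER$"):
    # Technique 6: replace every digit sequence with a special token.
    return re.sub(r"\d+", token, text)

def digits_to_words(text):
    # Technique 7: spell out each digit sequence as English words.
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(digits_to_words("party with 3 friends and 12 pizzas"))
# -> "party with three friends and twelve pizzas"
```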

3.1.3 Handling of URLs

We explored two popular techniques for preprocessing URLs: removing URLs (8) and tokenizing URLs (9). For instance, consider the following tweet:

[Figure omitted: an example tweet shown before and after removing and tokenizing URLs.]

Preprocessing of URLs is one of the most common steps for tweet preparation (Banerjee et al. 2021; Menini et al. 2021).
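As a sketch, both techniques can be implemented with the tweet-preprocessor library listed in Sect. 3.1; the $URL$ token below is that library's default:

```python
import preprocessor as p  # the tweet-preprocessor package

tweet = "check out this article https://t.co/abc123"

p.set_options(p.OPT.URL)   # restrict processing to URLs only
print(p.clean(tweet))      # removing URLs (8):   "check out this article"
print(p.tokenize(tweet))   # tokenizing URLs (9): "check out this article $URL$"
```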

3.1.4 Handling of mentions

For user mentions, we used the same techniques as for the previous class, i.e., removing mentions (10) and tokenizing mentions (11). A tweet contains mentions when it includes another person’s username anywhere in its text. User mentions start with the “@” symbol. For example, the tweet:

#preprocessing is a crucial part of @ML projects.

will be transformed after removing mentions to:

#preprocessing is a crucial part of projects.

and after tokenizing mentions to:

#preprocessing is a crucial part of $MENTION$ projects.

Researchers often use preprocessing of mentions for tweet analysis, e.g., in Glazkova et al. (2021) and Banerjee et al. (2021).

3.1.5 Handling of emoji and emoticons

Emoticons are built from ordinary punctuation marks on a standard computer keyboard to represent a face with a particular expression, while emoji are graphic symbols with predefined names and codes (Bai et al. 2019). Here we evaluated the following techniques: removing emoji (12), tokenizing emoji (13), converting emoji to textual description (14), removing emoticons (15), tokenizing emoticons (16), and converting emoticons to textual description (17). The list of emoticons is given in “Appendix A.”

For example, consider the following tweet:

[Figure omitted: an example tweet shown before and after removing, tokenizing, and converting emoji and emoticons.]

Many researchers, in particular, the authors of Alshalan and Al-Khalifa (2020) and Ranasinghe and Hettiarachchi (2020), utilized preprocessing emoji and emoticons during hate and offensive speech detection.
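For emoji, a minimal sketch using the emoji package cited in Sect. 3.1 (replace_emoji requires emoji>=2.0); emoticon handling can follow the same pattern with a lookup table such as the one in “Appendix A”:

```python
import emoji

tweet = "i get to see my daddy today 😍"

print(emoji.replace_emoji(tweet, replace=""))         # removing emoji (12)
print(emoji.replace_emoji(tweet, replace="$EMOJI$"))  # tokenizing emoji (13)
print(emoji.demojize(tweet, delimiters=(" ", " ")))   # converting to words (14)
```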

3.1.6 Handling of hashtags

We considered the following techniques for preprocessing hashtags: removing hashtags (18), tokenizing hashtags (19), and hashtag segmentation (20). Hashtag segmentation is the technique that breaks a hashtag into its constituent tokens and removes the “#” symbol (Kodali et al. 2022). To split hashtags, we utilized an implementation of the maximum matching algorithm, i.e., given a string s, get all possible segmentations of s into dictionary words, then return the “longest” segmentation (Reuter et al. 2016). This implementation utilizes an English vocabulary from NLTK.

For example, the tweet:

i get to see my daddy today!! #80days #gettingfed

will be transformed to:

i get to see my daddy today!! (removing hashtags);

i get to see my daddy today!! $HASHTAG$ $HASHTAG$ (tokenizing hashtags);

i get to see my daddy today!! 80 days getting fed (hashtag segmentation).

Previous studies widely used hashtag preprocessing for hate and offensive speech detection, for example in Caselli et al. (2020) and Toraman et al. (2022).
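A sketch of one common greedy variant of maximum matching over the NLTK word list; a full implementation would also handle digits and casing (e.g., “#80days” → “80 days”):

```python
import nltk
nltk.download("words", quiet=True)
from nltk.corpus import words

VOCAB = set(w.lower() for w in words.words())

def segment(s):
    # Greedily match the longest vocabulary word at the start of s,
    # then recurse on the rest; return None if no segmentation exists.
    if not s:
        return []
    for i in range(len(s), 0, -1):
        if s[:i] in VOCAB:
            tail = segment(s[i:])
            if tail is not None:
                return [s[:i]] + tail
    return None

print(segment("greatday"))  # e.g., ['great', 'day']
```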

3.1.7 Lexical transformations

Here we considered several techniques related to the preprocessing of vocabulary. The first technique in this class is removing stopwords (21). Stopwords are extremely common words that are often excluded from texts because of their high prevalence, such as the, on, that, and of. This technique is widely used for tweet preprocessing, for example in Alshalan and Al-Khalifa (2020) and Das et al. (2021). The second technique in this class is decontraction (22), which replaces shortened combinations of functional words with their full analogues, e.g., I’ll → I will. In particular, decontraction was used in Naseem et al. (2019) for hate speech detection. The next technique is similar to the previous one, but we replace common online acronyms (23) instead of functional words, for example, gr8 → great. The lists of contractions and acronyms are given in “Appendix A.” The last technique in this class is tokenizing profane lexicon (24), i.e., replacing profane words with a special token using the better_profanity package.
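A sketch of techniques 22–24; the contraction and acronym tables below are tiny illustrative stand-ins for the full lists in “Appendix A,” and the token name profanity is our own choice:

```python
import re
from better_profanity import profanity

CONTRACTIONS = {"i'll": "i will", "can't": "cannot"}  # decontraction (22)
ACRONYMS = {"gr8": "great", "idk": "i do not know"}   # replacing acronyms (23)

def replace_from_table(text, table):
    # Replace each whole-word occurrence of a short form with its full form.
    for short, full in table.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.I)
    return text

def tokenize_profanity(text, token="profanity"):
    # Technique 24: censor profane words with better_profanity, then
    # collapse each run of censor characters into a single token.
    return re.sub(r"\*+", token, profanity.censor(text))

print(replace_from_table("i'll be gr8", {**CONTRACTIONS, **ACRONYMS}))
# -> "i will be great"
```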

3.1.8 Corrections

The last class contains two techniques related to correcting and normalizing spelling. The first technique is spelling correction (25) using a Levenshtein distance algorithm and the pyspellchecker package. The second technique is correcting words with repeated letters (26), for instance, yeeeeees → yes. These techniques are commonly utilized in tweet analysis, for example in Mohammad (2018) and Hu et al. (2020).
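A sketch of both corrections with pyspellchecker; applying repetition removal before spell-checking is our assumption about the ordering, not the paper's exact pipeline:

```python
import re
from spellchecker import SpellChecker

spell = SpellChecker()

def remove_repetitions(text):
    # Technique 26: collapse runs of three or more repeated letters,
    # e.g., "yeeeeees" -> "yes".
    return re.sub(r"(\w)\1{2,}", r"\1", text)

def correct_spelling(text):
    # Technique 25: Levenshtein-based correction of each token;
    # fall back to the original word if no candidate is found.
    return " ".join(spell.correction(w) or w for w in text.split())

print(correct_spelling(remove_repetitions("yeeeeees it wooorks")))
# -> "yes it works"
```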

3.2 Datasets

To date, several datasets for hate and offensive speech detection on Twitter have been published. Most of them consist of tweets manually labeled as hate/offensive and neutral. In this work, the task of binary classification was considered since this formulation of the task is the most common (Fortuna and Nunes 2018; Poletto et al. 2021). The following datasets were utilized:

  • HateBase (Davidson et al. 2017), the dataset contains tweets of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech (neutral). In this work, the dataset was divided into two subsets. The first comprises hate and neutral tweets (hb_hate). The second consists of offensive and neutral tweets (hb_off). This division allows us to compare the preprocessing techniques that are effective for hate and offensive speech detection, respectively.

  • HASOC2020 (hasoc) (Mandl et al. 2020), the corpus has two labels: neutral tweets and tweets containing hate, offensive, or profane content.

  • OLID (olid) (Zampieri et al. 2019), this dataset contains examples of two classes, namely offensive and not offensive. The first class includes tweets containing inappropriate language, insults, or threats. The texts from the second class are neither offensive nor profane.

  • HatEval (heval) (Basile et al. 2019), the corpus was collected for detecting hateful content in social media texts, specifically in Twitter’s posts, against two targets: immigrants and women. It consists of neutral and hateful tweets.

Most hate and offensive speech datasets are sampled by crawling social media platforms using keywords considered relevant for harmful content, scraping hashtags, or extracting timelines of harmful content spreaders. All these methods may introduce a bias because they might limit the collection to topics and words that are remembered (Mandl et al. 2020). Empirical studies (Wiegand et al. 2019; Davidson et al. 2019) have pointed out that these approaches may lead to bias. The following approaches were used to collect the datasets listed above. The authors of HateBase utilized a lexicon of words and phrases identified by internet users as hate speech, compiled by Hatebase.org. They searched for tweets containing terms from the lexicon and extracted the timeline of each communicant. From this corpus, the authors took a random sample of tweets containing terms from the lexicon and had them manually coded by CrowdFlower workers. To obtain potentially hateful tweets for HASOC2020, a support vector classifier was trained on the OLID and HASOC2019 (Mandl et al. 2019) datasets. The authors of the dataset considered all the tweets that were classified as hateful by this weak classifier and randomly added five percent of the tweets that were not classified as hateful. This set of English tweets was then manually annotated. For OLID, the crowd-sourcing platform Figure Eight was used for annotation. For HatEval, the data were collected using different gathering strategies: (1) monitoring potential victims of hate accounts, (2) downloading the history of identified haters, and (3) filtering Twitter streams with keywords, i.e., words, hashtags, and stems. Most of the training set of tweets against women was derived from an earlier collection carried out in the context of two previous challenges on misogyny identification (Fersini et al. 2018a, b).

The data statistics are summarized in Table 4. The average length is calculated in terms of the count of tokens determined using NLTK (Bird 2006). The users’ mentions and URLs are represented in the OLID dataset in a unified manner. Therefore, we have not applied tokenizing URLs (technique 9) and tokenizing mentions (11) to this dataset.

Table 4 Data statistics

3.3 Models

To provide a comprehensive comparison of preprocessing techniques, we chose six classification methods: three traditional supervised methods and three deep learning methods.

  • Logistic Regression (LR), a supervised machine learning method that analyzes the relationship between the variables and classifies data into discrete classes. In Oriola et al. (2020), LR was used for detecting offensive and hate speech in South African tweets. Some researchers (Ashraf et al. 2022; Huang et al. 2020; Silva et al. 2020) also used LR as a baseline.

  • Random Forest (RF), an ensemble supervised method that constructs a set of decision trees at training time. The authors of Alfina et al. (2017) and Nugroho et al. (2019) used RF for detecting hate speech in social media.

  • Linear Support Vector Classifier (LSVC), a support vector machine model that applies a linear kernel function to perform classification. LSVC was applied for hate and offensive speech detection in Balouchzahi and Shashirekha (2020) and Fromknecht and Palmer (2020).

  • Convolutional Neural Network (CNN), an artificial neural network composed of multiple building blocks, such as convolution layers, pooling layers, and dense layers (Kim 2014). Various studies have assessed the efficacy of CNN for hate and offensive speech detection (Badjatiya et al. 2017; Alshalan and Al-Khalifa 2020; Zhou et al. 2020).

  • Bidirectional Encoder Representations from Transformers (BERT), a model pretrained on the English language using a masked language modeling objective (Devlin et al. 2019). This model has achieved great success in many natural language processing tasks, including text classification and hate speech detection (for example, Caselli et al. 2021; Li et al. 2021).

  • Robustly optimized BERT approach (RoBERTa), the model has the same architecture as BERT, but uses a byte-level byte pair encoding as a tokenizer and a different pretraining scheme (Liu et al. 2019). Many researchers utilized RoBERTa for hate speech detection, in particular, in Glazkova et al. (2021), Wiedemann et al. (2020), and Alonso et al. (2020).

The set of models was formed to provide methodological diversity through the different approaches that underlie them (statistical models, support vector machines, decision trees, neural networks). The selected traditional machine learning models showed high performance in comparisons of algorithms in related works (in particular, in Krouska et al. (2016) and Jianqiang and Xiaolin (2017)). CNN, BERT, and RoBERTa are widely used solutions for text classification. For example, most of the models presented in Zampieri et al. (2020) were based on BERT- or RoBERTa-style transformers. Among the classic neural network architectures, CNN was the most popular.

To train LR, RF, and LSVC, a bag-of-words model with 10,000 features was built. LR, RF, and LSVC were implemented using Scikit-Learn (Pedregosa et al. 2011) with the default settings. The parameters of the traditional classifiers are listed in “Appendix B.”
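A minimal sketch of this traditional setup; the placeholder tweets and labels, and the choice of CountVectorizer for the bag-of-words representation, are our assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["you are great", "you are awful"]  # placeholder data
labels = [0, 1]                              # 0: neutral, 1: harmful

clf = make_pipeline(
    CountVectorizer(max_features=10_000),  # bag-of-words, 10,000 features
    LogisticRegression(),                  # Scikit-Learn defaults
)
clf.fit(tweets, labels)
print(clf.predict(["you are awful"]))  # -> [1]
```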

CNN was implemented using Keras (https://github.com/fchollet/keras) with a batch size of 256, 256 filters, and a weight decay of 1e−4. The model plot is shown in “Appendix C” (Fig. 3). The parameters of CNN were selected using grid search on the HASOC2020 dataset with baseline preprocessing: translating to lowercase, lemmatization, and removing special characters. The range of the parameters for grid search is presented in “Appendix C” (Table 14). CNN was trained for up to 100 epochs; however, the actual training time was considerably shorter since strict early stopping was used.

The CNN was trained using FastText word vectors constructed on texts from Wikipedia 2017, the UMBC WebBase corpus, and the statmt.org news dataset (Joulin et al. 2016). The FastText vectors have 300 dimensions. We chose FastText word vectors as they are a widespread approach to generating word representations for different text mining tasks, especially sentiment analysis and hate speech detection. The authors of Kaibi and Satori (2019) showed the effectiveness of FastText representations compared to Word2Vec (Mikolov et al. 2013) and GloVe (Pennington et al. 2014) on Twitter datasets for sentiment analysis using six machine learning algorithms.
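A sketch of such a CNN in Keras with the reported batch size, filter count, and weight decay; the kernel size, dense layer width, sequence length, and the mapping of weight decay to an L2 kernel regularizer are illustrative assumptions, and the embedding matrix is assumed to be filled with the 300-dimensional FastText vectors elsewhere:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20_000, 50, 300
# Placeholder: each row should hold the FastText vector of one vocabulary word.
embedding_matrix = np.zeros((VOCAB_SIZE, EMB_DIM))

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(
        VOCAB_SIZE, EMB_DIM,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False),
    layers.Conv1D(256, 3, activation="relu",                 # 256 filters
                  kernel_regularizer=regularizers.l2(1e-4)),  # weight decay
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary: harmful vs. neutral
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=256, epochs=100,
#           callbacks=[keras.callbacks.EarlyStopping(patience=1)])
```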

To implement BERT and RoBERTa, we used BERT-base-uncased and RoBERTa-base, respectively, as well as Simple Transformers (Rajapakse 2019) and PyTorch (Paszke et al. 2019). Each model was fine-tuned for two epochs with a maximum sequence length of 128, a learning rate of 4e−5, and a training batch size of 8. We used the AdamW optimizer (Loshchilov and Hutter 2018) with an epsilon of 1e−8 and betas of (0.9, 0.999) for computing running averages of the gradient and its square.
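A sketch of this fine-tuning setup with Simple Transformers; the two-row DataFrame is a placeholder for the real training folds:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame({"text": ["you are great", "you are awful"],
                         "labels": [0, 1]})

model = ClassificationModel(
    "roberta", "roberta-base",
    args={"num_train_epochs": 2,
          "max_seq_length": 128,
          "learning_rate": 4e-5,
          "train_batch_size": 8,
          "adam_epsilon": 1e-8},
    use_cuda=False,  # set True on a GPU machine
)
model.train_model(train_df)
```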

3.4 Evaluation metrics

Since we used unbalanced datasets in this study, the results were evaluated in terms of the weighted-average F1-score (F1). The weighted-average F1 is calculated as the mean of the per-class F1 scores weighted by the number of true instances of each label.
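Formally, for \(K\) classes, where class \(i\) has \(n_i\) true instances and \(N = \sum_{i=1}^{K} n_i\), the weighted-average F1 is

\(F1_{\mathrm{weighted}} = \sum_{i=1}^{K} \frac{n_i}{N}\, F1_i.\)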

To obtain more reliable results, we performed fivefold cross-validation for the traditional supervised methods (LR, RF, and LSVC) and threefold cross-validation for the neural models. We used a smaller number of folds for the neural models due to computational constraints. However, the division into folds was performed with a fixed random seed; thus, for a given model, the techniques were compared on the same test data. All metrics are calculated as mean values over all folds. The values of F1 are presented in the body of the paper, while the values of standard deviation across the folds are shown in “Appendix D.”
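A sketch of this evaluation protocol; the seed value, the use of StratifiedKFold, and the make_model factory are our assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, make_model, n_splits=5, seed=42):
    # X, y: NumPy arrays; make_model: factory returning a fresh classifier.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, average="weighted"))
    return np.mean(scores), np.std(scores)  # mean and std of F1 across folds
```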

3.5 Baselines

We used the following preprocessing for baselines:

  • Traditional models (LR, RF, and LSVC) and CNN: translating to lowercase, lemmatization, and removing special characters. Lemmatization was performed using NLTK WordNet Lemmatizer;

  • BERT: lowercase;

  • RoBERTa: raw text.

In the next section, we evaluate adding and removing basic techniques compared to baseline preprocessing.

4 Results and discussion

4.1 Evaluation of basic techniques

Table 5 reports the effect of adding and removing basic techniques relative to the baseline preprocessing in terms of F1-score. The table also indicates the values of standard deviation across the folds. For example, according to Sect. 3.5, LR used the following baseline preprocessing: translating to lowercase, lemmatization, and removing special characters. LR baseline − (1) indicates the LR model with baseline preprocessing, with the exception of converting to lowercase. LR baseline − (2) is the LR model with baseline preprocessing but without lemmatization. LR baseline + (3) indicates LR with baseline preprocessing but using stemming instead of lemmatization. LR baseline − (4) is the LR model with translating to lowercase and lemmatization but without removing special characters. The values that outperform the baselines are shown in bold.

Table 5 Basic techniques, F1 (%). The indices of the techniques are listed in Table 3

The table demonstrates that lemmatization, translating to lowercase, and removing special characters are helpful for traditional machine learning models. Stemming can also help to increase the performance of traditional machine learning models, and for most datasets, it works better than lemmatization.

For CNN, the effect of lemmatization and translation to lowercase is not evident. In general, stemming performs worse than lemmatization for CNN. Removing special characters increases the performance of CNN for most corpora.

As for transformer-based models, the effect of lowercasing expectedly depends on what text preprocessing was used for model pretraining. Since we used BERT-base-uncased and RoBERTa-base (a cased model), BERT performs better with lowercasing, while RoBERTa in most cases shows higher results when lowercasing is not used. Lemmatization does not significantly affect the performance. Removing special characters and stemming generally do not increase the results for either model.

4.2 Separate evaluation for techniques

We estimated the performance of each model on all corpora, adding one preprocessing technique at a time to the baseline model. The evaluation results in terms of the weighted-average F1-score are presented in Tables 6 and 7. The first rows of Tables 6 and 7 show the performance of the baseline models, i.e., the models trained on data preprocessed by the techniques specified in Sect. 3.5. The following rows demonstrate the performance of the models trained on texts preprocessed using the baseline techniques and one additional technique from the list (techniques 5–26 in Table 3). Thus, we evaluated the effectiveness of preprocessing techniques one at a time.

In Tables 6 and 7, the scores for tokenizing URLs (9) and tokenizing mentions (11) are missing for the OLID dataset because users' mentions and URLs already appear in tokenized form in this corpus. The scores for preprocessing emoji (12–14) are missing for HateBase since this dataset does not contain emoji. The values exceeding the baseline are shown in bold. The standard deviation values across the folds are given in “Appendix D.”

Table 6 Separate evaluation of techniques for logistic regression, random forest, and linear support vector classifier, F1 (%)
Table 7 Separate evaluation of techniques for Convolutional Neural Network, BERT, and RoBERTa (RB), F1 (%)

Based on the overall results, we formed five categories depending on the performance of the techniques on different corpora (the dataset level, Table 8). For example, converting emoticons to words (17) and decontraction (22) increase the performance of baselines for most of the models under consideration (in four cases out of six) on HateBase (hb_hate). We have also visualized the effectiveness of preprocessing techniques for each model (the model level, Table 9). For instance, tokenizing mentions (11) and emoticons (16) improved the results of the BERT baseline on all the datasets used. In the next subsections, we discuss the performance of each type of preprocessing techniques.

Table 8 Effect of techniques per dataset
Table 9 Effect of techniques per model

4.2.1 Handling of digits

Removing digits (5) degrades the performance on most of the datasets and for most of the models. For RF, this technique worsens the scores on all the corpora. For LR, the performance increases for three out of five datasets. However, the increments are slight (0.01% on HASOC2020, 0.02% on OLID, and 0.26% on HatEval) and may therefore be due to chance.

Tokenizing digits (6) looks quite effective on HateBase (hb_off), OLID, and HatEval (four models out of six on each dataset). This technique shows higher scores in comparison with the baselines for three models, i.e., in half of the cases. Overall, the effect of tokenizing digits does not have a pronounced character in our experiments.

The results for converting digits to words (7) are broadly similar to the results of the previous technique. For OLID and HatEval, the scores increase for most of the models (five out of six for OLID and four for HatEval). For the other datasets, the reported scores are below the baselines. However, it is worth noting that converting digits to words in most cases improves the performance of transformer-based models (four out of five datasets for BERT and three for RoBERTa). For the other models, the scores worsen for the majority of corpora. This may be because pretrained language models are more focused on understanding words than numbers (Wallace et al. 2019; Rogers et al. 2020).

4.2.2 Handling of URLs

Removing URLs (8) shows an improvement for all models in the case of OLID. The URLs in OLID are replaced with a token URL, so in this case we compare the effect of removing and tokenizing URLs. For three datasets (HateBase (hb_hate), HASOC2020, and HatEval), there is a marked deterioration in performance for most models. Removing URLs also demonstrates ambiguous results at the model level: the results improve on the majority of datasets for three models (RF, BERT, and RoBERTa) and worsen for most of the corpora for the three other models (LR, LSVC, and CNN).

Tokenizing URLs (9) does not perform well. This technique reduces the scores in most cases for all datasets. The picture across models is similar: performance growth is observed only for BERT (on three out of five datasets).

4.2.3 Handling of mentions

Removing mentions (10) leads to a performance decrease for all the models on HateBase (hb_off). For HASOC2020, this technique does not show any improvement for the majority of models. For OLID, which contains tokenized mentions, the performance increases in most cases. Otherwise, there is no evident effect of removing mentions. The technique fails on all datasets with LSVC and in most cases with all types of neural networks; however, more often than not, removing mentions exceeds the baseline results for LR and RF.

Tokenizing mentions (11) worsens the results in most cases for all datasets. However, this technique leads to performance increases with BERT (on all datasets). The results for RoBERTa are ambiguous (an increase on HateBase (hb_hate) and HASOC2020, a decrease on HateBase (hb_off) and HatEval). For traditional machine learning methods, the technique does not beat the baselines in the majority of cases.

4.2.4 Handling of emoji and emoticons

The HateBase dataset does not contain emoji. Therefore, we do not evaluate the techniques related to the handling of emoji on this corpus.

Removing emoji (12) shows mixed results across the datasets (improvement using five models out of six on OLID, three models on HASOC2020, and two models on HatEval). It is successful for LR (performance growth on all datasets), RF, and BERT (improvement for two datasets out of three). For LSVC, CNN, and RoBERTa, the scores are lower than the baseline in most cases.

Tokenizing emoji (13) increases the performance for the majority of models on HASOC2020 and HatEval but worsens the results on OLID. This technique improves scores on all datasets using LR and RF. For BERT and RoBERTa, tokenizing emoji helps to beat the baseline for two datasets out of three. For LSVC and CNN, the results are decreased in two of three cases.

Converting emoji to words (14) generally does not have a significant effect on the performance of the classification. The technique demonstrates lower performance in most cases for all traditional approaches, CNN, and RoBERTa. For BERT, the performance is slightly improved on HASOC2020 and HatEval.

The proportion of tweets containing emoticons is not large. The largest proportion of emoticons is present in the HateBase dataset (2.11% for hb_hate and 1.5% for hb_off). For the rest of the datasets, the proportion of tweets with emoticons is less than one percent.

In our experiments, removing emoticons (15) shows an improvement in most cases for OLID and HatEval, produces mixed results for HASOC2020, and demonstrates a deterioration of the scores for HateBase. Despite the ambiguous results at the dataset level, this technique performs well for both transformers (improvement on three and four datasets for BERT and RoBERTa, respectively) and LR (on three datasets). For the three remaining methods, the removal of emoticons leads to a deterioration in most cases.

Tokenizing emoticons (16) does not have a strong influence on the results. For most datasets and models, the results are about equally often slightly improved or worsened. However, for BERT, the tokenization of emoticons causes a small increase across all datasets (from 0.02% to 0.66%).

Converting emoticons to words (17) shows an improvement for most or half of the models across all datasets. The technique also demonstrates an improvement of the scores in most cases for all models except RoBERTa, for which the results improve only in two cases out of five.

In general, the significance of emoticons as features for the task of hate and offensive speech detection looks higher than the significance of emoji. Apparently, this is because emoticons more often carry a pronounced sentiment, while emoji are often used to express a wider range of emotions, including fear, surprise, joy, and so on (Guibon et al. 2016).

4.2.5 Handling of hashtags

Removing hashtags (18) leads to ambiguous results among the datasets. The performance decreases on HateBase (hb_off) for all models and on HASOC2020 for most models. For OLID, the scores improve in most cases. For HateBase (hb_hate), the performance decreases for half of the models. At the model level, the results are also ambiguous, but in most cases this technique negatively influences the results.

Tokenizing hashtags (19) shows a deterioration for all models on the HateBase (hb_off) dataset. For the rest of the datasets, the technique does not have a significant impact on the results. At the model level, the tokenization of hashtags demonstrates a deterioration for all datasets when using LSVC and CNN. In general, the technique rather worsens the results.

Hashtag segmentation (20) also has a negative impact on the results for most models and datasets. The least deterioration is observed for HASOC2020 (a decrease in performance in 50% of cases). However, HASOC2020 contains only 6% of tweets with hashtags, which is the smallest proportion among all datasets. At the model level, the results improve for three datasets out of five using LR and LSVC. RoBERTa shows a deterioration across all datasets. For the other models, the technique mainly worsens the results.

In our experiments, the processing of hashtags shows a negative impact on the performance of the models for hate and offensive speech detection. This technique does not improve the results on any dataset. That is, hashtags generally matter as an important and indivisible part of the text, not as a set of meanings of the hashtag segments. However, for some traditional models (for example, LR and LSVC), an improvement can be obtained by hashtag segmentation.

4.2.6 Lexical transformations

Removing stopwords (21) demonstrates lower performance in most cases for all datasets. For HatEval, the results worsen across all models. This technique shows some of the worst results in our experiments. A slight improvement (on three of the five datasets) is obtained only for RF. For LR, LSVC, CNN, and RoBERTa, the results worsen on all datasets. This effect can be associated with the short length of tweets. Similar results were obtained in Baruah et al. (2019), where the removal of stopwords worsened the performance of BiLSTM. This technique also negatively affected the performance of decision trees and naive Bayes classifiers in Saeed et al. (2022) and of logistic regression and support vector classifiers in Garouani et al. (2021). The authors of Do et al. (2019) noted that it is not necessary to remove all stopwords to detect hate speech in social media texts because even a few stopwords affect the results.

Decontraction (22) and replacing acronyms (23) have no evident effect on the performance. Decontraction leads to improvement in most cases for HateBase (hb_hate), OLID, and HatEval. Replacing acronyms increases the scores for HateBase (hb_off) and HatEval in most cases and for HateBase (hb_hate) and OLID in half of the cases. Both techniques worsen the results on HASOC2020. Decontraction is also unhelpful for HateBase (hb_off). For each of these techniques, the performance of about half of the models generally improves, while the performance of the other models worsens.

Tokenizing profanity (24) shows poor results in our experiments. For four out of five datasets, the scores deteriorate for most models. For LR, LSVC, and RoBERTa, this technique leads to degradation on all datasets.

4.2.7 Corrections

The effectiveness of spelling correction (25) varies across datasets. For OLID and HatEval, it improves the results for most models, while for HASOC2020 the scores worsen across all models. The technique generally improves the results of LR (three datasets) and LSVC (four datasets), but for the other models, it leads to a worsening of scores for most datasets.

Removing repetitions (26) mostly improves the results on HateBase (hb_off) and HASOC2020, but it is not effective on the other datasets. The technique is impactful for BERT (three datasets) and RoBERTa (four datasets) but leads to a general performance degradation for the other models.

4.2.8 Summary for separate evaluation of preprocessing techniques

The choice of preprocessing techniques for hate and offensive speech detection in Twitter texts strongly depends on the specifics of the dataset and the applied model. However, our experiments on separate evaluation of preprocessing techniques allowed us to draw the following conclusions.

Mainly, the handling of digits has no positive effect on the classification performance. The impact of tokenizing digits and converting digits to words seems mixed. However, converting digits to words positively affects the performance of transformer-based models.

In our experiments, removing URLs performs better than their tokenization. This probably suggests that the presence of URLs is not associated with the presence of hate and offensive speech. For transformer-based models, removing URLs generally performs well. Based on our experiments, mentions should not be removed: it is important for hate and offensive speech detection that the tweet is addressed to someone. Tokenizing mentions works poorly for traditional methods; however, for transformers, tokenization generally improves the baseline results.

Removing and tokenizing emoji and emoticons produce no clear effect on the performance of text classification. In general, converting emoji to words worsens the results. On the contrary, by converting emoticons into words, we mainly achieve positive results. Probably, this is because emoticons usually express sentiment while emoji show a wide range of different feelings. The removal and tokenization of hashtags do not have a strong effect. We also assume that hashtags are important as complete semantic units because their segmentation worsens the results in our experiments.

As regards lexical transformations, profanity tokenization and removal of stopwords negatively impact the performance of hate and offensive speech detection in Twitter texts. Our experiments show that the meaning of profane words is important for classification, so we cannot replace them with tokens. Removing stopwords shows poor results since tweets are short and stopwords represent important semantic components of them. Further research might explore the impact of the proportion of removed stopwords. Decontraction and replacing acronyms demonstrate no benefit for hate and offensive speech detection. The results of spelling correction depend on the model used: this technique improves the results for the two traditional methods that use bag-of-words text representations (LR and LSVC). Removing repeated letters helps BERT and RoBERTa increase their scores by improving tokenization for these models.

In our experiments, we did not notice a consistent influence of the type of harmful content (hate or offensive) on the effectiveness of particular preprocessing techniques. However, the removal of URLs generally positively influenced the performance of offensive speech detection (HateBase (hb_off) and OLID), while for hate speech its influence was mainly negative. For both offensive speech datasets, tokenizing digits mostly positively affected the performance, while tokenizing hashtags, hashtag segmentation, removing stopwords, and tokenizing profanity had a negative effect. For the hate speech datasets (HateBase (hb_hate) and HatEval), converting emoticons to words and decontraction generally had a positive influence on the results, while removing digits, removing and tokenizing URLs, tokenizing mentions, hashtag segmentation, removing stopwords, and removing repetitions negatively affected the classification performance.

Table 10 Relative success of the technique (based on the average values of relative growths)

In Table 10, we ranked all preprocessing techniques based on the average values of relative growths. Relative growths were calculated as the value of the F1-score growth, expressed as a percentage, relative to the baseline model for each dataset. For example, the LR baseline model showed an F1-score of 93.02% on HateBase (hb_hate), and the same model with removing digits (5) achieved 92.83%. In this case, the relative growth is \((92.83-93.02)/93.02 \times 100 \approx -0.2\). Table 10 ranks techniques in accordance with the average values of relative growths for each pair “model - dataset” (column “All models”) and separately for each model (columns “LR,” “RF,” “LSVC,” “CNN,” “BERT,” and “RoBERTa”). For example, the “LR” column aggregates the relative growths over all datasets when using logistic regression. We can see that the greatest total relative growth, taking into account all the datasets, is obtained for stemming (3), and the lowest total relative growth is demonstrated for removing stopwords (21). We did not show the results for lowercase (1), lemmatization (2), and removing special characters (4), since these techniques were used for some baselines and not for others.

Decontraction (22), removing emoticons (15), replacing acronyms (23), removing hashtags (18), and converting emoticons to words (17) demonstrated the highest relative success across all models. However, the results varied depending on the model used. The top-5 effective techniques for LR were stemming (3), hashtag segmentation (20), decontraction (22), removing hashtags (18), and spelling correction (25). The best results for RF were achieved using tokenizing profanity (24), removing hashtags (18), stemming (3), tokenizing emoji (13), and replacing acronyms (23). For LSVC, the highest scores were obtained with stemming (3), decontraction (22), hashtag segmentation (20), tokenizing digits (6), and spelling correction (25). For all traditional methods, stemming showed a fairly high performance. The highest relative success for CNN was achieved using removing emoticons (15), converting emoticons to words (17), replacing acronyms (23), decontraction (22), and spelling correction (25). The top-5 techniques for BERT included decontraction (22), removing hashtags (18), removing emoji (12), tokenizing mentions (11), and converting digits to words (7). For RoBERTa, the most effective techniques were tokenizing hashtags (19), removing repetitions (26), removing emoticons (15), tokenizing emoji (13), and tokenizing digits (6). For both transformer-based models, tokenizing mentions (11) and emoji (13) and removing emoji (12), emoticons (15), and hashtags (18) were quite effective.

The values of the relative growths are also used to visualize the effectiveness of the techniques in Figs. 1 and 2. These figures show the performance increases or decreases obtained by each technique for each model. For instance, Fig. 1 demonstrates that the results of LR worsened on all datasets when removing stopwords (21). For tokenizing profanity (24), the results worsened for all datasets with the exception of HatEval.

Fig. 1 Stacked bar diagram for relative performance growth across traditional machine learning algorithms: LR, RF, and LSVC. The indices of the techniques are listed in Table 3

Fig. 2 Stacked bar diagram for relative performance growth across neural models: CNN, BERT, and RoBERTa. The indices of the techniques are listed in Table 3

4.3 Combinations of techniques

In this subsection, we attempt to combine the best techniques from the previous step. For this purpose, we use the following ways for technique combination.

  • Individual combination. For each model and dataset, we combine the techniques that outperformed the corresponding baseline during the separate evaluation. If the improvement is achieved using incompatible techniques (for example, removing and tokenizing digits), we choose the technique that has the greatest positive effect. If both techniques showed the same results (for example, techniques 16 and 17 for HateBase (hb_off)), we evaluate both of them.

  • Combination at the model level. For each model, we combine the techniques that lead to improvement on all, or on most, datasets (see Table 9). If the improvement is achieved using incompatible techniques, we choose the technique that shows an improvement for a larger number of datasets. In the case of an equal number of datasets, we take into account the average positive effect across all datasets.

The lists of the techniques included in technique combinations are presented in Table 11.
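Conceptually, each combination is applied as an ordered pipeline of text-to-text functions, as in the following sketch (the technique names in the comment are placeholders; the effect of the ordering itself is not studied here, see Sect. 5):

```python
def apply_combination(text, techniques):
    # Apply a chosen combination as an ordered pipeline;
    # each technique is a function mapping str -> str.
    for technique in techniques:
        text = technique(text)
    return text

# e.g., an individual combination for one model/dataset pair might be:
# apply_combination(tweet, [remove_urls, tokenize_mentions, remove_repetitions])
```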

Table 11 Technique combinations
Table 12 Results for technique combinations (F1-score, %)

The evaluation results for technique combinations are presented in Table 12. The scores that exceed the baselines are shown in bold. The asterisk (*) marks the overall best result for a specific model on a given dataset. Our results show that combining techniques that were successful in the separate evaluation does not always give the best effect. This is especially noticeable for the transformers, for which both technique combinations work worse than the corresponding baselines. The combination of successful preprocessing techniques works better for traditional methods. For HateBase (hb_hate), the individual combinations of preprocessing techniques show the highest in-dataset results for LR and RF. For HateBase (hb_off), both combinations improve the baseline results for RF. For HASOC2020, the individual combination increases the baseline scores for LR and CNN; for CNN, the combination at the model level also exceeds the baseline. The best result for the OLID dataset and RF was obtained using the corresponding combination at the model level. For HatEval, the individual combination shows the best results for RF and LSVC. The baseline result is also improved for LR (both combinations), RF and LSVC (combinations at the model level), and CNN (individual combination).

5 Limitations

The generalizability of the results is subject to certain limitations. For instance, messages on Twitter are restricted in length (originally 140 characters, 280 since late 2017). The Twitter language is shaped by this limitation and is usually unstructured and informal (Naseem et al. 2021). For texts of other genres and topics, the results can differ substantially from ours.

In this study, we did not aim to identify the best algorithm for detecting hate and offensive speech in Twitter messages. The research focused on the evaluation of preprocessing techniques using several machine learning approaches, as well as on the analysis and summarization of the results. For this reason, the approaches used in this study are common, standard ones. Further research might explore additional types of classifiers, such as the naive Bayes classifier or other types of neural networks (long short-term memory, generative adversarial networks, etc.).

During the evaluation of the combinations (Sect. 4.3), we did not evaluate the impact of the ordering of the techniques within a combination on the performance of the model. This issue can be explored in further research.

In this work, we evaluated the effectiveness of each technique in terms of the presence or absence of an increase in the weighted-average F1-score using several corpora and models. The results are presented as average values across folds, and the values of standard deviation are given additionally. A study of the statistical significance of the obtained values may also be a direction for further work.

6 Conclusion

This paper investigated the effect of 26 preprocessing techniques on tweet classification using four datasets for hate and offensive speech detection. Each preprocessing technique was evaluated using three traditional machine learning methods (logistic regression, random forest, and linear support vector classifier) and three deep learning methods (convolutional neural network, bidirectional encoder representations from transformers (BERT), and RoBERTa). We used a bag-of-words text representation for the traditional methods and FastText for CNN; both text representations are widely utilized for text classification. In this work, we separately evaluated the techniques and drew conclusions about their effectiveness for different datasets and methods. We examined a large number of techniques that had not previously been evaluated in a comparative study. We divided the preprocessing techniques into categories in accordance with their effectiveness at the model and dataset levels and ranked them based on their relative success across the datasets. We demonstrated that some techniques provided better classification results for some models, while others decreased the scores. Similar to previous research, our results showed that the effectiveness of text preprocessing techniques varies across datasets and methods. Therefore, the choice of the method for classifying texts of a particular dataset is a crucial step for hate and offensive speech detection.

Combining preprocessing techniques can also produce different results. We explored two ways to combine successful techniques. In general, technique combinations performed better for traditional methods than for deep learning methods. For transformer-based models, the results obtained using technique combinations were the worst among all models.

In future studies, we will investigate these techniques in different domains and explore other ways to construct effective combinations of preprocessing techniques. A further study could also assess higher-level preprocessing techniques, such as identifying synonyms and other semantic relations, as well as more detailed lexical features, such as detecting urban language and jargon. Another important issue for future research is testing these techniques on fine-grained datasets for hate and offensive speech detection.