1 Introduction

With the development of social media, harmful content has ever more opportunities to spread. Social networks and forums allow users to express their opinions freely and anonymously, which is an undoubted advantage and achievement of social media. Nevertheless, this freedom also gives many opportunities to harm the psychological health and mental state of people (MacAvaney et al. 2019). Due to the wide use of social media and the considerable number of texts contained therein, natural language processing tools play a critical role in reducing the spread of harmful speech. Researchers distinguish several types of harmful content (Mandl et al. 2019). For example, hate speech ascribes negative attributes to individuals because they are members of a particular group, while offensive speech contains degrading and dehumanizing language, insults an individual, threatens with violent utterances, and so on. This variety makes the implementation of harmful content detection algorithms challenging.

There is a large volume of published studies on harmful content detection (Alrehili 2019; Schmidt and Wiegand 2019; Yin and Zubiaga 2021). Most studies indicate the importance of text preprocessing for the performance of classifiers in this task. As in other natural language processing tasks (Symeonidis et al. 2018; Kadhim 2018), ineffective text preprocessing during hate and offensive speech detection confuses machine learning algorithms and degrades their performance.

In this work, we perform a large-scale evaluation of text preprocessing techniques on Twitter datasets for hate and offensive speech detection. To our knowledge, we present the most comprehensive investigation so far of tweet preprocessing for this task. We separately evaluate 26 common techniques, which we divide into eight types. We perform our experiments on four harmful content detection benchmarks. We use six approaches to text classification: three of them are traditional (logistic regression, random forest, and linear support vector classifier), while the others are based on deep learning paradigms (convolutional neural networks, bidirectional encoder representations from transformers (BERT), and RoBERTa). In particular, we are the first to carry out an extensive comparison of text preprocessing techniques for Twitter texts using two transformer-based models. Since transformers are currently widely used for this task and demonstrate state-of-the-art results on many benchmarks, these results are important for social network researchers and machine learning specialists. Our results show that some preprocessing techniques can increase the model's performance, while others decrease the scores. We also demonstrate that the efficiency of text preprocessing depends on the selected approach to text classification and the characteristics of the dataset. Thus, selecting a classification model and formatting the dataset are crucial steps for hate and offensive speech detection. We identify effective techniques separately for each dataset and model. Then we experiment with two ways to combine techniques.

The main contributions of the paper can be summarized as follows: (a) numerous preprocessing techniques were evaluated and analyzed in terms of their effectiveness on several Twitter datasets and machine learning models; (b) two strategies for combining techniques were investigated; (c) it was shown that the choice of preprocessing techniques affects the classification performance, while the characteristics of the dataset are also very important and often play a key role in the effectiveness of preprocessing. The paper is organized as follows. Section 2 presents a brief exploration of related work. Section 3 describes the considered text preprocessing techniques, utilized datasets, models, and evaluation metrics. Section 4 reports and discusses the results for the separate use of techniques and for technique combinations. Section 5 describes the limitations of the study. Section 6 concludes the paper.

2 Related work

The impact of text preprocessing on the task of hate and offensive speech detection has been widely discussed. Previous studies investigated the impact of different preprocessing techniques and attempted to draw general conclusions about their contribution to text classification results.

To date, several competitions in hate and offensive speech detection have been held as part of major workshops on natural language processing. The organizers of these competitions presented overviews that generalize the results obtained by participants, including issues related to the utilized text preprocessing techniques. For instance, during the Semeval-2019 shared task on multilingual detection of hate speech against immigrants and women in Twitter (Basile et al. 2019), most of the submitted systems adopted traditional preprocessing techniques, such as tokenization, lowercasing, and the removal of stopwords, URLs, and punctuation. Some participants investigated Twitter-driven preprocessing procedures such as splitting hashtags into separate words, converting slang into correct English, and converting emoji into words. In particular, the authors of Montejo-Ráez (2019) converted all mentions to a common tag and tokenized hashtags. In Ameer et al. (2019), the texts were stemmed and cleaned of stopwords. The authors of Garain and Basu (2019) removed links, mentions, and redundant spaces.

During the Semeval-2020, the shared task on multilingual offensive language identification in social media was conducted (Zampieri et al. 2020). Most teams performed some kind of preprocessing or text normalization. The most common preprocessing techniques were converting emoji to plain text, segmenting hashtags, expanding abbreviations, replacing profane words, correcting errors, lowercasing, stemming, and/or lemmatizing. Other techniques included the removal of users' mentions, URLs, hashtags, emoji, special characters, and/or stopwords. The winner of Task A (Offensive Language Detection) (Wiedemann et al. 2020) used a RoBERTa-based model (Liu et al. 2019). The winning solution of Task B (Categorization of Offensive Language) and Task C (Offensive Language Target Identification) (Wang et al. 2020) represented a multilingual method using pretrained language models: ERNIE (Zhang et al. 2019) and XLM-R (Conneau et al. 2020). Neither winning team specified any preprocessing techniques in its paper.

The shared tasks on hate and offensive speech detection for English were also conducted as part of the HASOC competitions in 2019–2021 (Mandl et al. 2019, 2020; Modha et al. 2021). The winner of HASOC2019 (Wang et al. 2019) proposed an LSTM-based approach (Hochreiter and Schmidhuber 1997) and used the following preprocessing scheme: the words were retained for hashtags, username mentions were tokenized, all contractions were split into two tokens, and emoji were replaced with the corresponding words using emotion lexicons. In 2020, the first-place solution (Mishra et al. 2020) was based on an LSTM that used GloVe embeddings (Pennington et al. 2014) as input. Initially, the texts were converted into lowercase; then all punctuation marks were removed from the texts. In 2021, the winner of the task did not submit a system description paper. The second- and third-place solutions (Bölücü and Canbay 2021; Glazkova et al. 2021) were based on a graph convolutional network (GCN) (Wang et al. 2020; Liao et al. 2021) and Twitter-RoBERTa (Barbieri et al. 2020), respectively. In the first case, the authors used tokenization of hashtags, removed repeated characters, and preprocessed emphasis and censored words. In the second case, all users' mentions were replaced with a special placeholder, and the URLs were removed.

Other recent shared tasks related to hate and offensive speech detection in English posts include shared tasks on toxic span detection (Pavlopoulos et al. 2021), identification of hate and offensive speech in code-mixed postings (Modha et al. 2022), online sexism detection (Kirk et al. 2023), and identification of hate speech in multimodal content (Thapa et al. 2023). These shared tasks are summarized in Table 1.

Table 1 Summary of recent shared tasks aimed at hate and offensive speech detection in English

In addition to the listed studies, some authors compared text preprocessing techniques for the task of hate and offensive speech detection. A large-scale study comparing text preprocessing techniques for hate speech detection on Twitter was presented in Naseem et al. (2021). The authors compared twelve preprocessing techniques for both traditional and deep learning classifiers. Deep learning approaches included convolutional neural networks, LSTM, and BiLSTM. The authors recommended a combination of preprocessing techniques based on experiments on three datasets. The best-performing techniques were lemmatization and lowercasing of words, while the worst-performing techniques were removing punctuation, URLs, users' mentions, and hashtag symbols. Results varied with different learning algorithms, which confirmed that choosing a suitable learning algorithm is a considerable factor in text classification performance. The authors stressed the importance of investigating various combinations of preprocessing techniques and their interactions. However, in that study, some similar techniques were evaluated together; for example, the removal of URLs, hashtags, and mentions was performed simultaneously.

Since harmful content detection and sentiment analysis are closely related tasks (Zhou et al. 2021; Plaza-Del-Arco et al. 2021), we also reviewed research evaluating the effectiveness of text preprocessing techniques for sentiment analysis of tweets. In Angiani et al. (2016), the authors compared several preprocessing techniques (stemming, removal of stopwords, processing of emoticons) utilizing the naive Bayes classifier. They achieved an improvement over the baseline result through the use of stemming. A comparison of 16 preprocessing techniques across four traditional machine learning algorithms on two datasets was presented in Symeonidis et al. (2018). Lemmatization, removing numbers, and replacing contractions improved the performance of classifiers, while other techniques did not. The authors of Alam and Yao (2019) showed that the accuracy of some traditional machine learning algorithms can be significantly improved after applying several preprocessing steps: removing emoticons and stopwords, and stemming. In Ramachandran and Parvathi (2019), the authors showed a positive effect of stopword removal on the performance of the naive Bayes classifier. A summary of the listed studies is provided in Table 2. Research on the effectiveness of preprocessing techniques for sentiment analysis is not limited to Twitter texts. A number of studies investigated features for sentiment analysis of spam reviews (Saeed et al. 2018, 2020, 2021, 2022), messages from StockTwits (Renault 2020), news (Štrimaitis et al. 2021; Dogru et al. 2021; Oliveira and Merschmann 2021), etc.

Table 2 Summary of studies performing an evaluation of preprocessing techniques for hate speech detection and sentiment analysis for Twitter datasets

Text preprocessing is an important step in creating classification models. Currently, a large number of studies on hate and offensive speech detection have been performed. In addition, several works were devoted to the comparison of text preprocessing techniques for this task. Existing research on tweet preprocessing techniques for hate and offensive speech detection mostly evaluates the effectiveness of technique combinations or a limited set of technique types. This study aims to overcome this research gap. We perform a large-scale comparison of separate preprocessing techniques on several hate and offensive speech detection benchmarks. In addition, we are the first to conduct such a large-scale analysis for two transformer-based models. We try combinations of individually effective techniques, as well as experiment with two ways of generalizing techniques into combinations.

3 Methods

3.1 Preprocessing techniques

In this study, 26 commonly used preprocessing techniques were evaluated. All considered techniques were divided into the following types:

  • basic techniques, such as converting to lowercase, lemmatizing, stemming, and removing special characters;

  • handling of digits (removing, tokenizing, and converting to words);

  • handling of URLs (removing and tokenizing);

  • handling of mentions (removing and tokenizing);

  • handling of emoji and emoticons (removing, tokenizing, and converting to textual description);

  • handling of hashtags (removing, tokenizing, and segmenting into separate words);

  • lexical transformations, i.e., removing stopwords, decontraction, replacing acronyms, and tokenizing profane lexicon;

  • corrections, including spelling correction and removing repeated letters.

Sections 3.1.1 through 3.1.8 contain detailed descriptions of each considered type. For better presentation, the correspondence between the preprocessing techniques utilized in this work and their sequential numbers is listed in Table 3. To implement the preprocessing techniques, the following libraries were used: tweet-preprocessor, NLTK (Bird 2006), num2words, emoji, pyspellchecker, and better_profanity. Stemming was performed using the Snowball stemmer (Porter 2001).

Table 3 Preprocessing techniques

3.1.1 Basic techniques

The most common techniques used for preprocessing all types of texts are translating to lowercase (1), normalization (lemmatization (2) or stemming (3)), and removing special characters (4). These techniques were used to train baseline models. In Sect. 4.1, we experiment with the successive exclusion of basic techniques from the baseline.

3.1.2 Handling of digits

Common approaches to preprocessing digits in text are removing digits (5), tokenizing digits (6), and converting them to words (7). Tokenization means replacing digits with special tokens. For example, consider the following original tweet:

[Figure omitted: an example tweet shown before and after removing, tokenizing, and converting digits.]
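A minimal sketch of these three transformations, assuming the num2words package cited above; the $NUMBER$ token is our illustrative choice, not a value fixed by the paper:

```python
import re
from num2words import num2words

def remove_digits(text):
    # Technique 5: drop every digit sequence.
    return re.sub(r"\d+", "", text)

def tokenize_digits(text, token="$NUMBER$"):
    # Technique 6: replace every digit sequence with a special token.
    return re.sub(r"\d+", token, text)

def digits_to_words(text):
    # Technique 7: spell out each digit sequence as English words.
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(digits_to_words("party with 3 friends and 12 pizzas"))
# -> "party with three friends and twelve pizzas"
```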

3.1.3 Handling of URLs

We explored two popular techniques for preprocessing URLs: removing URLs (8) and tokenizing URLs (9). For instance, consider the following tweet:

[Figure omitted: an example tweet shown before and after removing and tokenizing URLs.]

Preprocessing of URLs is one of the most common steps for tweet preparation (Banerjee et al. 2021; Menini et al. 2021).
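As a sketch, both techniques can be implemented with the tweet-preprocessor library listed in Sect. 3.1; the $URL$ token below is that library's default:

```python
import preprocessor as p  # the tweet-preprocessor package

tweet = "check out this article https://t.co/abc123"

p.set_options(p.OPT.URL)   # restrict processing to URLs only
print(p.clean(tweet))      # removing URLs (8):   "check out this article"
print(p.tokenize(tweet))   # tokenizing URLs (9): "check out this article $URL$"
```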

3.1.4 Handling of mentions

For user mentions, we used the same techniques as for the previous class, i.e., removing mentions (10) and tokenizing mentions (11). A tweet contains mentions when it includes another person’s username anywhere in its text. User mentions start with the “@” symbol. For example, the tweet:

#preprocessing is a crucial part of @ML projects.

will be transformed after removing mentions to:

#preprocessing is a crucial part of projects.

and after tokenizing mentions to:

#preprocessing is a crucial part of $MENTION$ projects.

Researchers often use preprocessing of mentions for tweet analysis, e.g., in Glazkova et al. (2021) and Banerjee et al. (2021).

3.1.5 Handling of emoji and emoticons

Emoticons are built from ordinary punctuation marks on a standard computer keyboard to represent a face with a particular expression, while emoji are graphic symbols with predefined names and codes (Bai et al. 2019). Here we evaluated the following techniques: removing emoji (12), tokenizing emoji (13), converting emoji to textual description (14), removing emoticons (15), tokenizing emoticons (16), and converting emoticons to textual description (17). The list of emoticons is given in “Appendix A.”

For example, consider the following tweet:

[Figure omitted: an example tweet shown before and after removing, tokenizing, and converting emoji and emoticons.]

Many researchers, in particular, the authors of Alshalan and Al-Khalifa (2020) and Ranasinghe and Hettiarachchi (2020), utilized preprocessing emoji and emoticons during hate and offensive speech detection.
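For emoji, a minimal sketch using the emoji package cited in Sect. 3.1 (replace_emoji requires emoji>=2.0); emoticon handling can follow the same pattern with a lookup table such as the one in “Appendix A”:

```python
import emoji

tweet = "i get to see my daddy today 😍"

print(emoji.replace_emoji(tweet, replace=""))         # removing emoji (12)
print(emoji.replace_emoji(tweet, replace="$EMOJI$"))  # tokenizing emoji (13)
print(emoji.demojize(tweet, delimiters=(" ", " ")))   # converting to words (14)
```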

3.1.6 Handling of hashtags

We considered the following techniques for preprocessing hashtags: removing hashtags (18), tokenizing hashtags (19), and hashtag segmentation (20). Hashtag segmentation is the technique that breaks a hashtag into its constituent tokens and removes the “#” symbol (Kodali et al. 2022). To split hashtags, we utilized an implementation of the maximum matching algorithm, i.e., given a string s, get all possible segmentations of s into dictionary words, then return the “longest” segmentation (Reuter et al. 2016). This implementation utilizes an English vocabulary from NLTK.

For example, the tweet:

i get to see my daddy today!! #80days #gettingfed

will be transformed to:

i get to see my daddy today!! (removing hashtags);

i get to see my daddy today!! $HASHTAG$ $HASHTAG$ (tokenizing hashtags);

i get to see my daddy today!! 80 days getting fed (hashtag segmentation).

Previous studies widely used hashtag preprocessing for hate and offensive speech detection, for example in Caselli et al. (2020) and Toraman et al. (2022).
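A sketch of one common greedy variant of maximum matching over the NLTK word list; a full implementation would also handle digits and casing (e.g., “#80days” → “80 days”):

```python
import nltk
nltk.download("words", quiet=True)
from nltk.corpus import words

VOCAB = set(w.lower() for w in words.words())

def segment(s):
    # Greedily match the longest vocabulary word at the start of s,
    # then recurse on the rest; return None if no segmentation exists.
    if not s:
        return []
    for i in range(len(s), 0, -1):
        if s[:i] in VOCAB:
            tail = segment(s[i:])
            if tail is not None:
                return [s[:i]] + tail
    return None

print(segment("greatday"))  # e.g., ['great', 'day']
```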

3.1.7 Lexical transformations

Here we considered several techniques related to the preprocessing of vocabulary. The first technique in this class is removing stopwords (21). Stopwords are extremely common words that are often excluded from texts because of their high prevalence, such as the, on, that, and of. This technique is widely used for tweet preprocessing, for example in Alshalan and Al-Khalifa (2020) and Das et al. (2021). The second technique in this class is decontraction (22), which replaces shortened combinations of functional words with their full analogues, e.g., I’ll → I will. In particular, decontraction was used in Naseem et al. (2019) for hate speech detection. The next technique is similar to the previous one, but we replace common online acronyms (23) instead of functional words, for example, gr8 → great. The lists of contractions and acronyms are given in “Appendix A.” The last technique in this class is tokenizing profane lexicon (24), i.e., replacing profane words with a special token using the better_profanity package.
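A sketch of techniques 22–24; the contraction and acronym tables below are tiny illustrative stand-ins for the full lists in “Appendix A,” and the token name profanity is our own choice:

```python
import re
from better_profanity import profanity

CONTRACTIONS = {"i'll": "i will", "can't": "cannot"}  # decontraction (22)
ACRONYMS = {"gr8": "great", "idk": "i do not know"}   # replacing acronyms (23)

def replace_from_table(text, table):
    # Replace each whole-word occurrence of a short form with its full form.
    for short, full in table.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.I)
    return text

def tokenize_profanity(text, token="profanity"):
    # Technique 24: censor profane words with better_profanity, then
    # collapse each run of censor characters into a single token.
    return re.sub(r"\*+", token, profanity.censor(text))

print(replace_from_table("i'll be gr8", {**CONTRACTIONS, **ACRONYMS}))
# -> "i will be great"
```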

3.1.8 Corrections

The last class contains two techniques related to correcting and normalizing spelling. The first technique is spelling correction (25) using a Levenshtein distance algorithm and the pyspellchecker package. The second technique is correcting words with repeated letters (26), for instance, yeeeeees → yes. These techniques are commonly utilized in tweet analysis, for example in Mohammad (2018) and Hu et al. (2020).
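A sketch of both corrections with pyspellchecker; applying repetition removal before spell-checking is our assumption about the ordering, not the paper's exact pipeline:

```python
import re
from spellchecker import SpellChecker

spell = SpellChecker()

def remove_repetitions(text):
    # Technique 26: collapse runs of three or more repeated letters,
    # e.g., "yeeeeees" -> "yes".
    return re.sub(r"(\w)\1{2,}", r"\1", text)

def correct_spelling(text):
    # Technique 25: Levenshtein-based correction of each token;
    # fall back to the original word if no candidate is found.
    return " ".join(spell.correction(w) or w for w in text.split())

print(correct_spelling(remove_repetitions("yeeeeees it wooorks")))
# -> "yes it works"
```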

3.2 Datasets

To date, several datasets for hate and offensive speech detection on Twitter have been published. Most of them consist of tweets manually labeled as hate/offensive and neutral. In this work, the task of binary classification was considered since this formulation of the task is the most common (Fortuna and Nunes 2018; Poletto et al. 2021). The following datasets were utilized:

  • HateBase (Davidson et al. 2017), the dataset contains tweets of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech (neutral). In this work, the dataset was divided into two subsets. The first comprises hate and neutral tweets (hb_hate). The second consists of offensive and neutral tweets (hb_off). This division allows us to compare the preprocessing techniques that are effective for hate and offensive speech detection, respectively.

  • HASOC2020 (hasoc) (Mandl et al. 2020), the corpus has two labels: neutral tweets and tweets containing hate, offensive, or profane content.

  • OLID (olid) (Zampieri et al. 2019), this dataset contains examples of two classes, namely offensive and not offensive. The first class includes tweets containing inappropriate language, insults, or threats. The texts from the second class are neither offensive nor profane.

  • HatEval (heval) (Basile et al. 2019), the corpus was collected for detecting hateful content in social media texts, specifically in Twitter’s posts, against two targets: immigrants and women. It consists of neutral and hateful tweets.

Most hate and offensive speech datasets are sampled by crawling social media platforms using keywords considered relevant for harmful content, scraping hashtags, or extracting timelines of harmful content spreaders. All these methods may introduce a bias because they might limit the collection to topics and words that are remembered (Mandl et al. 2020). Empirical studies (Wiegand et al. 2019; Davidson et al. 2019) have pointed out that these approaches may lead to bias. The following approaches were used to collect the datasets listed above. The authors of HateBase utilized a lexicon of words and phrases identified by internet users as hate speech, compiled by Hatebase.org. They searched for tweets containing terms from the lexicon and extracted the timeline of each communicant. From this corpus, the authors took a random sample of tweets containing terms from the lexicon and had them manually coded by CrowdFlower workers. To obtain potentially hateful tweets for HASOC2020, a support vector classifier was trained on the OLID and HASOC2019 (Mandl et al. 2019) datasets. The authors of the dataset considered all the tweets that were classified as hateful by this weak classifier and randomly added five percent of the tweets that were not classified as hateful. This set of English tweets was then manually annotated. For OLID, the crowd-sourcing platform Figure Eight was used for annotation. For HatEval, the data were collected using different gathering strategies: (1) monitoring potential victims of hate accounts, (2) downloading the history of identified haters, and (3) filtering Twitter streams with keywords, i.e., words, hashtags, and stems. Most of the training set of tweets against women was derived from an earlier collection carried out in the context of two previous challenges on misogyny identification (Fersini et al. 2018a, b).

The data statistics are summarized in Table 4. The average length is calculated in terms of the count of tokens determined using NLTK (Bird 2006). The users’ mentions and URLs are represented in the OLID dataset in a unified manner. Therefore, we have not applied tokenizing URLs (technique 9) and tokenizing mentions (11) to this dataset.

Table 4 Data statistics

3.3 Models

To provide a comprehensive comparison of preprocessing techniques, we chose six classification methods: three traditional supervised methods and three deep learning methods.

  • Logistic Regression (LR), a supervised machine learning method that analyzes the relationship between the variables and classifies data into discrete classes. In Oriola et al. (2020), LR was used for detecting offensive and hate speech in South African tweets. Some researchers (Ashraf et al. 2022; Huang et al. 2020; Silva et al. 2020) also used LR as a baseline.

  • Random Forest (RF), an ensemble supervised method that constructs a set of decision trees at training time. The authors of Alfina et al. (2017) and Nugroho et al. (2019) used RF for detecting hate speech in social media.

  • Linear Support Vector Classifier (LSVC), a support vector machine model that applies a linear kernel function to perform classification. LSVC was applied for hate and offensive speech detection in Balouchzahi and Shashirekha (2020) and Fromknecht and Palmer (2020).

  • Convolutional Neural Network (CNN), an artificial neural network composed of multiple building blocks, such as convolution layers, pooling layers, and dense layers (Kim 2014). Various studies have assessed the efficacy of CNN for hate and offensive speech detection (Badjatiya et al. 2017; Alshalan and Al-Khalifa 2020; Zhou et al. 2020).

  • Bidirectional Encoder Representations from Transformers (BERT), a model pretrained on the English language using a masked language modeling objective (Devlin et al. 2019). This model has achieved great success in many natural language processing tasks, including text classification and hate speech detection (for example, Caselli et al. 2021; Li et al. 2021).

  • Robustly optimized BERT approach (RoBERTa), the model has the same architecture as BERT, but uses a byte-level byte pair encoding as a tokenizer and a different pretraining scheme (Liu et al. 2019). Many researchers utilized RoBERTa for hate speech detection, in particular, in Glazkova et al. (2021), Wiedemann et al. (2020), and Alonso et al. (2020).

The set of models was formed to provide methodological diversity through the different approaches that underlie them (statistical models, support vector machines, decision trees, neural networks). The selected traditional machine learning models showed high performance in comparisons of algorithms in related works (in particular, in Krouska et al. (2016) and Jianqiang and Xiaolin (2017)). CNN, BERT, and RoBERTa are widely used solutions for text classification. For example, most of the models presented in Zampieri et al. (2020) were based on BERT- or RoBERTa-style transformers. Among the classic neural network architectures, CNN was the most popular.

To train LR, RF, and LSVC, a bag-of-words model with 10,000 features was built. LR, RF, and LSVC were implemented using Scikit-Learn (Pedregosa et al. 2011) with the default settings. The parameters of the traditional classifiers are listed in “Appendix B.”
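A minimal sketch of this traditional setup; the placeholder tweets and labels, and the choice of CountVectorizer for the bag-of-words representation, are our assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["you are great", "you are awful"]  # placeholder data
labels = [0, 1]                              # 0: neutral, 1: harmful

clf = make_pipeline(
    CountVectorizer(max_features=10_000),  # bag-of-words, 10,000 features
    LogisticRegression(),                  # Scikit-Learn defaults
)
clf.fit(tweets, labels)
print(clf.predict(["you are awful"]))  # -> [1]
```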

CNN was implemented using Keras (https://github.com/fchollet/keras) with a batch size of 256, 256 filters, and a weight decay of 1e−4. The model plot is shown in “Appendix C” (Fig. 3). The parameters of CNN were selected using grid search on the HASOC2020 dataset with baseline preprocessing: translating to lowercase, lemmatization, and removing special characters. The range of the parameters for grid search is presented in “Appendix C” (Table 14). CNN was trained for up to 100 epochs; however, the actual training time was considerably shorter since strict early stopping was used.

The CNN was trained using FastText word vectors constructed on texts from Wikipedia 2017, the UMBC WebBase corpus, and the statmt.org news dataset (Joulin et al. 2016). The FastText vectors have 300 dimensions. We chose FastText word vectors as they are a widespread approach to generating word representations for different text mining tasks, especially sentiment analysis and hate speech detection. The authors of Kaibi and Satori (2019) showed the effectiveness of FastText representations compared to Word2Vec (Mikolov et al. 2013) and GloVe (Pennington et al. 2014) on Twitter datasets for sentiment analysis using six machine learning algorithms.
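A sketch of such a CNN in Keras with the reported batch size, filter count, and weight decay; the kernel size, dense layer width, sequence length, and the mapping of weight decay to an L2 kernel regularizer are illustrative assumptions, and the embedding matrix is assumed to be filled with the 300-dimensional FastText vectors elsewhere:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20_000, 50, 300
# Placeholder: each row should hold the FastText vector of one vocabulary word.
embedding_matrix = np.zeros((VOCAB_SIZE, EMB_DIM))

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(
        VOCAB_SIZE, EMB_DIM,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False),
    layers.Conv1D(256, 3, activation="relu",                 # 256 filters
                  kernel_regularizer=regularizers.l2(1e-4)),  # weight decay
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary: harmful vs. neutral
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=256, epochs=100,
#           callbacks=[keras.callbacks.EarlyStopping(patience=1)])
```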

To implement BERT and RoBERTa, we used BERT-base-uncased and RoBERTa-base, respectively, as well as Simple Transformers (Rajapakse 2019) and PyTorch (Paszke et al. 2019). Each model was fine-tuned for two epochs with a maximum sequence length of 128, a learning rate of 4e−5, and a training batch size of 8. We used the AdamW optimizer (Loshchilov and Hutter 2018) with an epsilon of 1e−8 and betas of (0.9, 0.999) for computing running averages of the gradient and its square.
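A sketch of this fine-tuning setup with Simple Transformers; the two-row DataFrame is a placeholder for the real training folds:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame({"text": ["you are great", "you are awful"],
                         "labels": [0, 1]})

model = ClassificationModel(
    "roberta", "roberta-base",
    args={"num_train_epochs": 2,
          "max_seq_length": 128,
          "learning_rate": 4e-5,
          "train_batch_size": 8,
          "adam_epsilon": 1e-8},
    use_cuda=False,  # set True on a GPU machine
)
model.train_model(train_df)
```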

3.4 Evaluation metrics

Since we used unbalanced datasets in this study, the results were evaluated in terms of the weighted-average F1-score (F1). The weighted-average F1 is calculated as the mean of the per-class F1 scores weighted by the number of true instances of each label.
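Formally, for \(K\) classes, where class \(i\) has \(n_i\) true instances and \(N = \sum_{i=1}^{K} n_i\), the weighted-average F1 is

\(F1_{\mathrm{weighted}} = \sum_{i=1}^{K} \frac{n_i}{N}\, F1_i.\)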

To obtain more reliable results, we performed fivefold cross-validation for the traditional supervised methods (LR, RF, and LSVC) and threefold cross-validation for the neural models. We used a smaller number of folds for the neural models due to computational constraints. However, the division into folds was performed with a fixed random seed; thus, for a given model, the techniques were compared on the same test data. All metrics are calculated as mean values over all folds. The values of F1 are presented in the body of the paper, while the values of standard deviation across the folds are shown in “Appendix D.”
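A sketch of this evaluation protocol; the seed value, the use of StratifiedKFold, and the make_model factory are our assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, make_model, n_splits=5, seed=42):
    # X, y: NumPy arrays; make_model: factory returning a fresh classifier.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, average="weighted"))
    return np.mean(scores), np.std(scores)  # mean and std of F1 across folds
```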

3.5 Baselines

We used the following preprocessing for baselines:

  • Traditional models (LR, RF, and LSVC) and CNN: translating to lowercase, lemmatization, and removing special characters. Lemmatization was performed using NLTK WordNet Lemmatizer;

  • BERT: lowercase;

  • RoBERTa: raw text.

In the next section, we evaluate adding and removing basic techniques compared to baseline preprocessing.

4 Results and discussion

4.1 Evaluation of basic techniques

Table 5 reports the effect of adding and removing basic techniques relative to the baseline preprocessing in terms of F1-score. The table also indicates the values of standard deviation across the folds. For example, according to Sect. 3.5, LR used the following baseline preprocessing: translating to lowercase, lemmatization, and removing special characters. LR baseline − (1) indicates the LR model with baseline preprocessing, with the exception of converting to lowercase. LR baseline − (2) is the LR model with baseline preprocessing but without lemmatization. LR baseline + (3) indicates LR with baseline preprocessing but using stemming instead of lemmatization. LR baseline − (4) is the LR model with translating to lowercase and lemmatization but without removing special characters. The values that outperform the baselines are shown in bold.

Table 5 Basic techniques, F1 (%). The indices of the techniques are listed in Table 3

The table demonstrates that lemmatization, translating to lowercase, and removing special characters are helpful for traditional machine learning models. Stemming can also help to increase the performance of traditional machine learning models, and for most datasets, it works better than lemmatization.

For CNN, the effect of lemmatization and translation to lowercase is not evident. In general, stemming performs worse than lemmatization for CNN. Removing special characters increases the performance of CNN for most corpora.

As for transformer-based models, the effect of lowercasing expectedly depends on what text preprocessing was used for model pretraining. Since we used BERT-base-uncased and RoBERTa-base (a cased model), BERT performs better with lowercasing, while RoBERTa in most cases shows higher results when lowercasing is not used. Lemmatization does not significantly affect the performance. Removing special characters and stemming generally do not increase the results for either model.

4.2 Separate evaluation for techniques

We estimated the performance of each model on all corpora, adding one preprocessing technique at a time to the baseline model. The evaluation results in terms of the weighted-average F1-score are presented in Tables 6 and 7. The first rows of Tables 6 and 7 show the performance of the baseline models, i.e., the models trained on data preprocessed by the techniques specified in Sect. 3.5. The following rows demonstrate the performance of the models trained on texts preprocessed using the baseline techniques and one additional technique from the list (techniques 5–26 in Table 3). Thus, we evaluated the effectiveness of preprocessing techniques one at a time.

In Tables 6 and 7, the scores for tokenizing URLs (9) and tokenizing mentions (11) are missing for the OLID dataset because users' mentions and URLs already appear in tokenized form in this corpus. The scores for preprocessing emoji (12–14) are missing for HateBase since this dataset does not contain emoji. The values exceeding the baseline are shown in bold. The standard deviation values across the folds are given in “Appendix D.”

Table 6 Separate evaluation of techniques for logistic regression, random forest, and linear support vector classifier, F1 (%)
Table 7 Separate evaluation of techniques for Convolutional Neural Network, BERT, and RoBERTa (RB), F1 (%)

Based on the overall results, we formed five categories depending on the performance of the techniques on different corpora (the dataset level, Table 8). For example, converting emoticons to words (17) and decontraction (22) increase the performance of baselines for most of the models under consideration (in four cases out of six) on HateBase (hb_hate). We have also visualized the effectiveness of preprocessing techniques for each model (the model level, Table 9). For instance, tokenizing mentions (11) and emoticons (16) improved the results of the BERT baseline on all the datasets used. In the next subsections, we discuss the performance of each type of preprocessing techniques.

Table 8 Effect of techniques per dataset
Table 9 Effect of techniques per model

4.2.1 Handling of digits

Removing digits (5) degrades the performance on most of the datasets and for most of the models. For RF, this technique worsens the scores on all the corpora. For LR, the performance increases for three out of five datasets. However, the increments are slight (0.01% on HASOC2020, 0.02% on OLID, and 0.26% on HatEval) and may therefore be due to chance.

Tokenizing digits (6) looks quite effective on HateBase (hb_off), OLID, and HatEval (four models out of six on each dataset). This technique shows higher scores in comparison with the baselines for three models, i.e., in half of the cases. Overall, the effect of tokenizing digits does not have a pronounced character in our experiments.

The results for converting digits to words (7) are broadly similar to the results of the previous technique. For OLID and HatEval, the scores increase for most of the models (five out of six for OLID and four for HatEval). For the other datasets, the reported scores are below the baselines. However, it is worth noting that converting digits to words in most cases improves the performance of transformer-based models (four out of five datasets for BERT and three for RoBERTa). For the other models, the scores worsen for the majority of corpora. This may be because pretrained language models are more focused on understanding words than numbers (Wallace et al. 2019; Rogers et al. 2020).

4.2.2 Handling of URLs

Removing URLs (8) shows an improvement for all models in the case of OLID. The URLs in OLID are replaced with a token URL, so in this case we compare the effect of removing and tokenizing URLs. For three datasets (HateBase (hb_hate), HASOC2020, and HatEval), there is a marked deterioration in performance for most models. Removing URLs also demonstrates ambiguous results at the model level: the results improve on the majority of datasets for three models (RF, BERT, and RoBERTa) and worsen for most of the corpora for the three other models (LR, LSVC, and CNN).

Tokenizing URLs (9) does not perform well. This technique reduces the scores in most cases for all datasets. The picture across models is similar: performance growth is observed only for BERT (on three out of five datasets).

4.2.3 Handling of mentions

Removing mentions (10) leads to a performance decrease for all the models on HateBase (hb_off). For HASOC2020, this technique does not show any improvement for the majority of models. For OLID, which contains tokenized mentions, the performance increases in most cases. Otherwise, there is no evident effect of removing mentions. The technique fails on all datasets with LSVC and in most cases with all types of neural networks; however, more often than not, removing mentions exceeds the baseline results for LR and RF.

Tokenizing mentions (11) worsens the results in most cases for all datasets. However, this technique leads to performance increases with BERT (on all datasets). The results for RoBERTa are ambiguous (an increase on HateBase (hb_hate) and HASOC2020, a decrease on HateBase (hb_off) and HatEval). For traditional machine learning methods, the technique does not beat the baselines in the majority of cases.

4.2.4 Handling of emoji and emoticons

The HateBase dataset does not contain emoji. Therefore, we do not evaluate the techniques related to the handling of emoji on this corpus.

Removing emoji (12) shows mixed results across the datasets (improvement using five models out of six on OLID, three models on HASOC2020, and two models on HatEval). It is successful for LR (performance growth on all datasets), RF, and BERT (improvement for two datasets out of three). For LSVC, CNN, and RoBERTa, the scores are lower than the baseline in most cases.

Tokenizing emoji (13) increases the performance for the majority of models on HASOC2020 and HatEval but worsens the results on OLID. This technique improves scores on all datasets using LR and RF. For BERT and RoBERTa, tokenizing emoji helps to beat the baseline for two datasets out of three. For LSVC and CNN, the results are decreased in two of three cases.

Converting emoji to words (14) generally does not have a significant effect on the performance of the classification. The technique demonstrates lower performance in most cases for all traditional approaches, CNN, and RoBERTa. For BERT, the performance is slightly improved on HASOC2020 and HatEval.

The proportion of tweets containing emoticons is not large. The largest proportion of emoticons is present in the HateBase dataset (2.11% for hb_hate and 1.5% for hb_off). For the rest of the datasets, the proportion of tweets with emoticons is less than one percent.

In our experiments, removing emoticons (15) shows an improvement in most cases for OLID and HatEval, produces mixed results for HASOC2020, and demonstrates a deterioration of the scores for HateBase. Despite the ambiguous results at the dataset level, this technique performs well for both transformers (improvement on three and four datasets for BERT and RoBERTa, respectively) and LR (on three datasets). For the three remaining methods, the removal of emoticons leads to a deterioration in most cases.

Tokenizing emoticons (16) does not have a strong influence on the results. For most datasets and models, the results are about equally often slightly improved or worsened. However, for BERT, the tokenization of emoticons causes a small increase across all datasets (from 0.02% to 0.66%).

Converting emoticons to words (17) shows an improvement for most or half of the models across all datasets. The technique also demonstrates an improvement of the scores in most cases for all models except RoBERTa, for which the results improve only in two cases out of five.

In general, the significance of emoticons as features for the task of hate and offensive speech detection looks higher than the significance of emoji. Apparently, this is because emoticons more often carry a pronounced sentiment, while emoji are often used to express a wider range of emotions, including fear, surprise, joy, and so on (Guibon et al. 2016).

4.2.5 Handling of hashtags

Removing hashtags (18) leads to ambiguous results among the datasets. The performance decreases on HateBase (hb_off) for all models and on HASOC2020 for most models. For OLID, the scores improve in most cases. For HateBase (hb_hate), the performance decreases for half of the models. At the model level, the results are also ambiguous, but in most cases this technique negatively influences the results.

Tokenizing hashtags (19) shows a deterioration for all models on the HateBase (hb_off) dataset. For the rest of the datasets, the technique does not have a significant impact on the results. At the model level, the tokenization of hashtags demonstrates a deterioration for all datasets when using LSVC and CNN. In general, the technique rather worsens the results.

Hashtag segmentation (20) also has a negative impact on the results for most models and datasets. The least deterioration is observed for HASOC2020 (a decrease in performance in 50% of cases). However, HASOC2020 contains only 6% of tweets with hashtags, which is the smallest proportion among all datasets. At the model level, the results improve for three datasets out of five using LR and LSVC. RoBERTa shows a deterioration across all datasets. For the other models, the technique mainly worsens the results.

In our experiments, the processing of hashtags shows a negative impact on the performance of the models for hate and offensive speech detection. This technique does not improve the results on any dataset. That is, hashtags generally matter as an important and indivisible part of the text, not as a set of meanings of the hashtag segments. However, for some traditional models (for example, LR and LSVC), an improvement can be obtained by hashtag segmentation.

4.2.6 Lexical transformations

Removing stopwords (21) demonstrates lower performance in most cases for all datasets. For HatEval, the results worsen across all models. This technique shows some of the worst results in our experiments. A slight improvement (on three of the five datasets) is obtained only for RF. For LR, LSVC, CNN, and RoBERTa, the results worsen on all datasets. This effect can be associated with the short length of tweets. Similar results were obtained in Baruah et al. (2019), where the removal of stopwords worsened the performance of BiLSTM. This technique also negatively affected the performance of decision trees and naive Bayes classifiers in Saeed et al. (2022) and of logistic regression and support vector classifiers in Garouani et al. (2021). The authors of Do et al. (2019) noted that it is not necessary to remove all stopwords to detect hate speech in social media texts because even a few stopwords affect the results.

Decontraction (22) and replacing acronyms (23) have no evident effect on the performance. Decontraction leads to improvement in most cases for HateBase (hb_hate), OLID, and HatEval. Replacing acronyms increases the scores for HateBase (hb_off) and HatEval in most cases and for HateBase (hb_hate) and OLID in half of the cases. Both techniques worsen the results on HASOC2020. Decontraction is also unhelpful for HateBase (hb_off). For each of these techniques, the performance of about half of the models generally improves, while the performance of the other models worsens.

Tokenizing profanity (24) shows poor results in our experiments. For four out of five datasets, the scores deteriorate for most models. For LR, LSVC, and RoBERTa, this technique leads to degradation on all datasets.

4.2.7 Corrections

The effectiveness of spelling correction (25) varies across datasets. For OLID and HatEval, it improves the results for most models, while for HASOC2020 the scores worsen across all models. The technique generally improves the results of LR (three datasets) and LSVC (four datasets), but for the other models, it leads to a worsening of scores for most datasets.

Removing repetitions (26) mostly improves the results on HateBase (hb_off) and HASOC2020, but it is not effective on the other datasets. The technique is impactful for BERT (three datasets) and RoBERTa (four datasets) but leads to a general performance degradation for the other models.

4.2.8 Summary for separate evaluation of preprocessing techniques

The choice of preprocessing techniques for hate and offensive speech detection in Twitter texts strongly depends on the specifics of the dataset and the applied model. However, our experiments on separate evaluation of preprocessing techniques allowed us to draw the following conclusions.

Mainly, the handling of digits has no positive effect on the classification performance. The impact of tokenizing digits and converting digits to words seems mixed. However, converting digits to words positively affects the performance of transformer-based models.

In our experiments, removing URLs performs better than their tokenization. This probably suggests that the presence of URLs is not associated with the presence of hate and offensive speech. For transformer-based models, removing URLs generally performs well. Based on our experiments, mentions should not be removed: it is important for hate and offensive speech detection that the tweet is addressed to someone. Tokenizing mentions works poorly for traditional methods; however, for transformers, tokenization generally improves the baseline results.

Removing and tokenizing emoji and emoticons produce no clear effect on the performance of text classification. In general, converting emoji to words worsens the results. On the contrary, by converting emoticons into words, we mainly achieve positive results. Probably, this is because emoticons usually express sentiment while emoji show a wide range of different feelings. The removal and tokenization of hashtags do not have a strong effect. We also assume that hashtags are important as complete semantic units because their segmentation worsens the results in our experiments.

As regards lexical transformations, profanity tokenization and removal of stopwords negatively impact the performance of hate and offensive speech detection in Twitter texts. Our experiments show that the meaning of profane words is important for classification, so we cannot replace them with tokens. Removing stopwords shows poor results since tweets are short and stopwords represent important semantic components of them. Further research might explore the impact of the proportion of removed stopwords. Decontraction and replacing acronyms demonstrate no benefit for hate and offensive speech detection. The results of spelling correction depend on the model used: this technique improves the results for the two traditional methods that use bag-of-words text representations (LR and LSVC). Removing repeated letters helps BERT and RoBERTa increase their scores by improving tokenization for these models.

In our experiments, we did not notice a consistent influence of the type of harmful content (hate or offensive) on the effectiveness of particular preprocessing techniques. However, the removal of URLs generally positively influenced the performance of offensive speech detection (HateBase (hb_off) and OLID), while for hate speech its influence was mainly negative. For both offensive speech datasets, tokenizing digits mostly positively affected the performance, while tokenizing hashtags, hashtag segmentation, removing stopwords, and tokenizing profanity had a negative effect. For the hate speech datasets (HateBase (hb_hate) and HatEval), converting emoticons to words and decontraction generally had a positive influence on the results, while removing digits, removing and tokenizing URLs, tokenizing mentions, hashtag segmentation, removing stopwords, and removing repetitions negatively affected the classification performance.

Table 10 Relative success of the technique (based on the average values of relative growths)

In Table 10, we ranked all preprocessing techniques based on the average values of relative growths. Relative growths were calculated as the value of the F1-score growth, expressed as a percentage, relative to the baseline model for each dataset. For example, the LR baseline model showed an F1-score of 93.02% on HateBase (hb_hate), and the same model with removing digits (5) achieved 92.83%. In this case, the relative growth is \((92.83-93.02)/93.02 \times 100 \approx -0.2\). Table 10 ranks techniques in accordance with the average values of relative growths for each pair “model - dataset” (column “All models”) and separately for each model (columns “LR,” “RF,” “LSVC,” “CNN,” “BERT,” and “RoBERTa”). For example, the “LR” column aggregates the relative growths over all datasets when using logistic regression. We can see that the greatest total relative growth, taking into account all the datasets, is obtained for stemming (3), and the lowest total relative growth is demonstrated for removing stopwords (21). We did not show the results for lowercase (1), lemmatization (2), and removing special characters (4), since these techniques were used for some baselines and not for others.

Decontraction (22), removing emoticons (15), replacing acronyms (23), removing hashtags (18), and converting emoticons to words (17) demonstrated the highest relative success across all models. However, the results varied depending on the model used. The top-5 effective techniques for LR were stemming (3), hashtag segmentation (20), decontraction (22), removing hashtags (18), and spelling correction (25). The best results for RF were achieved using tokenizing profanity (24), removing hashtags (18), stemming (3), tokenizing emoji (13), and replacing acronyms (23). For LSVC, the highest scores were obtained with stemming (3), decontraction (22), hashtag segmentation (20), tokenizing digits (6), and spelling correction (25). For all traditional methods, stemming showed a fairly high performance. The highest relative success for CNN was achieved using removing emoticons (15), converting emoticons to words (17), replacing acronyms (23), decontraction (22), and spelling correction (25). The top-5 techniques for BERT included decontraction (22), removing hashtags (18), removing emoji (12), tokenizing mentions (11), and converting digits to words (7). For RoBERTa, the most effective techniques were tokenizing hashtags (19), removing repetitions (26), removing emoticons (15), tokenizing emoji (13), and tokenizing digits (6). For both transformer-based models, tokenizing mentions (11) and emoji (13) and removing emoji (12), emoticons (15), and hashtags (18) were quite effective.

The values of the relative growths are also used to visualize the effectiveness of the techniques in Figs. 1 and 2. These figures show the performance increases or decreases obtained by each technique for each model. For instance, Fig. 1 demonstrates that the results of LR worsened on all datasets when removing stopwords (21). For tokenizing profanity (24), the results worsened for all datasets with the exception of HatEval.

Fig. 1 Stacked bar diagram for relative performance growth across traditional machine learning algorithms: LR, RF, and LSVC. The indices of the techniques are listed in Table 3

Fig. 2 Stacked bar diagram for relative performance growth across neural models: CNN, BERT, and RoBERTa. The indices of the techniques are listed in Table 3

4.3 Combinations of techniques

In this subsection, we attempt to combine the best techniques from the previous step. For this purpose, we use the following ways for technique combination.

  • Individual combination. For each model and dataset, we combine the techniques that outperformed the corresponding baseline during the separate evaluation. If the improvement is achieved using incompatible techniques (for example, removing and tokenizing digits), we choose the technique that has the greatest positive effect. If both techniques showed the same results (for example, techniques 16 and 17 for HateBase (hb_off)), we evaluate both of them.

  • Combination at the model level. For each model, we combine the techniques that lead to improvement on all, or on most, datasets (see Table 9). If the improvement is achieved using incompatible techniques, we choose the technique that shows an improvement for a larger number of datasets. In the case of an equal number of datasets, we take into account the average positive effect across all datasets.

The lists of the techniques included in technique combinations are presented in Table 11.
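Conceptually, each combination is applied as an ordered pipeline of text-to-text functions, as in the following sketch (the technique names in the comment are placeholders; the effect of the ordering itself is not studied here, see Sect. 5):

```python
def apply_combination(text, techniques):
    # Apply a chosen combination as an ordered pipeline;
    # each technique is a function mapping str -> str.
    for technique in techniques:
        text = technique(text)
    return text

# e.g., an individual combination for one model/dataset pair might be:
# apply_combination(tweet, [remove_urls, tokenize_mentions, remove_repetitions])
```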

Table 11 Technique combinations
Table 12 Results for technique combinations (F1-score, %)

The evaluation results for technique combinations are presented in Table 12. The scores that exceed the baselines are shown in bold. The asterisk (*) marks the overall best result for a specific model on a given dataset. Our results show that combining techniques that were successful in the separate evaluation does not always give the best effect. This is especially noticeable for the transformers, for which both technique combinations work worse than the corresponding baselines. The combination of successful preprocessing techniques works better for traditional methods. For HateBase (hb_hate), the individual combinations of preprocessing techniques show the highest in-dataset results for LR and RF. For HateBase (hb_off), both combinations improve the baseline results for RF. For HASOC2020, the individual combination increases the baseline scores for LR and CNN; for CNN, the combination at the model level also exceeds the baseline. The best result for the OLID dataset and RF was obtained using the corresponding combination at the model level. For HatEval, the individual combination shows the best results for RF and LSVC. The baseline result is also improved for LR (both combinations), RF and LSVC (combinations at the model level), and CNN (individual combination).

5 Limitations

The generalizability of the results is subject to certain limitations. For instance, messages on Twitter are restricted in length (originally 140 characters, 280 since late 2017). The Twitter language is shaped by this limitation and is usually unstructured and informal (Naseem et al. 2021). For texts of other genres and topics, the results can differ substantially from ours.

In this study, we did not aim to identify the best algorithm for detecting hate and offensive speech in Twitter messages. The research focused on the evaluation of preprocessing techniques using several machine learning approaches, as well as on the analysis and summarization of the results. For this reason, the approaches used in this study are common, standard ones. Further research might explore additional types of classifiers, such as the naive Bayes classifier or other types of neural networks (long short-term memory, generative adversarial networks, etc.).

During the evaluation of the combinations (Sect. 4.3), we did not evaluate the impact of the ordering of the techniques within a combination on the performance of the model. This issue can be explored in further research.

In this work, we evaluated the effectiveness of each technique in terms of the presence or absence of an increase in the weighted-average F1-score using several corpora and models. The results are presented as average values across folds, and the values of standard deviation are given additionally. A study of the statistical significance of the obtained values may also be a direction for further work.

6 Conclusion

This paper investigated the effect of 26 preprocessing techniques on tweet classification using four datasets for hate and offensive speech detection. Each preprocessing technique was evaluated using three traditional machine learning methods (logistic regression, random forest, and linear support vector classifier) and three deep learning methods (convolutional neural network, bidirectional encoder representations from transformers (BERT), and RoBERTa). We used a bag-of-words text representation for the traditional methods and FastText for CNN; both text representations are widely utilized for text classification. In this work, we separately evaluated the techniques and drew conclusions about their effectiveness for different datasets and methods. We examined a large number of techniques that had not previously been evaluated in a comparative study. We divided the preprocessing techniques into categories in accordance with their effectiveness at the model and dataset levels and ranked them based on their relative success across the datasets. We demonstrated that some techniques provided better classification results for some models, while others decreased the scores. Similar to previous research, our results showed that the effectiveness of text preprocessing techniques varies across datasets and methods. Therefore, the choice of the method for classifying texts of a particular dataset is a crucial step for hate and offensive speech detection.

Combining preprocessing techniques can also produce different results. We explored two ways to combine successful techniques. In general, technique combinations performed better for traditional methods than for deep learning methods. For transformer-based models, the results obtained using technique combinations were the worst among all models.

In future studies, we will investigate these techniques in different domains and explore other ways to construct effective combinations of preprocessing techniques. A further study could also assess higher-level preprocessing techniques, such as identifying synonyms and other semantic relations, as well as more detailed lexical features, such as detecting urban language and jargon. Another important issue for future research is testing these techniques on fine-grained datasets for hate and offensive speech detection.