1 Introduction

The pervasiveness of offensive content in social media has motivated the development of computational models that can identify various forms of such content, including aggression (Kumar et al., 2018, 2020; Casavantes et al., 2023), cyber-bullying (Rosa et al., 2019), sentiment (Kanfoud & Bouramoul, 2022; Vohra & Garg, 2023), emotion (Skenduli et al., 2021; Abdi et al., 2021), and hate speech (Davidson et al., 2017). Prior work has generally focused either on identifying conversations that are likely to derail (Zhang et al., 2018; Chang et al., 2020) or on identifying offensive content within posts, comments, or documents. This has also been the goal of recent popular competitions, e.g., SemEval 2019 and 2020 (Basile et al., 2019; Zampieri et al., 2019b, 2020; Modha et al., 2022; Satapara et al., 2023).

Substantial progress has been made on identifying offensive language in conversations and posts. Recently, with the goal of improving explainability, multiple post-level offensive language datasets have been annotated with respect to related phenomena such as humor (Meaney et al., 2021a), gender bias (Risch et al., 2021), and offensive token identification (Mathew et al., 2021). Prior work has addressed these tasks in isolation, building separate models for each (Ranasinghe & Zampieri, 2021). This study, however, hypothesizes that jointly modeling these tasks with post-level offensive language identification would improve the explainability of offensive language identification models. For example, systems capable of detecting offensive token spans would allow content moderators to quickly identify the objectionable parts of a post, especially in long posts. This, in turn, would allow moderators to more easily approve or reject the decisions of offensive language detection systems. Since these tasks are related, an ideal scenario for multi-task learning (MTL) emerges.

In MTL (Caruana, 1997), a model is designed to learn multiple tasks simultaneously using the same set of data. Parameters allocated to two (or more) tasks are shared throughout the optimization process (training), yielding models that often outperform single-task learning (STL) models while reducing potential overfitting (Zhang & Yang, 2022). Finally, given that MTL produces a single model rather than the multiple individual models (one per task) produced by STL, MTL is generally more environmentally friendly, demanding fewer computing resources (e.g. disk, memory) and less energy. This addresses recent efforts in Green AI (Schwartz et al., 2020) as well as a growing interest in the NLP community in steering towards both explainable AI and green AI (Danilevsky et al., 2020), as evidenced by recent workshops such as SustaiNLP.

In this paper, we propose an MTL approach that jointly tackles token-level and post-level offensive language identification and related tasks. As a first step, we jointly model token-level and post-level offensive language identification in a unified system. Then, we extend the MTL approach to model related post-level tasks. To the best of our knowledge, this is the first detailed evaluation of MTL in offensive language identification at both the post and token levels. With our MTL approach, we address four research questions based on performance, speed, efficiency, and generalizability, which we describe in detail in Section 4.2.

The main contributions of this work are:

  1. We develop an MTL model that jointly learns: (a) token-level and post-level offensive language identification for English; and (b) post-level offensive language identification and related tasks in a multilingual setting.

  2. We evaluate the resulting MTL model in terms of performance, efficiency, and generalization ability. We show that the proposed MTL model performs better than STL models at both the post and token levels and is noticeably faster to train than multiple STL models. We also evaluate the performance of our model on datasets containing Arabic, Bengali, German, Hindi, and Meitei data.

  3. We test the MTL model in zero-shot and few-shot learning scenarios and show that MTL performs better than STL models when fewer training samples are available. We demonstrate that MTL is better suited to zero-shot learning and that it generalises well across different languages and domains.

  4. We make the code and the trained models freely available to the community. Our complete multi-task framework, MAD (Multi-task Aggression Detection framework), will be released as an open-source Python package.

2 Related work

MTL has been employed extensively in computer vision (Girshick, 2015; Zhao et al., 2018) and in NLP tasks such as part-of-speech tagging and named entity recognition (Collobert & Weston, 2008), text classification (Liu et al., 2017), natural language generation (Liu et al., 2019a), and offensive language identification (Dai et al., 2020).

Talat et al. (2018) trained an MTL model on different post-level offensive language detection tasks and found that MTL vastly improves the performance on each task, allowing the overall model to generalize strongly to unseen datasets. In Abu Farha and Magdy (2020), sentiment prediction was used as an auxiliary task to detect offensive and hate speech in an MTL setup using a CNN-Bi-LSTM model. Past studies have also demonstrated the value of neural transformer multi-task learning models in achieving competitive results in offensive language identification shared tasks. Dai et al. (2020) trained an MTL model on all three levels of the OLID dataset (Zampieri et al., 2019a), while Djandji et al. (2020) trained an MTL model on two post-level tasks – identifying offense and identifying hate speech in Arabic texts – using AraBERT (Antoun et al., 2020). Nelatoori and Kommanti (2022) use a BiLSTM-based MTL model to identify toxic comments and spans. In recent work, Mathew et al. (2021) designed an MTL model based on transformers to detect both token-level and post-level offensive language. Apart from a few notable exceptions (e.g. MUDES (Ranasinghe & Zampieri, 2021)), and due to the lack of suitable available datasets, there has not been much work on developing statistical learning models that can detect offensive tokens. Our work fills this gap.

None of the studies discussed in this section provided an empirical evaluation of MTL in few-shot, zero-shot, and multilingual settings. To address this important gap, we evaluate MTL across these settings for token-level and post-level offensive language identification. Finally, we evaluate our MTL architecture on several post-level tasks related to offensive language identification across a wide range of languages.

3 Multitask architecture

Considering the success that transformers have demonstrated on various offensive language identification tasks, we chose a transformer as the base model for our MTL approach. Our approach learns several tasks jointly, including post-level and token-level tasks. The implemented architecture shares hidden layers between the post-level and token-level tasks. The shared portion includes a transformer model that learns shared representations (and extracts information) across tasks by minimizing a combined loss function. The task-specific classifiers receive input from the last hidden layer of the transformer language model and predict the output for each of the tasks (details are provided in the next two sections).

Post-level aggression detection

By utilizing the hidden representation of the classification token (CLS) within the transformer model, we predict the target labels (offensive/hate speech/normal) by applying a linear transformation followed by the softmax activation function (\(\sigma \)):

$$\begin{aligned} \hat{\textbf{y}}_{post} = \sigma (\textbf{W}_{[CLS]} \cdot \textbf{h}_{[CLS]} + \textbf{b}_{[CLS]}) \end{aligned}$$
(1)

where \(\cdot \) denotes matrix multiplication, \(\textbf{W}_{[CLS]} \in \mathcal {R}^{D \times 3}\), \(\textbf{b}_{[CLS]} \in \mathcal {R}^{1 \times 3}\) (one bias term per output class), and D is the dimension of the input layer \(\textbf{h}\) (the top-most layer of the transformer).

Token-level aggression detection

We predict the token labels (toxic/non-toxic) by also applying a linear transformation (also followed by the softmax) over every input token from the last hidden layer of the model:

$$\begin{aligned} \hat{\textbf{y}}_{token} = \sigma (\textbf{W}_{token} \cdot \textbf{h}_t + \textbf{b}_{token}) \end{aligned}$$
(2)

where t marks which token the model is to label within a token sequence of length T, \(\textbf{W}_{token} \in \mathcal {R}^{D \times 2}\), and \(\textbf{b}_{token} \in \mathcal {R}^{1 \times 2}\).
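To make the shared-encoder design concrete, the following minimal PyTorch sketch implements the two heads of (1) and (2). The class name, the base encoder, and the forward signature are illustrative assumptions rather than the released MAD code:

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskAggressionModel(nn.Module):
    """Shared transformer encoder with a post-level and a token-level head."""

    def __init__(self, encoder_name="roberta-base",
                 num_post_labels=3, num_token_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared layers
        hidden = self.encoder.config.hidden_size                # D in Eqs. 1-2
        self.post_head = nn.Linear(hidden, num_post_labels)     # Eq. 1
        self.token_head = nn.Linear(hidden, num_token_labels)   # Eq. 2

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state            # (batch, seq_len, hidden)
        # First position holds the [CLS] (for RoBERTa, <s>) representation
        post_logits = self.post_head(h[:, 0, :])
        token_logits = self.token_head(h)    # one prediction per input token
        return post_logits, token_logits
```

Both heads read from the same final hidden layer, so gradients from either task update the shared encoder, which is the mechanism behind the information sharing discussed below.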

In the MTL setting, we used different strategies to combine the losses from the two types of related tasks: token-level tasks and post-level tasks. We explain these in the following two sections.

Table 1 Four instances sampled from the dataset along with their respective annotations (Mathew et al., 2021)

4 Post-level and token-level

Data

The HateXplain dataset (Mathew et al., 2021) is, to our knowledge, the first benchmark dataset that contains both post-level and token-level annotations of hate speech and offensiveness. The dataset contains data collected from Twitter and Gab and is annotated using Amazon Mechanical Turk. Each instance in the dataset is annotated by three annotators along three dimensions: a label (“offensive”, “hate speech”, or “normal”), rationales (the tokens on which the labeling decision was based), and target communities (the groups of people denounced in the post). We present examples from the dataset in Table 1.

The dataset contains 20,148 posts (9,055 from Twitter and 11,093 from Gab), of which 5,935 instances are hateful, 5,480 are offensive, and 7,814 are normal. The dataset also contains 919 undecided posts for which all three annotators chose different labels. For the tasks of interest, we used the labels and rationales from the HateXplain dataset. A majority-vote strategy, in which at least two of the three annotators agree on an annotation, was used to determine the final label of each post and of the individual tokens in the rationales. We removed the 919 undecided posts from the final dataset, which was then split into 11,535 train, 3,844 development (dev), and 3,844 test instances. The distribution of labels and tokens in the final processed dataset is shown in Table 2; we observe that the train, dev, and test sets follow a similar imbalanced class distribution.

Table 2 The distribution of hate speech, offensive, and normal posts and the number of toxic and non-toxic tokens in the train, development (dev), and test sets
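The majority-vote aggregation described above can be sketched as follows; the function names and the representation of rationales as per-annotator 0/1 vectors are our own illustrative assumptions:

```python
from collections import Counter

def majority_label(annotations):
    """Final post label if at least two of the three annotators agree, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None   # None marks an 'undecided' post

def majority_token_mask(rationales):
    """A token is toxic when at least half of the annotators selected it."""
    n = len(rationales)                     # one 0/1 vector per annotator
    return [1 if sum(votes) * 2 >= n else 0 for votes in zip(*rationales)]

print(majority_label(["hate speech", "hate speech", "offensive"]))  # hate speech
print(majority_label(["hate speech", "offensive", "normal"]))       # None
print(majority_token_mask([[1, 0, 1], [1, 0, 0], [0, 0, 1]]))       # [1, 0, 1]
```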

To evaluate how well our architecture performs in zero-shot environments, we used five publicly available offensive language detection datasets released as part of OffensEval 2020, presented in Table 3. Since these datasets have been annotated at the instance level, we followed the evaluation process as explained in Section 4.2.

Table 3 Language (Lang.), instances (Inst.), sources (S), and the source reference for each dataset

4.1 Experimental setup

MAD consists of two main parts, as depicted in Fig. 1. The first is the language modeling component, which runs masked language modeling (MLM) on the given dataset. By default, the modeler randomly masks \(15\%\) of the tokens in the dataset and considers sequences with a maximum length of 512. The model weights are stored and loaded into the second stage of the MAD framework (Sarkar, 2021). This second stage is the multi-task architecture presented in Section 3: it loads the model saved in the first stage and then performs MTL (Sarkar, 2021).
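The first stage can be approximated with the HuggingFace transformers library as in the sketch below; the base model and file path are placeholders, and the released MAD package may implement this stage differently:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# "corpus.txt" is a placeholder for the task dataset, one post per line
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus.txt", block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="mad-mlm"),
                  data_collator=collator, train_dataset=dataset)
trainer.train()
model.save_pretrained("mad-mlm")  # weights picked up by the second stage
```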

Fig. 1

The two components of the MAD framework. Section A depicts the language modeling component. Section B shows the multi-task aggression detection classifier – the post label predicts post-level aggression; token labels of 0 and 1 denote non-toxic and toxic tokens, respectively (Ranasinghe & Zampieri, 2021)

We train the system by minimising the cross-entropy loss for both constituent tasks, as defined in Eq. 5, where \(\textbf{y}_{post}\) and \(\textbf{y}_{token}\) represent ground-truth label vectors (one-hot encodings of the label integers). These per-task losses are:

$$\begin{aligned} \mathcal {L}_{post}&=-\sum \limits ^3_{i=1} \left( \textbf{y}_{post} \odot \log ( \hat{\textbf{y}}_{post} ) \right) [i] \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{token}&=-\sum \limits ^2_{i=1} \left( \textbf{y}_{token} \odot \log ( \hat{\textbf{y}}_{token} ) \right) [i] \end{aligned}$$
(4)

where \(\textbf{v}[i]\) retrieves the ith item in a vector \(\textbf{v}\) and \(\odot \) indicates element-wise multiplication. In combining the above two losses into one objective, we introduced \(\alpha \) and \(\beta \) parameters to balance the importance of the tasks. To assign equal importance to each task in our experiments, we set \(\alpha = \beta = 1\) in this study. The full loss is:

$$\begin{aligned} \mathcal {L}_{MAD} = \frac{\alpha \mathcal {L}_{post} + \beta \mathcal {L}_{token} }{\alpha + \beta } {.} \end{aligned}$$
(5)
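A minimal sketch of this combined objective, assuming standard PyTorch cross-entropy and an ignore_index convention for padded token positions (an implementation detail not specified in the text):

```python
import torch.nn.functional as F

def mad_loss(post_logits, token_logits, post_labels, token_labels,
             alpha=1.0, beta=1.0):
    # Eq. 3: post-level cross-entropy over the three labels
    l_post = F.cross_entropy(post_logits, post_labels)
    # Eq. 4: token-level cross-entropy; positions labeled -100
    # (padding/special tokens) are ignored -- an assumed convention
    l_token = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                              token_labels.reshape(-1), ignore_index=-100)
    # Eq. 5: weighted combination with alpha = beta = 1 in our experiments
    return (alpha * l_post + beta * l_token) / (alpha + beta)
```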

We set up two STL baselines – post-level and token-level aggression detection models, each based on neural transformers. The post-level STL model takes the complete sentence as input and predicts the aggression label – “normal”, “offensive”, or “hate speech” – using a softmax classifier on top of the CLS token (activation vector). The token-level STL model predicts whether each token (word) in the sentence is toxic, also using a softmax classifier. We performed experiments using the BERT-base-cased (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019b) transformer model variants available in the HuggingFace model repository. We also performed experiments using BERT-base-cased and RoBERTa-base models retrained with MLM on the HatEval (Basile et al., 2019) and OLID (Zampieri et al., 2019a) datasets; these domain-adapted models are denoted by the H\(_2\)O suffix. Furthermore, we used the recently released fBERT model (Sarkar et al., 2021), a BERT-base-cased model retrained on over 1.4 million offensive instances from SOLID (Rosenthal et al., 2021).

For all of the experiments, we optimized parameters with the AdamW update rule using a learning rate of \(1e-4\), a maximum sequence length of 128, and a batch size of 16 samples. Early stopping was executed if the validation loss did not improve over 10 evaluations. The models were trained on a 16 GB Tesla P100 GPU for three epochs. All experiments were run with ten different random seeds, and we report the mean and standard deviation across these runs. We did not perform any data pre-processing and used the datasets as published.
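For reference, the hyperparameters above are collected into one illustrative configuration below; the early-stopping helper is a sketch of the criterion described, not the code actually used:

```python
config = {
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "max_seq_length": 128,
    "batch_size": 16,
    "epochs": 3,
    "early_stopping_patience": 10,  # evaluations without val-loss improvement
    "num_seeds": 10,                # mean and std reported over these runs
}

def should_stop(val_losses, patience=10):
    """Stop when the last `patience` validation losses set no new minimum."""
    if len(val_losses) <= patience:
        return False
    return min(val_losses[-patience:]) >= min(val_losses[:-patience])
```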

Finally, to better cope with class imbalance, we have chosen the macro F\(_1\) score as the evaluation measure for all tasks. For the post-level evaluation, we used a macro F\(_1\) score that is computed as a mean of per-class F\(_1\) scores, as shown below:

$$\begin{aligned} F_{1} = \frac{F_{1}(\textit{Off}) + F_{1}(\textit{Hate}) + F_{1}(\textit{Normal})}{3} {.} \end{aligned}$$
(6)

If the total number of instances is n, the final aggregated F\(_1\) score A for the token-level task is:

$$\begin{aligned} A = \frac{1}{n}\sum _{i=1}^{n} F_{1}(\textit{Per Instance}) {.} \end{aligned}$$
(7)
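Both measures can be sketched with scikit-learn as below; treating the per-instance token-level score as a macro F\(_1\) over the two token classes is our assumption, since Eq. 7 does not fix the averaging mode:

```python
from sklearn.metrics import f1_score

def post_macro_f1(y_true, y_pred):
    # Eq. 6: mean of per-class F1 for offensive, hate speech, and normal
    return f1_score(y_true, y_pred, average="macro")

def token_aggregated_f1(true_seqs, pred_seqs):
    # Eq. 7: F1 computed per post over its token labels, then averaged
    # (macro averaging per instance is assumed here)
    scores = [f1_score(t, p, average="macro")
              for t, p in zip(true_seqs, pred_seqs)]
    return sum(scores) / len(scores)
```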

4.2 Results and analysis

In this section, we answer each of our following research questions (RQs).

  • RQ1 - Performance: Can MTL outperform STL in (a) token-level and post-level offensive language identification, and (b) post-level offensive language identification and related tasks?

  • RQ2 - Speed: Is the proposed MTL approach, in which two tasks are learned jointly, faster than separate STL models for post-level and token-level offensive language identification?

  • RQ3 - Efficiency: Can MTL learn from fewer training samples compared to an STL setup?

  • RQ4 - Generalizability: How well does MTL perform in different domains and languages in zero-shot environments compared to STL models?

4.2.1 Supervised learning

We begin by answering RQ1. We train our MAD framework on the HateXplain training set and evaluate it on the test set. Table 5 compares these results with the STL setup. We achieve the best results at both the token level and the post level with our MAD framework. The fBERT model achieves the overall highest macro F\(_1\) score for token-level aggression detection using MAD. The RoBERTa-H\(_2\)O model achieves a macro F\(_1\) score of 0.6949 at the post level using the proposed multi-task learning framework. The retrained language models with MTL achieve better results than the STL models across tasks. Based on these results, we can empirically conclude that MTL outperforms STL in both token-level and post-level offensive language identification by sharing information across the two tasks.

Table 4 Performance comparison of two STL models and the MAD framework model
Fig. 2

The test F\(_1\) scores with an increasing number of training samples for RoBERTa-H\(_{2}\)O in the STL and MAD setups

To answer RQ2, we compared the performance of the STL baseline models with our MAD models with respect to computing resources. The results are shown in Table 4. We observe that the MAD framework model outperformed the two STL models combined on every metric measured. MTL uses less RAM than the token-level model, and its training time per epoch is lower than that of the post-level model. This demonstrates that MTL is faster and more resource-efficient than separate STL models for post-level and token-level offensive language identification, an insight that should prove beneficial for real-world applications.

4.2.2 Few-shot learning

One advantage of multi-task learning is that less data is required to generalize, since information is shared across related tasks; MTL thus reduces the strict need for a large, labeled dataset (Caruana, 1997). Motivated by this potential benefit, we answer our third RQ (Can MTL learn from fewer training instances?) by comparing the performance of the MAD framework with the STL baseline models when the number of training instances is limited. We conduct these experiments (see Fig. 2 for the resulting plot) with the RoBERTa-H\(_{2}\)O model, which performed best in the previous experiment.

Figure 2 shows that MTL consistently outperforms STL across varying numbers of training instances for both the post-level and token-level aggression detection tasks. This result demonstrates the generalization ability of MTL even when few labeled instances are available. We conclude that the MTL setup performs much better when the number of samples is limited, which is particularly relevant for low-resource language problems.

4.2.3 Zero-shot learning

We answer our fourth RQ by evaluating the MTL approach in the zero-shot setting, comparing it with heuristics based on STL. We use the datasets described in Section 4. Since these datasets only contain annotations at the post-level, we carried out the evaluation only at the post-level.

We consider three approaches in the zero-shot setting: (1) We train a post-level offensive language identification (transformer) model – a softmax layer is added on top of the CLS token. We train on the HateXplain post-level annotations and save the weights, then perform zero-shot prediction on the OffensEval 2020 languages using the saved weights (this model is named Post\({_{zeroshot}}\)). Since these datasets are labeled as Offensive/Not Offensive, we merge the predicted offensive and hate speech labels into a single offensive label. (2) We train a span-level offensive language identification model based on transformers using MUDES (Ranasinghe & Zampieri, 2021). We train the model on the HateXplain token-level annotations and save the weights, then run the model on the OffensEval 2020 languages; if the model predicts at least one offensive token, our system labels the post as offensive. This is consistent with the OLID annotation guidelines (Zampieri et al., 2019a). We call this model Token\({_{zeroshot}}\). (3) We train our MTL architecture on the HateXplain post-level and token-level annotations and save the weights, then perform zero-shot prediction at the post level on the OffensEval 2020 languages, merging the predicted offensive and hate speech labels into a single offensive label as in (1). We call this model MTL\({_{zeroshot}}\).
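The label mappings behind these heuristics reduce to a few lines, sketched below; the function names are ours, not from the released code:

```python
def collapse_post_label(label):
    # Post_zeroshot and MTL_zeroshot: merge 'offensive' and 'hate speech'
    # into a single offensive (OFF) label
    return "OFF" if label in {"offensive", "hate speech"} else "NOT"

def token_heuristic(token_predictions):
    # Token_zeroshot: a single predicted offensive token makes the whole
    # post offensive, following the OLID annotation guidelines
    return "OFF" if any(t == "toxic" for t in token_predictions) else "NOT"

print(collapse_post_label("hate speech"))                    # OFF
print(token_heuristic(["non-toxic", "toxic", "non-toxic"]))  # OFF
```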

Table 5 The test set mean macro F\(_1\) scores of 10 runs as well as the standard deviation for different transformer models

For OLID (Zampieri et al., 2019a), we used the best model obtained on the HateXplain dataset – the RoBERTa-H\(_{2}\)O model. Following the strong cross-lingual offensive language identification results obtained with XLM-R, we used the xlm-RoBERTa-base (Conneau et al., 2020) model for the non-English datasets. Since this is a purely zero-shot setup, we do not compare our results to systems that were specifically trained on these datasets (Tables 5 and 6).

Table 6 Data properties: number of instances (Inst.), data sources (S), and label types in all datasets
Table 7 Results ordered by macro F\(_1\) for the Arabic (AR), Danish (DA), English (EN), Greek (GR), and Turkish (TR) datasets in the zero-shot experiments

As observed in Table 7, MTL\({_{zeroshot}}\) outperforms the other zero-shot approaches based on STL for all of the languages. This affirmatively answers our fourth and final research question: MTL outperforms STL models in zero-shot scenarios, demonstrating strong generalization ability.

5 Related tasks

Data

We used four publicly available datasets: ComMA (Kumar et al., 2021), GermEval (Risch et al., 2021), Hahackathon (Meaney et al., 2021a), and OSACT4 (Mubarak et al., 2020), summarized in Table 6.

5.1 Experimental setup

For these datasets, we also employ the MAD framework, as shown in Fig. 1. However, since these related tasks contain only post-level labels, we did not use the token head in the MTL architecture; instead, we used multiple post-level heads. We train our MTL model by minimising the cross-entropy loss for all of the constituent tasks. The losses for all tasks are then combined into one objective, and we assign equal importance to each task in our experiments. The full loss is:

$$\begin{aligned} \mathcal {L}_{MAD} = \frac{\sum ^n_{j=1}\mathcal {L}_{j}}{n} \end{aligned}$$
(8)

where n is the number of tasks and \(\mathcal {L}_{j}\) is the loss function associated with task j.
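A minimal sketch of this related-task setup, assuming one linear head per post-level task on top of the shared encoder (head names and sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadPostModel(nn.Module):
    """Shared encoder with one post-level classification head per task."""

    def __init__(self, encoder, num_labels_per_task):
        super().__init__()
        self.encoder = encoder  # shared transformer, as in Fig. 1
        hidden = encoder.config.hidden_size
        self.heads = nn.ModuleList(nn.Linear(hidden, k)
                                   for k in num_labels_per_task)

    def forward(self, input_ids, attention_mask):
        h_cls = self.encoder(input_ids=input_ids,
                             attention_mask=attention_mask
                             ).last_hidden_state[:, 0, :]
        return [head(h_cls) for head in self.heads]

def related_task_loss(logits_per_task, labels_per_task):
    # Eq. 8: uniform average of the n per-task cross-entropy losses
    losses = [F.cross_entropy(z, y)
              for z, y in zip(logits_per_task, labels_per_task)]
    return sum(losses) / len(losses)
```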

Similar to the previous experiments, we compared the MTL architecture with STL post-level baselines, where the STL model takes the complete sentence as input and predicts the post label using a softmax classifier on top of the CLS token (activation vector). We performed experiments using multilingual transformer models, such as mBERT and XLM-R, as well as monolingual transformer models trained specifically for each language. For ComMA, we used IndicBERT (Kakwani et al., 2020), which supports Bengali, Hindi, and English. For GermEval we used gBERT (Chan et al., 2020) and gELECTRA (Chan et al., 2020), while for OSACT4 we used AraBERT and AraElectra.

As in the previous set of experiments, for all tasks in this section we optimized parameters with AdamW using a learning rate of \(1e-4\), a maximum sequence length of 128, and a batch size of 16 samples. Early stopping was executed if the validation loss did not improve over 10 evaluations. The models were trained on a 16 GB Tesla P100 GPU for three epochs. The results of neural transformer models can depend heavily on the initial weights and, more importantly, on the experimental setup (Ein-Dor et al., 2020). The standard procedure to address this variation is to run the transformer model with different random seeds and report the mean and standard deviation of multiple runs (Risch & Krestel, 2020; Ein-Dor et al., 2020; Mosbach et al., 2021). Recent literature suggests that running experiments ten times provides more reliable results (Sellam et al., 2022); therefore, all experiments were run with ten different random seeds, and we report the mean and standard deviation across these runs. For the evaluation of each task, we used the same evaluation metrics used by the authors of the original datasets (Table 8).

Table 8 Performance of three STL models versus the MTL model for ComMA with mBERT

5.2 Results and analysis

To evaluate our proposed multi-task learning model, we ran experiments on all four datasets described above.

Again, we start by answering RQ1: we compare the results of the MTL and STL models across all four datasets. We train each model on the training set of each dataset and evaluate it on the corresponding test set. In Table 9 we present the mean results of ten runs along with the standard deviation.

Table 9 Micro F\(_1\) scores of different models for the four datasets, except for Hahackathon Task 2, where the results reported are with respect to RMSE

We observe that MTL consistently outperforms STL in all tasks across all of the datasets. For ComMA, mBERT (Devlin et al., 2019) with MTL performed best across all the tasks. For GermEval 2021, the gElectra (Chan et al., 2020) model with MTL outperformed all of the other models. For Hahackathon, fBERT (Sarkar et al., 2021) with MTL performed best and, finally, for OSACT4 2020, AraBERT with MTL produced the best results across all tasks. Note that, for all of the transformer models we experimented with, MTL variants achieve better performance than STL ones.

To answer RQ2, we compared the performance of the STL baseline models with our MTL models with respect to computing resources (using the best transformer model of each dataset). The results of this comparison are shown in Table 8 – observe that the model used within the proposed MTL framework outperforms the STL models combined for every metric across all datasets.

6 Conclusion

In this paper, we introduced MAD, a multi-task architecture based on neural transformers, and evaluated it across several key training setups. To the best of our knowledge, this is the first empirical evaluation of MTL in both post-level and token-level offensive language identification.

This work demonstrates that the proposed MTL model outperforms STL models on both token-level and post-level offensive language identification (RQ1). We also demonstrated that our MTL model uses fewer resources (in terms of RAM usage, GPU usage, and training and inference time) than the two STL models combined, showing that MTL could prove valuable for practical applications (RQ2). Furthermore, we experimented with MTL in a few-shot setup and demonstrated that it outperforms STL models when the amount of available training data is small, confirming that MTL could prove useful even for low-resource language problems (RQ3). In the zero-shot setup, we showed that MTL outperforms STL-based heuristics across five different datasets, showcasing that MTL models generalize better across datasets than STL models (RQ4).

Finally, as only a single machine learning model is produced by MTL, compared to the multiple models produced in a standard STL approach, our proposed approach not only achieves higher performance but is also faster than STL. MTL is, therefore, environmentally friendlier, as it demands fewer computing resources and less energy than STL. This efficient use of resources addresses recent efforts in Green AI (Schwartz et al., 2020) and the recent ACL Policy Document on Efficient NLP.

With respect to future work, we would like to expand MTL-based offensive language identification to more languages and domains by annotating additional datasets. We believe that our MTL systems improve interpretability as well as generalizability over current state-of-the-art post-level offensive language identification models, offering a powerful transformer-based framework for developing future offensive content identification systems and applications. Finally, we would like to use MTL to explore the interplay between offensive content and sarcasm, as in the recent HaHackathon competition at SemEval-2021 (Meaney et al., 2021b). BERT-based models have been successfully applied to sarcasm detection (Castro et al., 2019; Pandey & Singh, 2023), suggesting that the approach presented in this paper could achieve good performance on identifying sarcasm in an MTL setting.