1 Introduction

The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the AI and NLP communities, motivating the development of various systems trained to automatically detect potentially harmful content (Ridenhour et al. 2020). Even though thousands of languages and dialects are widely used in social media, the clear majority of these studies consider English only. This is evidenced by the creation of many offensive language resources for English such as annotated datasets (Rosenthal et al. 2021), lexicons (Bassignana et al. 2018), and pre-trained models (Sarkar et al. 2021).

More recently, researchers have turned their attention to the problem of offensive content in other languages such as Arabic (Mubarak et al. 2021), French (Chiril et al. 2019), Greek (Pitenis et al. 2020), and Portuguese (Fortuna et al. 2019), to name a few. In doing so, they have created new datasets and resources for each of these languages. Competitions such as OffensEval (Zampieri et al. 2020) and TRAC (Kumar et al. 2020) provided multilingual datasets compiled and annotated using the same methodology. The availability of multilingual datasets has made it possible to explore data augmentation methods (Ghadery and Moens 2020), multilingual word embeddings (Pamungkas and Patti 2019), and cross-lingual contextual word embeddings (Ranasinghe and Zampieri 2020).

In this paper, we revisit the task of offensive language identification for low-resource languages, that is, languages for which few or no corpora, datasets, and language processing tools are available. Our work focuses on Marathi, an Indo-Aryan language spoken by over 80 million people, most of whom live in the Indian state of Maharashtra. Even though Marathi is spoken by a large population, it is relatively low-resourced compared to other languages spoken in the region, most notably Hindi, the most similar language to Marathi. We collect and annotate data from Twitter to create the largest Marathi offensive language identification dataset to date. Furthermore, we train a number of state-of-the-art computational models on this dataset and evaluate the results in detail, which makes this paper the first comprehensive evaluation of Marathi offensive language online.

This paper presents the following contributions:

  1. We release MOLD 2.0, the largest annotated Marathi Offensive Language Dataset to date. MOLD 2.0 contains more than 3600 tweets annotated using the popular OLID (Zampieri et al. 2019) three-level hierarchical annotation schema: (A) Offensive Language Detection, (B) Categorization of Offensive Language, and (C) Offensive Language Target Identification.

  2. We experiment with several machine learning models, including state-of-the-art transformer models, to predict the type and target of offensive tweets in Marathi. To the best of our knowledge, the identification of types and targets of offensive posts has not been attempted for Marathi.

  3. We explore offensive language identification with cross-lingual embeddings and transfer learning. We take advantage of existing data in high-resource languages such as English and Hindi to project predictions to Marathi. We show that transfer learning can improve the results on Marathi, which could benefit a multitude of low-resource languages.

  4. Finally, we investigate semi-supervised data augmentation. We create SeMOLD, a larger semi-supervised dataset with more than 8000 instances for Marathi. We use multiple machine learning models trained on the annotated training set and combine the scores following a methodology similar to that described in Rosenthal et al. (2021). We show that this semi-supervised dataset can be used to augment the training set, which leads to improved results for machine learning models.

The development of MOLD 2.0 and SeMOLD opens exciting new avenues for research in Marathi offensive language identification. With these two resources, we aim to answer the following research questions:

  • RQ1: To what extent is it possible to identify types and targets of offensive posts in Marathi?

  • RQ2: Our second research question addresses data scarcity, a known challenge for low-resource NLP. We divide it in two parts as follows:

    • RQ2.1: How does data size influence performance in Marathi offensive language identification?

    • RQ2.2: Do available resources from resource-rich languages, combined with transfer-learning techniques, aid the identification of types and targets in Marathi offensive language identification?

Previous work (Gaikwad et al. 2021) has addressed the identification of offensive posts in Marathi, but the types and targets of offensive posts, the core part of the popular OLID taxonomy (Zampieri et al. 2019), have not been addressed for Marathi. Finally, with respect to data size and transfer learning, we draw inspiration from recent work that applied cross-lingual models to low-resource offensive language identification (Ranasinghe and Zampieri 2020, 2021), applying it to Marathi.

2 Related work

The problem of offensive content online continues to attract attention within the AI and NLP communities. In recent studies, researchers have developed systems to identify whether a post or part thereof is considered offensive (Ranasinghe et al. 2021) or to predict whether conversations will go awry (Zhang et al. 2018). Popular international competitions on the topic have been organized at conferences such as HASOC (Mandl et al. 2019; Modha et al. 2021), HatEval (Basile et al. 2019), OffensEval (Zampieri et al. 2020), and TRAC (Kumar et al. 2018, 2020). These competitions attracted a large number of participants and provided them with various important benchmark datasets.

A variety of computing models have been proposed to tackle offensive content online ranging from classical machine learning classifiers such as SVMs with feature engineering (Dadvar et al. 2013; Malmasi and Zampieri 2017) to deep neural networks combined with word embeddings (Aroyehun and Gelbukh 2018; Hettiarachchi and Ranasinghe 2019). With the recent development of large pre-trained transformer models such as BERT and XLNET (Devlin et al. 2019; Yang et al. 2019), several studies have explored the use of general pre-trained transformers (Liu et al. 2019; Ranasinghe and Hettiarachchi 2020) while others have worked on fine-tuning models on offensive language corpora such as fBERT (Sarkar et al. 2021).

In terms of languages, due to the availability of suitable datasets, the vast majority of studies in offensive language identification use English data (Yao et al. 2019; Ridenhour et al. 2020). In the past few years, however, more offensive language datasets have been created for languages other than English such as Arabic (Mubarak et al. 2021), Dutch (Tulkens et al. 2016), French (Chiril et al. 2019), German (Wiegand et al. 2018), Greek (Pitenis et al. 2020), Italian (Poletto et al. 2017), Portuguese (Fortuna et al. 2019), Slovene (Fišer et al. 2017), Turkish (Çöltekin 2020), and many others. To the best of our knowledge, the only Marathi dataset available to date is the aforementioned Marathi Offensive Language Dataset (MOLD) (Gaikwad et al. 2021), a manually annotated dataset containing nearly 2500 tweets. Our work builds on MOLD by applying the same data collection methods to expand it in terms of both size and annotation.

Finally, multilingual offensive language identification is a recent trend that takes advantage of large pre-trained cross-lingual and multilingual models such as XLM-R (Conneau et al. 2019). Using this architecture, it is possible to leverage available English resources to make predictions in languages with fewer resources, helping to cope with data scarcity in low-resource languages (Ranasinghe and Zampieri 2020; Ranasinghe et al. 2021).

3 Data collection

MOLD 2.0 builds on the research presented in Gaikwad et al. (2021) which introduced MOLD 1.0. The annotation of both MOLD 1.0 and MOLD 2.0 follows the OLID annotation taxonomy which includes three levels (labels in brackets):

  • Level A: Offensive (OFF)/Non-offensive (NOT).

  • Level B: Classification of the type of offensive (OFF) tweet—Targeted (TIN)/Untargeted (UNT).

  • Level C: Classification of the target of a targeted (TIN) tweet—Individual (IND)/Group (GRP) or Other (OTH).
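The hierarchy constrains which label combinations can occur: Level B applies only to OFF tweets, and Level C only to TIN tweets. The following Python sketch checks that constraint. The function name `is_valid_olid` and the use of `None` for a level that is not annotated are illustrative assumptions, not part of the released dataset format.

```python
from typing import Optional

LEVEL_A = {"OFF", "NOT"}
LEVEL_B = {"TIN", "UNT"}
LEVEL_C = {"IND", "GRP", "OTH"}


def is_valid_olid(a: str, b: Optional[str] = None, c: Optional[str] = None) -> bool:
    """Return True if (a, b, c) respects the three-level hierarchy.

    Hypothetical helper: None marks a level that is not annotated.
    """
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry neither a type nor a target label.
        return b is None and c is None
    if b not in LEVEL_B:
        # Offensive tweets must be typed as TIN or UNT.
        return False
    if b == "UNT":
        # Untargeted offenses carry no target label.
        return c is None
    # Targeted offenses must name one of the three targets.
    return c in LEVEL_C
```

For example, `is_valid_olid("OFF", "TIN", "GRP")` holds, while `is_valid_olid("OFF", "UNT", "IND")` does not.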

Our initial dataset (MOLD 1.0) consisted of nearly 2,500 tweets. As shown in Table 1, we collected 1,100 additional instances for MOLD 2.0 following the same methodology described in Gaikwad et al. (2021), resulting in a dataset of 3,611 tweets. Data collection was carried out with a data extraction script that used the Tweepy library along with the API provided by Twitter.

Table 1 MOLD v2.0—distribution of label combinations

As MOLD 1.0 was only annotated on OLID Level A, in MOLD 2.0 we expand the annotation to the full three-level OLID taxonomy by annotating Level B and Level C. Examples from the dataset along with English translations are presented in Table 2. The annotation was carried out by three native speakers of Marathi. The annotators were a mix of male (1) and female (2) Master’s students working on the project. We provided the annotators with guidelines on how to annotate the data and supervised the process with periodic meetings to make sure they were correctly following the guidelines. We report an inter-annotator agreement of 0.79 Cohen’s kappa (Carletta 1996) on the three levels.
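For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below computes it for one pair of annotators in plain Python. How the agreement of three annotators was aggregated is not stated here; averaging the pairwise values is one common option and would be an assumption.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(r1) != len(r2) or len(r1) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(r1)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(r1, r2)) / n
    # Chance agreement: product of the annotators' label frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```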

Table 2 Four tweets from the dataset, with their labels for each level of the annotation schema

Finally, following the same methodology described for MOLD 2.0, we collected an additional 8000 instances from Twitter to create SeMOLD, a larger dataset with semi-supervised annotation presented in Sect. 6.

4 Experiments and evaluation

We experimented with several machine learning models trained on the training set, and evaluated by predicting the labels for the held-out test set. As the label distribution is highly imbalanced, we evaluate and compare the performance of the different models using macro-averaged F1-score. We further report per-class Precision (P), Recall (R), and F1-score (F1), and weighted average. Finally, we compare the performance of the models against simple majority and minority class baselines.
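To illustrate why macro-averaged F1 is informative under class imbalance, the sketch below computes it in plain Python together with a majority-class baseline. The function names are illustrative, and the toy label counts in the usage note are invented for the example rather than taken from MOLD 2.0.

```python
from collections import Counter
from typing import Dict, List, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1-score for every label seen in the gold or predicted labels."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def majority_baseline(train_labels: Sequence[str], n: int) -> List[str]:
    """Predict the most frequent training label for n test items."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n
```

With eight NOT and two OFF items, the majority baseline is 80% accurate but its macro-F1 is only 4/9, because the F1-score of the minority class is zero.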

4.1 SVC

Our simplest machine learning model is a linear support vector classifier (SVC) trained on word unigrams. Before the emergence of neural networks, SVCs achieved state-of-the-art results for many text classification tasks (Schwarm and Ostendorf 2005; Goudjil et al. 2018) including offensive language identification (Zampieri et al. 2019; Alakrot et al. 2018). Even in the neural network era, SVCs provide an efficient and effective baseline.
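As a rough illustration of the idea, the sketch below trains a linear classifier with hinge loss on word-unigram counts using plain subgradient descent. It is not the implementation used in our experiments: the whitespace tokenizer, the hyperparameters, and the English toy sentences are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter
from typing import Dict, List, Tuple


def unigrams(text: str) -> Counter:
    """Word-unigram counts; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def train_linear_svm(
    data: List[Tuple[str, str]],
    positive: str = "OFF",
    epochs: int = 20,
    lr: float = 0.1,
    reg: float = 0.01,
) -> Tuple[Dict[str, float], float]:
    """Hinge-loss linear classifier trained by subgradient descent."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            y = 1.0 if label == positive else -1.0
            feats = unigrams(text)
            score = bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())
            # L2 regularization shrinks every weight at each step.
            for t in weights:
                weights[t] *= 1.0 - lr * reg
            # Hinge loss: update only when the margin is violated.
            if y * score < 1.0:
                for t, c in feats.items():
                    weights[t] = weights.get(t, 0.0) + lr * y * c
                bias += lr * y
    return weights, bias


def predict(text: str, weights: Dict[str, float], bias: float,
            positive: str = "OFF", negative: str = "NOT") -> str:
    """Label a text by the sign of its linear score."""
    score = bias + sum(weights.get(t, 0.0) * c for t, c in unigrams(text).items())
    return positive if score >= 0 else negative


# Toy English data, invented for illustration only.
toy = [
    ("you are an idiot", "OFF"),
    ("what an idiot", "OFF"),
    ("have a nice day", "NOT"),
    ("what a nice day", "NOT"),
]
w, b = train_linear_svm(toy)
```

Tokens that occur only in offensive examples receive positive weights and tokens that occur only in non-offensive examples receive negative ones, so `predict("such an idiot", w, b)` yields OFF on this toy data.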

4.2 BiLSTM

As the first embedding-based neural model, we experimented with a bidirectional long short-term memory (BiLSTM) model, which we adopted from a pre-existing model for Greek offensive language identification (Pitenis et al. 2020). The model consists of (i) an input embedding layer, (ii) two bidirectional LSTM layers, and (iii) two dense layers. The output of the final dense layer is ultimately passed through a softmax layer to produce the final prediction. The architecture diagram of the BiLSTM model is shown in Fig. 1. Our BiLSTM layer has 64 units, while the first dense layer has 256 units.

Fig. 1 The BiLSTM model for Marathi offensive language identification. The labels are (a) input embeddings, (b, c) two BiLSTM layers, (d, e) fully connected layers, (f) softmax activation, and (g) final probabilities

4.3 CNN

We also experimented with a convolutional neural network (CNN), which we adopted from a pre-existing model for English sentiment classification (Kim 2014). The model consists of (i) an input embedding layer, (ii) 1-dimensional CNN layer (1DCNN), (iii) max pooling layer and (iv) two dense layers. The output of the final dense layer is ultimately passed through a softmax layer to produce the final prediction (Fig. 2).

Fig. 2 CNN model for Marathi offensive language identification. The labels are (a) input embeddings, (b) 1DCNN, (c) max pooling, (d, e) fully connected layers, (f) with dropout, (g) softmax activation, and (h) final probabilities

For the BiLSTM and CNN models presented above, we set three input channels for the input embedding layers: pre-trained Marathi FastText embeddings (Bojanowski et al. 2017), the Continuous Bag of Words model for Marathi (Kumar et al. 2020), and updatable embeddings learned by the model during training.

4.4 Transformers

Fig. 3 Transformer model for Marathi offensive language identification (Ranasinghe and Zampieri 2020)

Finally, we experimented with several pre-trained transformer models. With the introduction of BERT (Devlin et al. 2019), transformer models have achieved state-of-the-art performance in many natural language processing tasks (Devlin et al. 2019) including offensive language identification (Ranasinghe and Zampieri 2020; Ranasinghe et al. 2021; Sarkar et al. 2021). From an input sentence, transformers compute a feature vector \(\boldsymbol{h}\in\mathbb{R}^{d}\), upon which we build a classifier for the task. For this task, we implemented a softmax layer, i.e., the predicted probabilities are \(\boldsymbol{y}^{(B)}=\mathrm{softmax}(W\boldsymbol{h})\), where \(W\in\mathbb{R}^{k\times d}\) is the softmax weight matrix and \(k\) is the number of labels. In our experiments, we used three pre-trained transformer models available in the HuggingFace model hub (Wolf et al. 2020) that support Marathi: mBERT (Devlin et al. 2019), xlm-roberta-large (Conneau et al. 2019), and IndicBERT (Kakwani et al. 2020). The implementation was adopted from the DeepOffense Python library. The overall transformer architecture is shown in Fig. 3.
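The classification head can be written out directly from the formula \(\mathrm{softmax}(W\boldsymbol{h})\). The plain-Python sketch below is for illustration only: the function names and the tiny dimensions (\(d=2\), \(k=2\)) are assumptions, and in practice \(\boldsymbol{h}\) is the sentence representation produced by the transformer.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W: List[List[float]], h: List[float]) -> List[float]:
    """Return softmax(W h) for a k x d weight matrix W and a d-dim vector h."""
    logits = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]
    return softmax(logits)


# Toy example with d = 2 features and k = 2 labels (values are invented).
probs = classify([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

The predicted label is the index of the largest probability; for level C the matrix would have three rows, one per target class.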

For the transformer-based models, we employed a batch size of 16, the Adam optimizer with a learning rate of \(2\mathrm{e}{-5}\), and a linear learning rate warm-up over 10% of the training data. During the training process, the parameters of the transformer model, as well as the parameters of the subsequent layers, were updated. The models were evaluated while training using an evaluation set that had one fifth of the rows in the training data. We performed early stopping if the evaluation loss did not improve over three evaluation steps. All the models were trained for three epochs.
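The two training controls described above can be sketched in a few lines of Python. The function names are illustrative. Holding the learning rate constant after warm-up is an assumption made for this sketch, since the behaviour after the warm-up phase is not specified here (many implementations decay it linearly instead).

```python
from typing import List, Optional


def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 2e-5, warmup_fraction: float = 0.1) -> float:
    """Learning rate at a 0-indexed training step with linear warm-up.

    Assumption: the rate stays at base_lr once warm-up has finished.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def early_stop_step(eval_losses: List[float], patience: int = 3) -> Optional[int]:
    """Index of the evaluation step at which training stops, or None.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive evaluation steps.
    """
    best = float("inf")
    bad_steps = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_steps = 0
        else:
            bad_steps += 1
            if bad_steps >= patience:
                return i
    return None
```

With 100 training steps, the learning rate climbs linearly over the first 10 steps and reaches 2e-5 at step 9. An evaluation-loss history of 0.9, 0.7, 0.71, 0.72, 0.73 triggers early stopping at the fifth evaluation.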

4.5 Offensive language detection

The performance on discriminating between offensive (OFF) and non-offensive (NOT) posts is reported in Table 3. We can see that all models perform better than the majority baseline. As expected, transformer-based models outperform the other machine learning models. Among the transformer models, IndicBERT (Kakwani et al. 2020) outperforms general multilingual transformer models such as mBERT (Devlin et al. 2019) and xlm-roberta-large (Conneau et al. 2019), achieving a Macro-F1 score of 0.85 on the test set.

Table 3 Results for offensive language detection (level A)
Table 4 Results for offensive language categorization (level B)

4.6 Categorization of offensive language

In this set of experiments, the models were trained to discriminate between targeted insults and threats (TIN) and untargeted (UNT) offenses. The performance of the various machine learning models on this task is shown in Table 4. As for level A, transformer models outperformed the other machine learning models. Furthermore, IndicBERT (Kakwani et al. 2020) performs best among the transformer models, achieving a Macro-F1 score of 0.74.

4.7 Offensive language target identification

Here, the models were trained to distinguish between three targets: a group (GRP), an individual (IND), or others (OTH). In Table 5, we can see that all the models achieved similar results, far surpassing the random baselines, with a slight performance edge for the transformer models. As in the previous levels, IndicBERT performed best, achieving a Macro-F1 score of 0.65.

Table 5 Results for offense target identification (level C)

5 Transfer-learning experiments

The main idea of the methodology is that we train a classification model on a resource-rich language, typically English, using a cross-lingual language model, save the weights of the model, and, when we initialize the training process for Marathi, start with the saved weights from English. Previous work has shown that a similar transfer learning approach can improve the results for Arabic, Greek, and Hindi (Ranasinghe and Zampieri 2020, 2021). We only conducted transfer-learning experiments with the transformer models as they provided better results than the other embedding models in Sect. 4.

We first trained a transformer-based classification model on a resource-rich language. We used different resource-rich languages for each level, which we describe in the following sections. We then saved the weights of the transformer model as well as those of the softmax layer. We used these saved weights from the resource-rich language to initialize the weights for Marathi.
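The initialization step can be pictured as copying saved parameters into a freshly created model before fine-tuning on Marathi. The sketch below does this for parameters stored as nested Python lists. The parameter names, the `warm_start` helper, and the fallback to the fresh value when shapes differ are illustrative assumptions and do not describe the DeepOffense implementation.

```python
import copy
from typing import Any, Dict, Tuple


def shape(tensor: Any) -> Tuple[int, ...]:
    """Shape of a (possibly nested) list of numbers."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


def warm_start(source_state: Dict[str, Any], target_state: Dict[str, Any]) -> Dict[str, Any]:
    """Initialize target parameters from saved source-language weights.

    Parameters are copied when both name and shape match; otherwise the
    fresh target value is kept (an assumption made for this sketch).
    """
    initialized = {}
    for name, fresh in target_state.items():
        saved = source_state.get(name)
        if saved is not None and shape(saved) == shape(fresh):
            initialized[name] = copy.deepcopy(saved)
        else:
            initialized[name] = copy.deepcopy(fresh)
    return initialized


# Hypothetical saved weights from the resource-rich language.
source = {"encoder.w": [[1.0, 2.0], [3.0, 4.0]], "classifier.W": [[0.5, -0.5], [0.1, 0.9]]}
# Freshly initialized Marathi model with the same architecture.
target = {"encoder.w": [[0.0, 0.0], [0.0, 0.0]], "classifier.W": [[0.0, 0.0], [0.0, 0.0]]}
marathi_init = warm_start(source, target)
```

Training on the Marathi data then proceeds from `marathi_init` rather than from the original pre-trained checkpoint.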

5.1 Offensive language detection

For level A, we used datasets from two resource-rich languages. First, we used English, which can be considered the language with the most resources for offensive language identification. We specifically used the OLID (Zampieri et al. 2019) level A tweets, whose annotation is similar to level A of MOLD 2.0. In addition, in order to perform transfer learning from a language closely related to Marathi, we used a Hindi dataset from the HASOC 2020 shared task (Mandl et al. 2020). Both the English and Hindi datasets we used for the transfer learning experiments contain Twitter data, making them in-domain with respect to MOLD 2.0. The results are shown in Table 6.

Table 6 Transfer learning results for offensive language identification ordered by Macro (M)-F1 for MOLD 2.0

As can be seen, the use of transfer learning substantially improved the monolingual results for mBERT and XLM-R. However, the IndicBERT model, which performed best in the monolingual experiments, did not improve with the transfer learning approach. We believe that this may be because the IndicBERT model is not cross-lingual. The best cross-lingual results were obtained by the XLM-R model. Of the two languages from which we performed transfer learning, Hindi yielded better results than English, suggesting that language similarity played a positive role in transfer learning.

5.2 Categorization of offensive language

For level B, we used OLID level B as the initial task to train the transformer-based classification model. As far as we know, there are no datasets equivalent to MOLD 2.0 level B in languages related to Marathi such as Hindi and Bengali. Therefore, for level B, we only used English OLID level B as the initial task. The results are shown in Table 7.

Table 7 Transfer learning results for categorization of offensive language ordered by Macro (M)-F1 for MOLD 2.0

As can be seen in the results, transfer learning improved the results for level B in XLM-R and mBERT. Similar to level A, IndicBERT performance was not improved with transfer learning. XLM-R with transfer learning provided the best results with 0.75 Macro-F1 score.

5.3 Offensive language target identification

As there are no datasets equivalent to level C of MOLD 2.0 in related languages, we only used OLID level C as the initial dataset.

Table 8 Transfer learning results for offensive language target identification ordered by Macro (M)-F1 for MOLD 2.0

As can be seen in Table 8, transfer learning improved the results of XLM-R and mBERT. However, at this level too, transfer learning did not improve the performance of IndicBERT. Overall, XLM-R with transfer learning provided the best result with a Macro-F1 score of 0.74.

6 SeMOLD: semi-supervised data augmentation

For the semi-supervised experiments, we collected an additional 8,000 Marathi tweets using the same methods described in Sect. 3. Rather than labeling them manually, we followed the semi-supervised approach used to annotate SOLID (Rosenthal et al. 2021). We first selected the three best machine learning classifiers from Sect. 4: mBERT, XLM-R, and IndicBERT. Then, for each instance in the larger dataset, we saved the labels predicted by each machine learning model. We release this larger dataset as SeMOLD: Semi-supervised Marathi Offensive Language Dataset. We use filtered SeMOLD instances to augment the training set. We only performed the data augmentation experiments for the transformer models.

6.1 Offensive language detection

In the data augmentation process for level A, we added instances from SeMOLD for which at least two machine learning models predicted the same level A class. As can be seen in Table 9, training with MOLD+SeMOLD did not improve the level A results for the transformer models.
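The agreement filter can be sketched as a vote over the labels saved for each instance. The record layout (a dictionary with `text` and `labels` keys) and the function names are assumptions made for illustration, and the example tweets are placeholders.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def agreed_label(predictions: List[str], min_votes: int = 2) -> Optional[str]:
    """Return the label predicted by at least `min_votes` models, else None."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= min_votes else None


def filter_semold(instances: List[Dict]) -> List[Tuple[str, str]]:
    """Keep (text, label) pairs on which enough models agree.

    Each instance is assumed to hold one predicted label per model.
    """
    augmented = []
    for inst in instances:
        label = agreed_label(inst["labels"])
        if label is not None:
            augmented.append((inst["text"], label))
    return augmented


# Placeholder records, one label per model (mBERT, XLM-R, IndicBERT).
semold = [
    {"text": "tweet 1", "labels": ["OFF", "OFF", "NOT"]},
    {"text": "tweet 2", "labels": ["IND", "GRP", "OTH"]},
]
extra_training_data = filter_semold(semold)
```

With three models and two classes, at least two models always agree, so for a binary level this rule amounts to majority voting. It discards instances only when three classes are possible or when fewer labels are available for an instance.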

Table 9 Semi-supervised data augmentation results for offensive language identification in MOLD 2.0

This is similar to the previous experiments in data augmentation (Rosenthal et al. 2021) where the results do not improve when the machine learning classifier is already strong. We can assume that the transformer models are already well trained for MOLD and adding further instances to the training process would not improve the results for the transformer models.

6.2 Categorization of offensive language

For level B, the MOLD training set is smaller, and the task is also more complex than level A. Therefore, the machine learning models can benefit from adding more data. As can be seen in Table 10, all of the transformer models improve with data augmentation from SeMOLD instances. IndicBERT performed best with data augmentation, achieving a Macro-F1 score of 0.76.

Table 10 Semi-supervised data augmentation results for categorization of offensive language in MOLD 2.0

6.3 Offensive language target identification

Finally, for level C, the manually annotated MOLD dataset is even smaller, and the number of classes increases from two to three. As can be seen in Table 11, all the models improve with the data augmentation process. IndicBERT performed best after data augmentation, achieving a Macro-F1 score of 0.68.

Table 11 Semi-supervised data augmentation results for offensive language target identification in MOLD 2.0

7 Conclusion and future work

We presented a comprehensive evaluation of Marathi offensive language identification along with two new resources: MOLD 2.0 and SeMOLD. MOLD 2.0 contains over 3600 tweets annotated with OLID’s three-level annotation taxonomy, making it the largest manually annotated Marathi offensive language dataset to date. SeMOLD is a larger dataset of 8,000 instances annotated with semi-supervised methods. Both these resources open exciting new avenues for research on Marathi and other low-resource languages.

Our results show that it is possible to identify types and targets of offensive posts in Marathi with a relatively small dataset (answering RQ1). With respect to RQ2, we report that (1) the use of the larger dataset (SeMOLD) combined with MOLD 2.0 results in performance improvement, particularly for levels B and C where less data are available in MOLD (answering RQ2.1); and (2) transfer learning techniques from both English and Hindi result in performance improvement for Marathi in the three tasks (identification, categorization, and target identification) (answering RQ2.2). We believe that these results shed light on offensive language identification applied to Marathi and to low-resource languages more generally, particularly Indo-Aryan languages.

In future work, we would like to extend MOLD’s annotation to a fine-grained token-level annotation. This would allow us to jointly model both instance label and token annotation as in MUDES (Ranasinghe et al. 2021). Finally, we would like to use the knowledge and data obtained with our work on Marathi and expand it to closely related Indo-Aryan languages such as Konkani.