
1 Introduction

Social media is undoubtedly one of the greatest innovations of all time. From connecting with people across the globe to sharing information and knowledge in a fraction of a second, social media platforms have tremendously changed the way we live. Ever-increasing usage, cheaper smartphones, and easier internet access have further paved the way for the massive growth of social media. To put this into numbers, as per a recent report (footnote 1), more than 4 billion people around the world now use social media each month, and on average nearly 2 million new users join every day.

While social media platforms have allowed us to connect with others and strengthen relationships in ways that were not possible before, sadly, they have also become default forums for holding high-stakes conversations, blasting polarizing opinions, and making statements with little regard for those within the screenshot. The recent increase in instances of online toxicity has created a dire need for adequate and appropriate guidelines to prevent and curb such activities. The foremost task in neutralising them is hostile post detection. So far, many works have addressed the issue in English [18, 28] and several other languages [2, 16]. Although Hindi is the third largest language in terms of speakers and has a significant presence on social media platforms, considerable research on hate speech or fake content in Hindi is still hard to find. A survey of the literature reveals a few works on hostile post detection in Hindi, such as [9, 25]; however, these works are either limited by an inadequate number of samples or restricted to a specific hostility domain.

A comprehensive approach for hostility detection in posts written in Devanagari script is presented in [1], where the authors emphasize multi-dimensional hostility detection and release the dataset as a shared task in the Constraint-2021 Workshop. This paper presents a transfer learning based approach to detect hostile content in Hindi by leveraging pre-trained models, with our experiments based on this dataset. The experiments are subdivided into two tasks: a coarse-grained task, hostile vs. non-hostile classification, and fine-grained subtasks, the sub-categorization of hostile posts into fake, hate, defamation, and offensive.

Our contribution comprises the following improvements upon the baseline:

  1. We fine-tune transformer-based pre-trained Hindi language models to obtain domain-specific contextual embeddings, which are then used in the classification tasks.

  2. We incorporate the fine-tuned hostile vs. non-hostile detection model as an auxiliary model and fuse it with the features of the subcategory-specific models (pre-trained models) of the hostility categories, with further fine-tuning.

Apart from this, we also present a comparative analysis of the various approaches we experimented with on the dataset. The code and trained models are available at this https URL (footnote 2).

2 Related Work

In this section, we discuss relevant work in NLP on pre-trained-model-based text classification and on hostile post detection, particularly for Indian languages.

Pre-trained Language Models in Text Classification

Pre-trained transformers serve as general language understanding models that can be used in a wide variety of downstream NLP tasks. Several transformer-based language models such as GPT [23], BERT [5], and RoBERTa [14] have been proposed. Pre-trained contextualized vector representations of words, learned from vast amounts of text data, have shown promising results in text classification. Transfer learning from these models has proven particularly useful in tasks where there is a lack of undisputed labeled data and where surface features fail to capture the subtle semantics of the text, as in the case of hate speech [15]. However, all these pre-trained models require large amounts of monolingual corpora to train on. Nonetheless, Indic-NLP [11] and Indic-Transformers [8] have curated datasets, trained embeddings, and created classification benchmarks for multiple Indian languages, including Hindi. [10] presents a comparative study of various classification techniques for Hindi, demonstrating the effectiveness of pre-trained sentence embeddings in classification tasks.

Hostile Post Detection

Researchers have been studying hate speech on social media platforms such as Twitter [29], Reddit [17], and YouTube [19] for the past few years. More recently, researchers have also focused on the bias derived from hate speech training datasets [3]. Among other notable works on hostility detection, Davidson et al. [4] studied hate speech detection for English. They argued that some words might reflect hate in one region, while the same word can be used as a frequent slang term elsewhere. For example, in English the term ‘dog’ does not convey any hate or offense, but its Hindi counterpart (ku##a) is commonly used as a derogatory term. Considering the severity of the problem, some efforts have been made in non-English languages as well [2, 7, 16, 25]. Bhardwaj et al. [1] proposed the multi-dimensional hostility detection dataset in Hindi on which we focus in our experiments. Apart from this, there are also a few attempts at Hindi-English code-mixed hate speech detection [26].

3 Methodology

In the following subsections, we briefly discuss the various methodologies used in our experiments. Each subsection describes an independent approach used for classification and sub-classification tasks. Our final approach is discussed in Sect. 3.4.

3.1 Single Model Multi-label Classification

In this approach, we treat the problem as a multi-label classification task. We use a single model with shared parameters for all classes to capture correlations amongst them. We fine-tune a pre-trained BERT transformer model to obtain contextualized embeddings (representations) via its attention mechanism. We experimented with three different pre-trained BERT-style transformer blocks, namely Hindi BERT (a compressed form of BERT) [6], Indic BERT (based on the ALBERT architecture) [11], and the HindiBERTa model [24]. The loss function used in this approach can be formulated as:

$$\begin{aligned} L(\hat{y},y) = - \sum _{j=1}^{c}\left[ y_{j}\log \hat{y}_{j} + (1 - y_{j})\log (1-\hat{y}_{j})\right] \end{aligned}$$
$$\begin{aligned} J(W^{(1)}, b^{(1)}, \ldots ) = \frac{1}{m} \sum _{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \end{aligned}$$

where c is the number of classes (i.e. non-hostile, fake, hate, defamation, offensive) and m is the total number of training examples.
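To make the setup concrete, the following is a minimal sketch of the single-model multi-label classifier in PyTorch/HuggingFace. The checkpoint name (here ai4bharat/indic-bert), the head structure, and the example inputs are illustrative assumptions, not the released training script.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelClassifier(nn.Module):
    def __init__(self, model_name="ai4bharat/indic-bert", num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.3)
        # one shared head producing a logit per class
        # (non-hostile, fake, hate, defamation, offensive)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # first-token ([CLS]) output
        return self.classifier(self.dropout(cls))   # raw logits, one per class

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = MultiLabelClassifier()
loss_fn = nn.BCEWithLogitsLoss()  # per-class binary cross-entropy, as in the loss above

batch = tokenizer(["<hindi post>"], truncation=True, max_length=200,
                  padding=True, return_tensors="pt")
labels = torch.tensor([[0., 1., 0., 0., 1.]])       # multi-hot target vector
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
```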

3.2 Multi-task Classification

In this approach, we treat the classification tasks as a multi-task classification problem. As described in Fig. 1(a), we use a shared BERT model and individual classifier layers, trained jointly with a heuristic loss. This is done to capture correlations between the task and subtasks through the contextualized embeddings of the shared BERT model, while keeping the classification layers independent. We experimented with Indic-BERT and HindiBERTa (we dropped the Hindi BERT model in this approach, as its performance was poor compared to the other two models because of its shallower architecture). The heuristic loss can be formulated as:

$$\begin{aligned} L = l(x, y) = \{ l_1, \ldots , l_N\}^{T} \end{aligned}$$

where,

$$\begin{aligned} l_n = -w_n \left[ y_n \cdot \log \sigma (x_n) + (1- y_n) \cdot \log (1- \sigma (x_n))\right] \end{aligned}$$
$$\begin{aligned} L_{total} = L_{\text{hostile/non-hostile}} + \lambda \cdot \frac{1}{N} \sum _{t \in \{\text{hate, defame, fake, offensive}\}} L_{t} \end{aligned}$$

where \(\lambda = 0.5\) if the post is hostile (so that the fine-grained losses contribute) and \(\lambda = 0\) otherwise.
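A minimal sketch of this multi-task setup follows, assuming a shared encoder with one binary head per task and the heuristic loss above; the checkpoint name, head structure, and lambda gating reflect our reading of the description rather than the released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

TASKS = ["hostile", "fake", "hate", "defamation", "offensive"]

class MultiTaskClassifier(nn.Module):
    def __init__(self, model_name="ai4bharat/indic-bert"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # individual classifier layer per task, sharing a single encoder
        self.heads = nn.ModuleDict({
            t: nn.Sequential(nn.Dropout(0.3), nn.Linear(hidden, 1)) for t in TASKS
        })

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return {t: head(cls).squeeze(-1) for t, head in self.heads.items()}

def heuristic_loss(logits, labels):
    """labels: dict of 0/1 float tensors keyed by task name."""
    bce = nn.BCEWithLogitsLoss(reduction="none")
    coarse = bce(logits["hostile"], labels["hostile"])       # per-example coarse loss
    fine = torch.stack([bce(logits[t], labels[t])
                        for t in TASKS[1:]]).mean(dim=0)      # 1/N over the subtasks
    lam = 0.5 * labels["hostile"]   # lambda = 0.5 for hostile posts, 0 otherwise
    return (coarse + lam * fine).mean()
```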

3.3 Binary Classification

Unlike the previous two approaches, here we treat each classification task as an individual binary classification problem based on fine-tuned contextualized embeddings. We fine-tune the BERT transformer block and the classifier layer above it using the binary target labels for each class. As in the multi-task approach, we experimented with Indic-BERT and HindiBERTa. The binary cross-entropy loss used in this approach can be formulated as follows:

$$\begin{aligned} L_{i}(\hat{y},y) = - \sum _{j=1}^{c}\left[ y_{j}\log \hat{y}_{j} + (1 - y_{j})\log (1-\hat{y}_{j})\right] \end{aligned}$$

where c is the total number of training examples and i indexes the independent models, one per task.

3.4 Auxiliary Task Based Binary Sub-classification

Similar to the previous approach, each classification task is treated as an individual binary classification problem. However, as an improvement over the previous approach, we treat the coarse-grained task as an auxiliary task and fuse its logits into each of the fine-grained subtasks. The motivation is that hostile sub-class specific information should be present in a post only if the post belongs to the hostile class [12]; treating the coarse-grained task as an auxiliary task therefore allows us to exploit additional hostile class-specific information from the logits of the auxiliary model. The loss function used in this case is the same as described in Sect. 3.3. The model is described in Fig. 1(b).

Fig. 1. (a) Multi-task classification model (b) Auxiliary task based binary sub-classification model.
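Below is a rough sketch of the auxiliary fusion, assuming the coarse-grained model's logit is concatenated with the sub-task encoder's first-token representation before the final classifier layer (dropping the auxiliary input recovers the plain binary classifier of Sect. 3.3). Model names and the exact fusion point are assumptions and may differ from the released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BinaryClassifier(nn.Module):
    """Plain per-task binary classifier (Sect. 3.3)."""
    def __init__(self, model_name="ai4bharat/indic-bert"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(self.dropout(cls)).squeeze(-1)   # (batch,) logit

class AuxFusedBinaryClassifier(nn.Module):
    """Sub-task classifier that also sees the hostile/non-hostile logit (Sect. 3.4)."""
    def __init__(self, aux_model, model_name="ai4bharat/indic-bert"):
        super().__init__()
        self.aux = aux_model                      # fine-tuned coarse-grained model
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.3)
        # the classifier sees sub-task features plus the auxiliary logit
        self.classifier = nn.Linear(self.encoder.config.hidden_size + 1, 1)

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        aux_logit = self.aux(input_ids, attention_mask).unsqueeze(-1)   # (batch, 1)
        fused = torch.cat([self.dropout(cls), aux_logit], dim=-1)
        return self.classifier(fused).squeeze(-1)

# usage: aux = BinaryClassifier()               # fine-tuned on hostile vs. non-hostile
#        fake_model = AuxFusedBinaryClassifier(aux)   # trained on hostile posts only
```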

4 Experiment

In this section, we first introduce the dataset used and then provide implementation details of our experiments in their respective subsections.

4.1 Dataset Description

As mentioned in Sect. 1, we evaluate our approach on the dataset proposed in [1]. As described in the dataset paper, the objective is the classification of posts as hostile or non-hostile, followed by a multi-label classification of hostile posts into the fake, hate, offensive, and defame classes. The dataset consists of 8192 online posts, of which 4358 samples belong to the non-hostile category, while the remaining 3834 posts convey one or more hostile dimensions. The annotated dataset contains 1638, 1132, 1071, and 810 posts for the fake, hate, offensive, and defame classes, respectively. Following [1], we split the dataset 70:10:20 into train, validation, and test sets, ensuring a uniform label distribution across the three sets.

4.2 Pre-processing

Prior to training the models, we perform the following pre-processing steps (a code sketch follows the list):

  • We remove all non-alphanumeric characters except the sentence-ending punctuation marks in Hindi (\(\vert \), ?), but we keep all stop words because our models operate directly on the sequence of words in a text.

  • We replace all user mentions and hashtags with a blank space.

  • We remove emojis, emoticons, flags, etc. from the posts.

  • We replace the URLs with the string ‘http’.
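As a rough sketch, the steps above can be approximated with regular expressions as follows; the exact character classes and order of operations in our pipeline are simplifications and assumptions.

```python
import re

def preprocess(text: str) -> str:
    """Approximate the pre-processing steps listed above."""
    text = re.sub(r"https?://\S+|www\.\S+", "http", text)   # replace URLs with 'http'
    text = re.sub(r"[@#]\S+", " ", text)                     # mentions/hashtags -> blank space
    # Keep Devanagari characters (this block includes the danda '।'), Latin
    # alphanumerics, whitespace, and '?'; everything else, including emojis,
    # emoticons, and flag symbols, is dropped.
    text = re.sub(r"[^\u0900-\u097F0-9A-Za-z\s?]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("@user यह खबर फर्जी है! 😡 https://example.com #fake"))
# -> 'यह खबर फर्जी है http'
```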

4.3 Experimental Setup

All the experiments were performed using PyTorch [20] and the HuggingFace [30] Transformers library. As the implementation environment, we used the Google Colaboratory tool, a free research environment with a Tesla K80 GPU and 12 GB of RAM. Optimization was done using Adam [13] with a learning rate of \(1e{-}5\). As discussed in Sect. 3, we used the pre-trained HindiBert [6], IndicBert [11], and HindiBERTa [24] models available in the HuggingFace library. Input sentences were tokenized using the respective tokenizer for each model, with the maximum sequence length restricted to 200 tokens. We trained each classifier model with a batch size of 16. In all the approaches, we used only the first-token output of each model as input to the classifier layer. Each classifier layer consists of one dropout layer (dropout 0.3) and one fully connected layer. Each fine-grained sub-classification task was trained only on the hostile-labeled examples, i.e. posts with at least one hostile class label, to avoid the extreme class imbalance that including non-hostile examples would cause. For evaluation, we used the weighted F1 score [22] as the performance metric for both classification tasks. As suggested in the CONSTRAINT-2021 shared task [21], to measure the combined performance of the 4 individual fine-grained sub-tasks, we used the weighted fine-grained F1 score, where the weights for the scores of the individual classes are the fractions of their positive examples.
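As a sketch of the evaluation, the per-task weighted F1 and the combined fine-grained score can be computed as below, assuming scikit-learn's weighted F1 and weighting the four sub-task scores by their fraction of positive examples, as described above.

```python
import numpy as np
from sklearn.metrics import f1_score

def task_f1(y_true, y_pred):
    """Weighted F1 for a single classification task."""
    return f1_score(y_true, y_pred, average="weighted")

def fine_grained_f1(y_true, y_pred):
    """Combined score over the fake/hate/defamation/offensive sub-tasks.

    y_true, y_pred: integer arrays of shape (num_samples, 4) with 0/1 labels.
    """
    scores = np.array([task_f1(y_true[:, j], y_pred[:, j])
                       for j in range(y_true.shape[1])])
    positives = y_true.sum(axis=0)
    weights = positives / positives.sum()   # fraction of positive examples per class
    return float(np.dot(weights, scores))
```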

5 Results

Table 1. Results obtained using the various methods and models. Here, Baseline: as described in the dataset paper [1], MLC: multi-label classification, MTL: multi-task learning, BC: binary classification, and AUX: auxiliary model.

In this section, we discuss the results of the different approaches proposed in Sect. 3. Table 1 summarizes the results obtained for the different approaches, along with the baseline [1]. Since the hostile/non-hostile distribution reflects a real phenomenon, we did not apply oversampling or undersampling to adjust the class distribution and kept the dataset as realistic as possible; this avoids overfitting (in the case of oversampling) and the loss of crucial data (in the case of undersampling). As is clear from Table 1, our best model, based on the approach described in Sect. 3.4 with the Indic-BERT model, outperforms the baseline as well as the other approaches in both tasks, i.e. the coarse-grained task of hostile vs. non-hostile classification and the fine-grained task of hostile sub-classification. Moreover, our best model stands as the 3\(^{\mathbf{rd}}\) runner-up in terms of weighted fine-grained F1 score in the CONSTRAINT-2021 shared task on hostile post detection (results can be viewed via footnote 3).

6 Error Analysis

Although we obtained some encouraging results, there are certain dimensions where our approach does not perform as expected. In this section we try to better understand the obtained F1 scores through some general observations and specific examples (refer to Table 2). Our model performed comparatively better on the fake dimension, which implies that it was able to capture the patterns in the fake samples of the dataset to a large extent. However, as can be seen in example 1, the fake/non-fake classification of posts is in certain cases largely context/knowledge based. Therefore, in the absence of any external knowledge, the method is quite ineffective, particularly on the kinds of samples that are under-represented in the dataset. Apart from this, we observe that the defamation scores are the lowest in general. This can mainly be attributed to the overall under-representation of the class in the dataset; hence a more balanced dataset is critical to boost the defamation F1 score.

Table 2. Misclassified samples from the dataset

Another important observation is the presence of metaphorical data in the dataset, i.e. posts whose intended meaning differs from their literal semantics. For example, consider example 2 in Table 2. This tweet is inspired by a Hindi idiom describing a person who, after committing every sin in the rule book, looks to God for atonement; it is used to refer to a hypocritical person indirectly. Such examples lead to misclassification by models that are primarily based on contextualized embeddings trained on simple datasets, as in our case. This could be mitigated if the models were pre-trained/fine-tuned on datasets containing more such metaphorical samples. From our manual inspection, we also observed that the dataset includes some examples whose labels are not apparent even to us. For instance, consider example 4. This example simply urges people to speak up for some cause, and such sentences are quite common in Hindi literature. It is impossible to conclude from the given data alone that it is an offensive post; the fact that it is correctly classified by our model reflects a bias in the dataset with respect to certain kinds of examples, against a generalization of the “Offensive” dimension. Apart from this, we also found some examples which, in our opinion, are labeled incorrectly or are too ambiguous to be categorised into the dimensions considered. Example 5 means “we do not want a favour, we only ask for what we deserve” and is labeled as defamation; however, in our view it is ambiguous to classify it into any of the considered dimensions and depends largely on context. Similarly, in example 6, someone is referred to by a word meaning “dog”; in our opinion it should be labeled as hate but is not.

7 Conclusion and Future Work

In this paper, we presented a transfer learning based approach for multi-dimensional hostile post detection that leverages pre-trained language models. As the evaluation results indicate, our final approach outperforms the baseline by a significant margin in all dimensions. Furthermore, examination of the results shows the ability of our model to surface some biases and ambiguities in the dataset collection and annotation process.

There is considerable scope for improvement on the fine-grained subtasks with few positive labels. Pre-training on relevant data (such as offensive or hate speech corpora) is a promising direction. In the case of fake news detection, it is very difficult to verify a claim without the use of external knowledge. In the future, we would like to extend the approach proposed in [27], which shows that using processed Wikipedia knowledge can significantly improve fake news detection accuracy.