1 Introduction

With the rise of social media, identifying toxic, aggressive, and offensive language has gained much attention. On one side, the internet has provided enormous opportunities to enhance interaction and bring awareness of recent activities occurring within or beyond the country. On the other side, perceived anonymity and lack of social cues encourage bullying and offensive incidents, resulting in more harm to a person’s life and depression. According to [9], the most common forms of offensive and aggressive text use hate remarks to incite religious, racial, ethnic, or political animosity on the internet. Also, threatening posts and inappropriate comments related to physical appearance are rising these days. Hence, to find out regions highly affected by such events, this paper presents the following survey reports.

According to the Microsoft survey of global youth online behaviour, offensive and cyberbullying posts are rising in most countries, and India ranked third after China and Singapore (Footnote 1). The dream of ‘Digital India’ has been accompanied by an exponential rise in such events. As per the statistical study conducted by Symantec, approximately 8 out of 10 individuals in India are subject to some type of cyberbullying or cyber harassment. Of these, around 63% faced online abuse and insults in the form of aggressive and offensive comments, and 59% were subject to false rumours and gossip meant to degrade their image (Footnote 2). The same study ranks India as the country facing the highest cyberbullying in the Asia Pacific, more than Australia and Japan. In fact, as per the survey by feminismindia (Footnote 3), nearly 50% of women located in prime locations of India have been targeted for online abuse. Taken together, these surveys indicate that hostile and offensive events are rising in India. In addition, since India is rich in language diversity, most of the hostile content is written using bilingual scripts. Among the various language pairs, Hi-En (Hindi English) code-mixed content is rising on the internet since Hindi is a common regional language and English is a widely spoken international language. Therefore, an intelligent system is required to detect aggressive and offensive content written in this language. The following subsections describe this language and the challenges of identifying offensive content in it with the help of some examples.

1.1 What is Hi-En code-mixed language?

India is a land of many languages. Hindi is the most widely used regional language as a first spoken language, and English is the most widely used second spoken language (Footnote 4). In a multilingual environment, people borrow words from a second language while writing comments in the first language. This phenomenon is called code-mixing [31]. For instance, Hi-En code-mixed text: @Feminism Is CANCER Ekdam Sahi bola Bhai. GLOSS: @Feminism Is CANCER, correctly said, brother.

Here, ‘Feminism Is CANCER’ is in English, and ‘Ekdam Sahi bola Bhai’ is a Hindi phrase written in romanized script. It is an informal language in which regional Hindi speakers write a Hindi comment in romanized script and mix it with English words or phrases (Footnote 5). The romanized script is also known as Hinglish. It is a type of phonetic Hindi writing using the English alphabet. Hence, it suffers from the problem of spelling variation. For instance, bhai, bhaai, and bhaii vary in spelling, but all of them mean brother. The next subsection explains the resulting challenges from a text understanding perspective.

1.2 What is aggressive Hi-En code-mixed text?

Insulting and offensive comments written in Hi-En code-mixed language are aggressive Hi-En code-mixed text. Table 1 gives a glimpse of the TRAC 2-2020 aggression classification dataset; its purpose here is to show raw code-mixed text, Hinglish text, spelling variation, and the target class.

Table 1 Glimpse of TRAC 2-2020, Trolling, Aggression, and Cyberbullying in Hi-En code mixed text [6]

Each sentence in the dataset is classified into one of three categories: CAG (covertly aggressive), NAG (non-aggressive), and OAG (overtly aggressive). OAG is annotated if the text abuses or insults the target directly, and CAG is annotated for a sentence that induces hate in the community without targeting anyone in particular. NAG is annotated for non-aggressive sentences. Note that even though some sentences contain insulting and hostile words (e.g., dange, bhadakte, kadwa), they belong to the NAG class. This shows that isolated words are not enough to classify a complex sentence; contextual knowledge is also required for correct inference. In addition, it is challenging to distinguish between aggressive comments and freedom of speech [6].

Manually identifying such aggressive content in the sizeable volume of social media content is a challenging and time-consuming task. As a result, an automated system is needed to recognize such content. Recently, machine learning and deep learning algorithms have been widely used to build advanced, automated, and intelligent systems. The efficiency of these algorithms depends on how well the features and resources are defined. In the literature, several resources and methods have been proposed to extract features in low-resource languages. Yilmaz and Toklu [41] built word embedding in the Turkish language for question classification tasks. Hassan et al., [14] generated word embedding using word2vec in the Urdu language for part-of-speech tagging and sentiment analysis. Athavale et al., [1] used Hindi word2vec word embedding for named entity recognition in the Hindi language. Prominent lexical resources like WordNet, Hindi senti-WordNet, and indo-WordNet have been used by [4, 10] to extract polarity-based sentiment features for sentiment classification in Hindi. As discussed, extensive work has been done on feature extraction in low-resource languages. However, these resources are limited to monolingual texts in languages like Hindi, Punjabi, Urdu, and Turkish. Consequently, the non-availability of such resources for an informal language like Hindi English code-mixed poses a challenge when using advanced machine learning algorithms. Hence, the proposed method takes advantage of both word and character embedding to build code-mixed hybrid embedding (CMHE) in order to overcome the challenge of Out Of Vocabulary (OOV) words. In addition, CMHE aims to initialize the network with discriminative features based on word polarity. These features are then tuned using supervised learning for sentence classification.

The significant contributions of this paper are summarized below:

  • Proposed Code-Mixed Hybrid Embedding (CMHE), capable of capturing contextually related words and words with spelling variation. It allows the network to be initialized with discriminative features based on word polarity and also reduces OOV words.

  • To leverage the benefits of Hindi English code-mixed language, this paper proposed a Code-Mixed Hybrid Embedding based Attention Network (CMHE-AN). In this network, the target model is regularized with CMHE, and relevant features are extracted using the self-attention mechanism.

  • Efficiency of the proposed model is demonstrated by comparing the proposed method against the existing state-of-the-art.

  • Adopted ablative studies to measure the influence of each component of the proposed model.

The rest of this paper is arranged as follows. Section 2 provides a brief overview of the relevant background literature. The proposed framework is detailed in Section 3. Section 4 discusses the datasets examined in the experiments, hyperparameter selection, comparative results, and ablative studies. The conclusion and future work are presented in the final section.

2 Related work

Most of the existing work on the detection of offensive language concentrates on English only, and only a few works address Hindi English code-mixed language. Therefore, this paper analyses existing work on offensive language detection for both English and Hindi English code-mixed languages. The comparable work is categorized into statistical and linguistic features, and deep learning based features.

2.1 Statistical and linguistic features

A considerable amount of work has been done using statistical features such as unigrams and bigrams along with tf-idf weights at the word and character level. One of the works [34] explored statistical features such as character n-grams and word n-grams with logistic regression and multinomial naïve Bayes. They observed that the combination of character n-grams with word unigrams and logistic regression performed well for Hindi English code-mixed data. However, they used pretrained word2vec embedding for the English dataset. They also noted the non-availability of pretrained code-mixed Hindi English embedding and identified the need for it as future work. Datta et al., [11] used tf-idf (term frequency-inverse document frequency) as a statistical feature and emoji, part of speech, and sentiment score as linguistic features, and evaluated the performance using SVM (support vector machine), GBM (gradient boosting), XGB (extreme gradient boosting), and a voting classifier. Of these, the voting classifier performed best in terms of accuracy on an English dataset. Mandal et al., [27] focused on preprocessing code-mixed text; they applied normalization techniques based on a sequence-to-sequence model and Levenshtein distance to English Bengali code-mixed text. Sharma et al., [37] used a back-transliteration approach to convert Hindi English code-mixed data to the Devanagari script. The sentiment score is then calculated using the Hindi SentiWordNet lexicon. However, back transliteration suffers from the limitation of spelling variation. It causes an Out Of Vocabulary (OOV) problem, as most abusive words and non-standard Devanagari words are not present in the Hindi SentiWordNet lexicon. Zhao et al., [42] proposed an Embeddings-enhanced Bag-of-Words Model (EBOW). The authors extended the bullying words using word embedding and concatenated the result with tf-idf weighted bag-of-words and latent semantic features to form the vector representation, and found that their model outperforms the state of the art for the English language. Zhao et al., [7] concatenated linguistic features, a lexicon of abusive words, and character n-grams, and attained the best accuracy using SVM for Hindi English code-mixed data.

2.2 Deep learning based features

Observing the recent rise in the usage of deep learning, [17] built a domain-specific word embedding to detect hate speech in Hindi English code-mixed data and applied CNN (Convolutional Neural Network), LSTM (Long Short Term Memory), and Bi-LSTM (Bidirectional Long Short Term Memory) as classifiers. They observed that the word-level feature contributes most to detecting hate speech; however, the authors did not provide any solution to accommodate OOV words. Badjatiya et al., [2] used deep learning models such as CNN and LSTM with several embeddings, such as random, GloVe, and fastText, and found that the combination of LSTM, random embedding, and gradient boosting performed best for detecting hate speech in English. Kim and Jeong [19] used 1D CNN to analyse sentiment on a movie review dataset and concluded that CNN performs better than LSTM and that local information contributes more while requiring fewer parameters. Kumar and Sachdeva [21] proposed a multi-input integrated model using a capsule network. They extracted features from English text, transliterated Hindi text, and typographic features without increasing the feature dimension. Another work [28] detected offensive tweets in Hindi English code-mixed language using transfer learning, with an English labelled hate speech dataset as the source dataset. To map between the languages, they translated the code-mixed dataset to English using Hindi-to-English word translation. The limitations of this approach are that the source and target datasets must have similar labels and that the syntactic order of words is not considered. Santosh and Aravind [35] used phonic subword embedding to identify hate speech in Hi-En code-mixed data. Paul et al., [32] concatenated Hindi and English pretrained word2vec, experimented with an ensemble of BERT, multilayer perceptron, CNN, and LSTM, and found that a regression-based ensemble performed best for code-switched cyberbullying detection. Recently, [12] released a pretrained transformer-based embedding named Multilingual BERT (MBert). Pretrained MBert is based on the BERT architecture and is trained on 104 languages in parallel. It has been widely used for multilingual applications. Pires et al., [33] concluded that MBert performs significantly well for languages that have similar word order. Sharma et al., [36] preprocessed and converted the Hinglish text into Devanagari text with the help of a language identification technique and Levenshtein distance. Furthermore, they used the MuRIL representation [23] for classification. Since Levenshtein distance is a rule-based method and does not consider context, converting multiple spellings into a single Devanagari form is a challenging task. Kumari et al., [22] used three stacked layers of LSTM autoencoder with random embedding and classified aggressive content using the reconstruction loss of the autoencoder. Koufakou et al., [20] augmented the English language dataset and the code-mixed dataset to increase the size of the data and used fastText and LSTM for classification. However, they observed that fastText is not sufficient to understand the code-mixed language.

As discussed, considerable work has been done on detecting hate speech and cyber aggression. The majority of it concentrates on the English language, and less work has focused on non-English languages. Hence, there is an urgent need for a system that can automatically detect cyber-aggressive and offensive comments in Hindi English code-mixed, user-generated informal content.

3 Proposed methodology

As shown in Fig. 1, a code-mixed hybrid embedding based attention network (CMHE-AN) is developed to detect the level of cyber aggression in Hindi English code-mixed social media text. It is a supervised classification model which takes the benefit of domain-specific knowledge from an unsupervised corpus. Hence, the CMHE-AN framework is divided into two stages: the pretraining stage and the training stage. In stage I, code-mixed hybrid embedding (CMHE) is proposed to gather contextual and morphological knowledge from the large unlabelled code-mixed corpus. It aims to assign similar weight vectors to words that are semantically related (e.g., surrender, surgical, strike) as well as to words that are misspelled or have spelling variation (e.g., sarkar, srkar, srkaar; gloss: government). In addition, it hybridizes word and character embedding and reduces OOV words. In stage II, an attention-based framework is proposed to finetune this knowledge for the classification task.

Fig. 1 Initialize Embedding layer of CMHE-AN model with CMHE

3.1 Construction of code-mixed hybrid embedding (Stage I-Pretraining)

This stage involves the process of collection and preprocessing of the large unlabelled corpus. Further, it demonstrates the process of building code mixed hybrid embedding (CMHE) using word and character embedding along with reducing OOV words.

3.1.1 Data collection

The popular pretrained word2vec for English is trained on a sizeable unlabelled corpus in a self-supervised fashion and has shown promising results in past works. Inspired by this, a large number of tweets written in Hinglish and code-mixed Hindi English language were collected using a seed word dictionary, as shown in Fig. 2. To create the seed word dictionary, 102 insulting and obscene words provided by [38], such as haraami, haraamzaada, soovar, andhe, hijda, and kutta, were used as seed words. To get the maximum number of tweets in Hinglish, these words were used in Hinglish form. To avoid bias, 12 positive emotional words were added to the seed dictionary: honest, great, friendly, innocent, careful, motivate, nice, wonderful, amazing, correct, faith, and thank you. These words are commonly used when switching between Hindi and English. For instance, Thank you mere video ka honest review dene k liye.

Fig. 2 Creation of seed word dictionary

This expanded the seed word dictionary to 114 words. The seed word dictionary and the GetOldTweets3 API (Footnote 6) were then used to collect tweets from Twitter. As shown in Table 2, 135000 tweets with 2799402 words in total and a vocabulary size of 209093 words were collected as the unsupervised corpus. This collected corpus is preprocessed using the steps described in Section 3.1.2.

Table 2 Characteristics of unsupervised corpus to build code mixed hybrid embedding

3.1.2 Preprocessing steps

The unsupervised corpus collected in Section 3.1.1 consists of user-generated text; it contains a lot of unwanted and noisy data that is irrelevant for classification. Therefore, the following steps were applied to remove irrelevant text from the collected corpus and the training datasets.

  • As shown in Table 3, Devanagari script is transliterated to English script using the indictrans transliteration library (Footnote 7) [5].

  • All URLs and numerals were removed using regular expressions.

  • All emoticons were removed, as they were adding false information.

  • All words were reduced to lowercase, e.g., GHATIA, Ghatia to ghatia (GLOSS: worst), so that they map to the same word embedding vector.

  • All @ mentions (e.g., @ xyz) were reduced to the common term user.

  • Stop words were not removed, as removing them can break the grammatical flow of the language.

  • Screaming is often expressed using elongated words. Such elongated words were reduced to their normal form, e.g., nahiiiiiiiiiii to nahi.

Table 3 Transliteration from Devanagari to English

Further, this preprocessed unsupervised corpus will be termed as UCagg. It is used to build aggressive word embedding and aggressive character embedding, as explained in upcoming Sections 3.1.3 and 3.1.4.
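The regular-expression steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the paper: the patterns, the emoji ranges, and the rule that collapses any character repeated three or more times to a single occurrence are all assumptions, and the transliteration step (which relies on the indictrans library) is omitted.

```python
import re

# All patterns below are illustrative assumptions, not the paper's exact rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\s?\w+")            # "@xyz" or "@ xyz"
NUMBER_RE = re.compile(r"\d+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji ranges
ELONGATED_RE = re.compile(r"(.)\1{2,}")        # same character 3+ times in a row


def preprocess(text: str) -> str:
    """Apply the cleaning steps in order and normalise whitespace."""
    text = URL_RE.sub(" ", text)               # remove URLs
    text = MENTION_RE.sub(" user ", text)      # map every mention to "user"
    text = EMOJI_RE.sub(" ", text)             # drop emoji symbols
    text = NUMBER_RE.sub(" ", text)            # remove numerals
    text = text.lower()                        # GHATIA, Ghatia -> ghatia
    text = ELONGATED_RE.sub(r"\1", text)       # nahiiiiii -> nahi (heuristic)
    return " ".join(text.split())              # stop words are kept


print(preprocess("@ xyz GHATIA video nahiiiiiii http://example.com 123"))
# user ghatia video nahi
```

Collapsing repeated characters to one occurrence is only a heuristic; it leaves ordinary double letters (as in gaddar) untouched because it fires only on runs of three or more.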

3.1.3 Aggressive word embedding (awe)

Firstly, an aggressive word embedding is constructed using the word2vec algorithm. Pretrained word2vec has been widely used for semantic-based text representation. It is based on the distributional hypothesis: words that occur in similar contexts lie close to each other in the embedding space and preserve semantic similarity. Pretrained word2vec models are available for languages like English and Hindi; however, none is available for Hi-En code-mixed social media text. Hence, in order to incorporate a contextually related embedding, an aggressive word embedding is built by training the word2vec algorithm on the unsupervised corpus (UCagg) in a self-supervised fashion. The model was trained using continuous bag-of-words (CBOW), a shallow neural network architecture of word2vec proposed by [29], with a vector dimension of 100 and a window size of 5 words. In the rest of this paper, it is termed aggressive word embedding (awe). To analyse the efficiency of awe, the cosine similarity score in (1) is used to retrieve contextually related words. Table 4 presents the top 10 similar words captured by awe for each target word (t). From Table 4, it can be inferred that awe has successfully captured contextually related words. It is also observed that words of the same polarity have similar weights, as measured by cosine similarity, which helps the network to learn polarity-based features without an explicit polarity dictionary. For instance, ghatiya is used to express negative emotion, and the words in its close proximity are recognized correctly irrespective of spelling variation: neech, nich, beshrm, gatiya, and besharam are the most similar words to ghatiya. Hence, these words share the same region of the embedding space and convey a similar meaning from the language understanding perspective.

$$ \text{Cosine Similarity}(t,b)= \frac{{{\sum}_{i=1}^{n}}t_{i}b_{i}}{\sqrt{{{\sum}_{i=1}^{n}}{t_{i}^{2}}}\sqrt{{{\sum}_{i=1}^{n}}{b_{i}^{2}}}} $$
(1)

Here, t is the vector of the target word, b is the vector of a vocabulary word of UCagg, and n is the embedding dimension.
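Equation (1) and the top-k lookup behind Tables 4 and 5 can be sketched as follows. The three toy vectors are made up for illustration and are not values from awe; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(t, b):
    """Equation (1): dot product divided by the product of the two norms."""
    dot = sum(ti * bi for ti, bi in zip(t, b))
    norm_t = math.sqrt(sum(ti * ti for ti in t))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_t == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_t * norm_b)


def most_similar(target, embedding, top_n=10):
    """Rank every other vocabulary word by cosine similarity to the target."""
    t = embedding[target]
    scored = [(word, cosine_similarity(t, vec))
              for word, vec in embedding.items() if word != target]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 3-dimensional embedding (invented values, for illustration only).
toy = {
    "ghatiya": [0.9, 0.1, 0.0],
    "gatiya": [0.8, 0.2, 0.0],
    "accha": [0.0, 0.1, 0.9],
}
print(most_similar("ghatiya", toy, top_n=2))
```

With these toy vectors, the spelling variant gatiya ranks above the opposite-polarity word accha, which mirrors the behaviour the section describes.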

Table 4 Aggressive word embedding (awe) captured relative words

3.1.4 Aggressive character embedding (ace)

FastText, a character n-gram embedding, is used to build the aggressive character embedding. The approach of fastText is similar to word2vec, but instead of treating the whole word as a unit, it represents a word by its groups of n characters. The resultant vector of the word is calculated by averaging its character n-gram embeddings. Due to this, it can capture words with spelling and morphological variation. UCagg was trained using the skip-gram model of fastText with dimension 100 and an n-gram length of 3 characters, as suggested by [8]. In the rest of this paper, it is termed aggressive character embedding (ace). Table 5 presents the 10 most similar words captured by ace for each target word (t), along with their cosine similarity scores. From Table 5, it can be inferred that ace has successfully captured the morphological variation of words.
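To make the n-gram averaging concrete, the sketch below splits a word into character 3-grams and averages their vectors. It is a simplified illustration under stated assumptions: the `<` and `>` boundary markers follow the fastText convention, while the n-gram table, its values, and the helper names are hypothetical (the real library also hashes n-grams into buckets, which is not modelled here).

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def average_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams (zeros if none known)."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][k] for g in grams) / len(grams)
            for k in range(dim)]


print(char_ngrams("bhai"))    # ['<bh', 'bha', 'hai', 'ai>']
print(char_ngrams("bhaai"))   # ['<bh', 'bha', 'haa', 'aai', 'ai>']

# Spelling variants share most of their n-grams, so their averages are close.
shared = set(char_ngrams("bhai")) & set(char_ngrams("bhaai"))
print(sorted(shared))         # ['<bh', 'ai>', 'bha']
```

Because bhai and bhaai share three of their 3-grams, their averaged vectors overlap heavily, which is why a character-level model can place unseen spelling variants near known words.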

Table 5 Aggressive character embedding(ace) captured spelling variations

Further, (awe) and (ace) were used to build code mixed hybrid embedding (CMHE) along with a reduction in OOV simultaneously.

3.1.5 Code-Mixed Hybrid Embedding (CMHE)

CMHE is based on the concatenation of aggressive word embedding (awe) and aggressive character embedding (ace) along with a reduction in OOV words. The concatenation captures both polarity-based contextually related words from awe and morphologically related words from ace. Since ace is built from character n-grams, it does not have any OOV issue, whereas awe suffers from the OOV problem. Hence, CMHE takes advantage of ace and its ability to capture variously spelled similar words to reduce the OOV words in awe. In addition, Table 5 depicts that contextually related positive polarity words have similar embedding weights and hence convey positive sentiment in the language understanding phase. Similarly, negative polarity contextual words have similar weight vectors and lie close together in the embedding space, which allows negative polarity to be captured at the word level. Overall, the proposed network is intended to be initialized with discriminative features based on word polarity; in a later phase, these are refined using a supervised network for categorizing the sentences. The construction of CMHE is explained next with the help of an algorithm.

In Algorithm 1, knowledge from the unsupervised corpus in the form of (awe) and (ace) is transferred to build CMHE (axe), specifically for the vocabulary associated with the experimental dataset. In other words, axe for each word wi ∈ vocabulary (Vd), having dimension 200, is built by transferring the concatenated weights from (awe) and (ace), as shown in Fig. 1 and Algorithm 1. Here, Vd is the vocabulary associated with the experimental dataset.

Algorithm 1 OOV reduction in CMHE

- Explanation of Algorithm 1 with an example: consider a word (wi) ‘gaddar’ ∈ vocabulary Vd. To build the CMHE of ‘gaddar’, its associated vectors are fetched from (ace) and (awe) and concatenated, as shown in step 4. If ‘gaddar’ is not found in awe, similar words with different spellings are fetched from ace and stored in an array in descending order of cosine similarity (for details, refer to Table 5). Each word in this array is then searched in awe, in descending order of cosine similarity. As soon as a match is found, its associated awe vector is concatenated with the ace vector of ‘gaddar’, which reduces the size of the OOV dictionary. If no similar word is matched in (awe), a random vector of dimension 100 is assigned to ‘gaddar’ and concatenated with its ace vector. The example is explained with the help of a flowchart in Fig. 3.

Fig. 3 Working of CMHE using a sample word ‘gaddar’
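The lookup-with-fallback logic described for ‘gaddar’ can be sketched as below. This is an illustration of the described steps, not a transcription of Algorithm 1: the function and parameter names are hypothetical, `ace_vector` and `ace_neighbours` stand in for whatever the character model provides, the awe-then-ace concatenation order is assumed, and the range of the random fallback vector is an arbitrary choice.

```python
import random


def build_cmhe(vocabulary, awe, ace_vector, ace_neighbours, dim=100, seed=0):
    """Build a {word: awe-part + ace-part} table and list the words left OOV.

    awe            -- dict mapping word -> word-level vector (may miss words)
    ace_vector     -- callable: word -> character-level vector (never OOV)
    ace_neighbours -- callable: word -> similar spellings, most similar first
    """
    rng = random.Random(seed)
    axe, still_oov = {}, []
    for word in vocabulary:
        if word in awe:
            word_part = awe[word]
        else:
            # Walk the ace neighbours in descending similarity; take the
            # first spelling variant that awe knows about.
            match = next((w for w in ace_neighbours(word) if w in awe), None)
            if match is not None:
                word_part = awe[match]
            else:
                # No variant found: fall back to a random vector (assumed range).
                word_part = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
                still_oov.append(word)
        axe[word] = list(word_part) + list(ace_vector(word))
    return axe, still_oov


# Toy usage with 2-dimensional vectors (all values invented).
awe = {"gaddaar": [1.0, 2.0]}
neighbours = {"gaddar": ["gadar", "gaddaar"]}
axe, oov = build_cmhe(
    ["gaddar", "xyz"], awe,
    ace_vector=lambda w: [0.5, 0.5],
    ace_neighbours=lambda w: neighbours.get(w, []),
    dim=2,
)
print(axe["gaddar"], oov)
```

In the toy run, ‘gaddar’ is missing from awe but inherits the vector of its variant ‘gaddaar’, so only ‘xyz’ remains out of vocabulary.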

After this, CMHE was used to transfer the knowledge into the word embedding layer (2nd layer) of the CMHE-AN framework, as shown in Fig. 1. The proposed approach is compared with the cross-lingual embedding MBert, finetuned random embedding, and statistical features; the comparative analysis is shown in the results section.

3.2 Attention-based framework (Stage II)

In recent times, the usage of attention mechanisms in deep learning networks has shown a significant improvement in several applications, such as machine translation, text classification, and text summarization. To increase the weightage of relevant features, [13, 25] combined the attention mechanism with CNN to extract local features for the text classification task. Another well-known deep learning algorithm, LSTM, belongs to the family of Recurrent Neural Networks and is proficient in extracting features from long sequences. [15, 24] combined the attention mechanism with LSTM and Bi-LSTM to extract sequential features. A textual sentence is a sequence of words in which each word is related to the words that follow it as well as to the words that precede it; therefore, Bidirectional LSTM and the self-attention mechanism were employed to finetune code-mixed hybrid embedding (CMHE).

In all, the proposed framework CMHE-AN focuses on finetuning code-mixed hybrid embedding (CMHE) with a Bi-LSTM layer and a self-attention mechanism, as shown in Fig. 4. CMHE-AN is divided into multiple layers, and a brief description of each layer is given below:

Fig. 4 Proposed Architecture of CMHE-AN

Input layer

It is the first layer in which the words of each sentence are converted to a sequence of unique index values xi and padded with 0 to maintain uniformity in the length of all sentences.
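A minimal sketch of this indexing and padding step is given below. The choices to pad at the end, to truncate long sentences, and to skip unseen words are assumptions made for the illustration, as are the function names and the toy sentences.

```python
def build_index(sentences):
    """Assign each distinct word an integer id, reserving 0 for padding."""
    index = {}
    for sentence in sentences:
        for word in sentence.split():
            index.setdefault(word, len(index) + 1)
    return index


def encode(sentence, index, max_len):
    """Map words to ids, truncate to max_len, and pad the rest with 0."""
    ids = [index[w] for w in sentence.split() if w in index][:max_len]
    return ids + [0] * (max_len - len(ids))


corpus = ["tum bahut ghatiya ho", "tum ho"]   # toy sentences
index = build_index(corpus)
print(encode("tum ho", index, max_len=5))     # [1, 4, 0, 0, 0]
```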

Embedding layer

Each index value xi of a word wi in a sentence is replaced by its real-valued vector from the proposed embedding matrix axe (CMHE), which has dimension 200, as shown in Fig. 1.

Spatial dropout layer

Inspired by [39], we employed a spatial dropout layer before the Bi-LSTM layer. The role of this layer is to improve generalization performance by preventing activations from becoming strongly correlated, since strong correlation leads to over-training.

Bi-LSTM layer

RNN is generally used for sequence learning, but because of vanishing and exploding gradients it does not perform well on tasks with long dependencies. LSTM addresses this limitation with an internal system of gates that governs the flow of long sequential information. It consists of 3 gates: the input gate, the output gate, and the forget gate, and its workflow is described in the form of equations in Table 6. All the gates use sigmoid activation, whose output varies between 0 and 1 and thus decides whether to forget or retain information based on its relevance to the context. Bi-LSTM is an extension of LSTM and is employed to incorporate the contextual relationship between words in both the forward and the backward sequence. Using Bi-LSTM, an annotation for each word Xt is obtained by concatenating the forward hidden state →Ht and the backward hidden state ←Ht. In this way, the annotation Ht contains the summaries of both the preceding and the following words.
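For reference, the gate computations in their commonly used textbook form are sketched below. This is the standard LSTM formulation and is not necessarily identical in notation to Table 6; the weight matrices W and U, the biases b, the cell state c_t, and the element-wise product ⊙ are symbols introduced here for illustration.

$$ \begin{aligned} i_{t} &= \sigma(W_{i}X_{t} + U_{i}H_{t-1} + b_{i}) \\ f_{t} &= \sigma(W_{f}X_{t} + U_{f}H_{t-1} + b_{f}) \\ o_{t} &= \sigma(W_{o}X_{t} + U_{o}H_{t-1} + b_{o}) \\ \tilde{c}_{t} &= \tanh(W_{c}X_{t} + U_{c}H_{t-1} + b_{c}) \\ c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \\ H_{t} &= o_{t} \odot \tanh(c_{t}) \end{aligned} $$

In the bidirectional case, one such recurrence runs left to right and another right to left, and the annotation of word X_t is the concatenation of the two resulting hidden states.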

Table 6 Equation of LSTM gates

Attention layer

Inspired by [3], the weights of the most contributing words are raised by adding a self-attention layer. It avoids a significant amount of unnecessary computation on unattended elements and allows the model to pay attention to the important parts of the sequence. In this layer, the encoder maps the input sentence to a sequence of annotations (H1, …, HTx), where Tx is the length of the sentence. Each annotation Hj contains information about the whole input sequence, with a strong focus on the parts surrounding the j-th word of the input sequence. The context vector Ci is then computed as a weighted sum of these annotations Hj, as shown in (2).

$$ C_{i} = \sum\limits_{j=1}^{T_{x}}\alpha_{ij}H_{j} $$
(2)

The weight αij of each annotation Hj is computed by (3).

$$ \alpha_{ij} = \frac{\exp (e_{ij})}{{\sum}_{k=1}^{T_{x}}\exp(e_{ik})} $$
(3)

where eij is estimated by the alignment of Si−1 and Hj, as shown in (4).

$$ e_{ij} = a(S_{i-1},H_{j}) $$
(4)

Here, eij is the score given by the alignment model a, which measures how well the inputs around position j and the output at position i match. The score is based on the hidden state Si−1 and the j-th annotation Hj of the input sentence.
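Equations (2) and (3) amount to a softmax over the alignment scores followed by a weighted sum of the annotations. The sketch below takes the scores e_ij as given, since the form of the alignment model a is not specified here; the function names and toy numbers are illustrative assumptions.

```python
import math


def attention_weights(scores):
    """Equation (3): softmax over the alignment scores e_i1 .. e_iTx."""
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, annotations):
    """Equation (2): C_i as the alpha-weighted sum of the annotations H_j."""
    alphas = attention_weights(scores)
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alphas, annotations))
               for k in range(dim)]
    return alphas, context


# Two annotations with equal scores: weights are uniform, context is the mean.
alphas, context = context_vector([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
print(alphas, context)   # [0.5, 0.5] [2.0, 3.0]
```

Raising one score relative to the others shifts the weights, and therefore the context vector, toward the corresponding annotation.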

Dropout layer

It is a regularization technique in which randomly selected neurons are ignored during training; during the forward pass, the contribution of these neurons to downstream activations is zero, and during the backward pass, their weights receive no updates. This helps to avoid overfitting. In CMHE-AN, the dropout is set to 20%.

Dense layer 1

Further, the features are passed to a fully connected layer having 32 neurons with softmax as the activation function, followed by a dropout of 20%.
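The behaviour described above can be sketched with a small "inverted dropout" function. Rescaling the surviving activations by 1 / (1 − rate) is an assumption borrowed from how common deep learning libraries implement dropout; the text itself only states the 20% rate.

```python
import random


def dropout(activations, rate=0.2, training=True, rng=random):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged
    (an assumed convention); at inference the input passes through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
out = dropout([1.0] * 10, rate=0.2, rng=rng)
print(out)                                    # a mix of 0.0 and about 1.25
print(dropout([1.0, 2.0], training=False))    # [1.0, 2.0]
```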

Dense layer 2

The final features are then transferred to a fully connected layer, whose outputs are passed through the softmax layer to obtain the probability distribution over the labels.

Overall, this section described CMHE, which is intended to provide relevant initial weights to the network, and then explained how significant features are extracted using the CMHE-AN framework. The next section presents the experimental procedure and compares the results with the baseline and existing state-of-the-art models.

4 Experiments and results

This section presents the experimental procedure, hyperparameter selection, ablative studies, and the evaluation metrics on which the results are measured.

4.1 Dataset description

For experimentation purposes, two publicly available datasets are used: the TRAC 2-2020 (trolling, aggression, and cyberbullying) Hindi English code-mixed dataset [6] and the Hi-En code-mixed hate speech detection dataset [7]. In this paper, the TRAC 2-2020 dataset is referred to as dataset 1, and the Hi-En code-mixed hate speech detection dataset as dataset 2. In dataset 1, the training set contains 3984 and the test set contains 1200 YouTube comments in Hindi English code-mixed language. Each sentence is annotated with one of 3 classes: Covertly Aggressive (CAG), Non-Aggressive (NAG), or Overtly Aggressive (OAG). Dataset 2 is a binary dataset in which each row is labelled as Hate or Non-Hate. It is split into two parts using stratified sampling, with 80% of the data as the training set and 20% as the test set. In the training set, 1328 comments belong to Hate and 2331 belong to Non-Hate, as shown in Table 7.

Table 7 Statistical characterization of datasets
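The stratified 80/20 split used for dataset 2 can be illustrated with the sketch below, which splits each class separately so that both parts keep the original class ratio. The function name, the seed, and the rounding of the per-class test size are assumptions; the paper does not state which tool was used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within the class
        n_test = round(len(indices) * test_fraction)  # per-class test size
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 10 Hate and 20 Non-Hate rows.
labels = ["Hate"] * 10 + ["Non-Hate"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))                  # 24 6
```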

4.1.1 OOV analysis using CMHE

The total vocabulary size of datasets 1 and 2 is 12915 and 12015 words, respectively. Of these, 3467 and 3609 words were not found in aggressive word embedding (awe); hence, they are counted as actual OOV words. After applying CMHE, the OOV size is reduced to 1167 and 1292 words for datasets 1 and 2, respectively. In other words, CMHE has reduced the OOV size to about 33.7% and 35.8% of its original size for datasets 1 and 2, respectively, as shown in Fig. 5. Section 4.6 analyses the performance of CMHE-AN without OOV reduction and after reducing OOV words.

Fig. 5 Actual OOV versus Reduced OOV
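The OOV figures can be checked with a few lines of arithmetic. The counts are the ones reported in this section; the percentages are simply recomputed from them, and the dictionary layout and function name are illustrative.

```python
# (vocabulary size, OOV before CMHE, OOV after CMHE) as reported above.
oov_counts = {
    "dataset 1": (12915, 3467, 1167),
    "dataset 2": (12015, 3609, 1292),
}


def remaining_share(before, after):
    """OOV words left after CMHE, as a percentage of the original OOV count."""
    return 100.0 * after / before


for name, (vocab, before, after) in oov_counts.items():
    print(
        f"{name}: {100.0 * before / vocab:.1f}% of vocabulary OOV in awe, "
        f"{remaining_share(before, after):.1f}% of those remain after CMHE "
        f"({before - after} words recovered)"
    )
```

The recomputed ratios match the percentages quoted in the text up to rounding.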

4.2 Experimental procedure

This section explains the procedure followed to compare the performance of CMHE-AN with the baseline models. The proposed model, CMHE-AN, is applied to and evaluated on both datasets. For the baseline models, experiments with random embedding at the character and word level are performed. In addition, the impact of random embedding on extracting local and sequential features using CNN and LSTM, respectively, is analysed. To examine transformer-based embedding, M-Bert is employed and finetuned with a fully connected dense layer. All the baseline models are evaluated, and their performance is compared with CMHE-AN on both datasets.

4.3 Comparison models

This paper compared CMHE-AN with some baseline models. A brief description of these models is mentioned below:

  • Logistic Regression: Word-based unigram and bigram along with weighted tf-idf (term frequency-inverse document frequency) features are used to create relevant feature vectors at word and phrase level. Further, logistic regression is used as a classifier.

  • Extreme Gradient Boosting (XGBoost): It is a tree-based ensemble ML algorithm. XGBoost is used as a classifier for word-based unigram and bigram features along with weighted tf-idf features.

  • Random embedding with CNN at character level (Character-CNN) [18]: Each character is randomly initialized with a small numeric vector of dimension 200 using a Gaussian distribution (by default). This weight is then learned and finetuned in a supervised manner using a convolutional neural network (CNN) with kernel size 3. It creates local features at the character level and ignores the long sequential dependencies of the text.

  • Random embedding with CNN at word level (Word-CNN) [18]: In this model, each vocabulary word is initialized randomly and finetuned with a CNN using the backpropagation algorithm. It creates local features at the word level and ignores the long sequential dependencies of the text.

  • CNN-LSTM (Subword LSTM) [16]: In this model, each n-gram character is initialized with a random numeric vector. To extract local and sequential features, character 3-gram CNN is used to create subwords and is further fed to LSTM to develop a sequential feature for long text sequences. In this model, LSTM is stacked on top of the CNN layer and further fed to a fully connected neural network layer.

  • Fine-tuned M-BERT [12]: Pretrained M-Bert is trained in advance on 104 languages, which include Hindi and English. To analyse its performance on Hi-En code-mixed data, M-Bert is applied to both datasets and finetuned with a fully connected neural network along with a dropout of 20
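
To make the statistical baselines more concrete, the following sketch builds word-level unigram and bigram tf-idf features in plain Python. It is an illustration under stated assumptions, not the configuration used in the experiments: the idf variant (plain ln(N/df)), the whitespace tokenizer, the function names, and the three toy sentences are all choices made here, and libraries commonly apply smoothed variants.

```python
import math
from collections import Counter


def word_ngrams(text, n_values=(1, 2)):
    """Lowercased whitespace tokens turned into unigrams and bigrams."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf_vectors(documents):
    """Return one sparse {ngram: weight} dict per document, plus the idf table."""
    doc_grams = [word_ngrams(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(gram for grams in doc_grams for gram in set(grams))
    # Assumption: unsmoothed idf = ln(N / df).
    idf = {gram: math.log(n_docs / df) for gram, df in doc_freq.items()}
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams)
        vectors.append({g: (c / total) * idf[g] for g, c in counts.items()})
    return vectors, idf


# Toy Hi-En sentences, made up for this example.
docs = ["tum bahut bure ho", "tum bahut acche ho", "yeh movie bahut acchi hai"]
vectors, idf = tfidf_vectors(docs)
print(sorted(vectors[0], key=vectors[0].get, reverse=True)[:3])
```

Sparse vectors of this kind would then be passed to a classifier such as logistic regression or XGBoost; n-grams that occur in every document (here "bahut") receive zero weight, while rarer words and phrases dominate.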

In the next section, the evaluation metrics used to assess individual models on dataset 1 and dataset 2 are explained.

4.4 Evaluation metrics

This section describes the evaluation metrics used to compare the performance of multiple models and to identify the most suitable one. Accuracy is the ratio of the number of correctly classified samples to the total number of data entries, as shown in (5). Judging a model on accuracy alone is not enough, since it only concentrates on correctly classified data and does not consider the effect of misclassified data.

$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
(5)

Hence, to account for class imbalance, we used weighted average precision, weighted average recall, and weighted average f1-score as evaluation metrics. Various works [30, 40] have used these metrics to address the challenge of data imbalance in such a way that misclassification in the minority class is penalized. To calculate weighted average precision, we weighted the precision of each class by its support S_i (the number of data samples associated with the ith class) and averaged them to take label imbalance into consideration, as shown in (6). Here, precision is defined as the ratio of the number of samples correctly classified as positive to the total number of samples predicted as positive, as shown in (7).

$$ weighted \ average\ precision= \frac{{\sum}_{i=1}^{n}(precision_{i} * S_{i})}{{\sum}_{i=1}^{n}S_{i}} $$
(6)
$$ precision = \frac{TP}{TP+FP} $$
(7)

Similarly, for weighted average recall, the recall of each class is weighted by its class support S_i and averaged, as shown in (8). Here, recall is calculated as the ratio of the number of samples correctly classified as positive to the total number of actual positive samples, as given in (9).

$$ weighted \ average \ recall= \frac{{\sum}_{i=1}^{n}(recall_{i} * S_{i})}{{\sum}_{i=1}^{n}S_{i}} $$
(8)
$$ recall = \frac{TP}{TP+FN} $$
(9)

Weighted average f1-score: To integrate all class-specific f1-scores, we weighted the f1-score of each class by its support S_i (the number of data samples associated with the ith class) and averaged them to account for label imbalance. Mathematically, the product of the ith class f1-score and the ith class sample count is summed over all classes and then divided by the total number of data samples in the n classes, as shown in (10), where the f1-score is the harmonic mean of precision (7) and recall (9), as shown in (11). Also, it should be noted from (10) that the weighted average f1-score is not computed from the weighted average precision or the weighted average recall.

$$ weighted \ average \ f1 \ score= \frac{{\sum}_{i=1}^{n}(f1 \ score_{i} * S_{i})}{{\sum}_{i=1}^{n}S_{i}} $$
(10)
$$ f1 \ score = 2 * \frac{precision * recall}{precision + recall} $$
(11)

Here, TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative, precision_i = precision of the ith class, recall_i = recall of the ith class, f1 score_i = f1 score of the ith class, and S_i = number of data samples associated with the ith class, where i ∈ {1, ..., n}.
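
Equations (6) to (11) can be checked on a small example. The sketch below is a plain-Python illustration with made-up labels (the function names and the toy predictions are assumptions, not experimental data); it also shows that the weighted average f1-score of (10) generally differs from the harmonic mean of the weighted average precision and recall.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Precision (7), recall (9) and f1 (11) for each class, one-vs-rest."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def weighted_averages(y_true, y_pred):
    """Support-weighted precision (6), recall (8) and f1 (10)."""
    support = Counter(y_true)  # S_i: number of true samples in class i
    total = sum(support.values())
    scores = per_class_scores(y_true, y_pred)
    return tuple(
        sum(scores[c][k] * support[c] for c in scores) / total for k in range(3)
    )


# Toy imbalanced labels, made up for the example.
y_true = ["Non-Hate"] * 8 + ["Hate"] * 2
y_pred = ["Non-Hate"] * 4 + ["Hate"] * 4 + ["Hate"] * 2
w_precision, w_recall, w_f1 = weighted_averages(y_true, y_pred)
harmonic = 2 * w_precision * w_recall / (w_precision + w_recall)
print(round(w_precision, 4), round(w_recall, 4), round(w_f1, 4), round(harmonic, 4))
```

Because each class contributes its own f1-score before averaging, a class with poor precision or recall lowers the weighted f1-score even when the averaged precision and recall look acceptable.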

4.5 Hyperparameter selection

Initially, datasets 1 and 2 are preprocessed using the steps mentioned in Section 3.1.2. After preprocessing, the text is converted to a numeric sequence using a vocabulary index. Padding is then used to give all sentences equal length. To retain the contextual information of lengthy text, the maximum sentence length is set to 50 words for dataset 1 and 90 words for dataset 2, considering the average length of the dataset as mentioned in Table 7. If a sentence is shorter than the maximum length (50 for D1 and 90 for D2), prepadding is used, and if it is longer, it is truncated at the beginning. For the experiments, the well-known Python library Keras (Footnote 8) was used with TensorFlow (Footnote 9) as a backend. 5-fold cross-validation was performed on the training dataset, and the final model was evaluated on the test dataset. CMHE-AN was trained for eight epochs with batch size 32, which was found to be nearly optimal after several experiments. The proposed method used categorical cross-entropy as the loss function and the Adam optimizer with a learning rate of 0.001 to optimize the network. However, the code-mixed embedding size and the number of LSTM units are significant and affect the network's performance. Thus, we experimented with code-mixed embedding dimensions of 200, 400, and 600, as these are the most common dimensions used in the literature. In addition, CMHE-AN was trained with different numbers of LSTM units (30, 50, 60, 100, 120) to find the optimal number of LSTM units for training. The experimental results with various hyperparameters are shown in Table 8.
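
The padding rule described above (prepadding for short sentences, truncation at the beginning for long ones) can be sketched in plain Python as follows. The function name `pad_or_truncate` and the use of index 0 as the padding value are assumptions made for this illustration; in Keras, the `pad_sequences` utility with its default `padding='pre'` and `truncating='pre'` settings behaves in the same way.

```python
def pad_or_truncate(token_ids, max_len, pad_value=0):
    """Pre-pad short sequences and drop the earliest tokens of long ones."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    token_ids = list(token_ids)
    if len(token_ids) >= max_len:
        # Truncate at the beginning: keep only the last max_len tokens.
        return token_ids[-max_len:]
    # Pre-padding: padding values go in front of the real tokens.
    return [pad_value] * (max_len - len(token_ids)) + token_ids


# Maximum lengths stated in the text for the two datasets.
MAX_LEN = {"dataset 1": 50, "dataset 2": 90}

short = pad_or_truncate([7, 3, 9], MAX_LEN["dataset 1"])
long_ = pad_or_truncate(range(1, 121), MAX_LEN["dataset 2"])
print(len(short), short[-3:], len(long_), long_[0], long_[-1])
```

Padding at the front keeps the real tokens adjacent to the final LSTM time steps, which is a common reason for preferring prepadding with recurrent models.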

Table 8 Validation Accuracy with various embedding dimensions and LSTM units size as hyperparameter of CMHE-AN

As shown in Fig. 6 and Table 8, training CMHE-AN with LSTM unit sizes (30, 40, 60, 100, 120) and embedding dimensions (200, 400, 600) led to the following observations.

  • CMHE with dimension 200 performs significantly better in terms of validation accuracy for both datasets. It is so because the size of the unsupervised corpus used to build CMHE is not as large as the English word2vec corpus. Therefore, the efficiency of the weight vector decreases as the dimension of embedding increases.

  • For dataset 1, 60 LSTM units perform better than the other LSTM unit sizes. For dataset 2, 100 LSTM units perform better than the rest.

Thus, 60 LSTM units were selected for dataset 1, 100 LSTM units for dataset 2, and a CMHE dimension of 200 as the hyperparameters of CMHE-AN. These hyperparameters were then used to evaluate the performance of CMHE-AN.

Fig. 6 Experimental Results of CMHE-AN with various hyperparameters

4.6 Results and ablation study

The methodology discussed in Section 3 and the comparison models discussed in Section 4.3 have been used to evaluate the performance of CMHE-AN. Table 9 compares the proposed approach with the baseline models. From the results, it is observed that:

  • Naïve Bayes achieved a high precision of 77.66% with a low recall of 47% on dataset 1, since the model underfit and was biased towards the majority class.

  • Logistic regression based on n-gram tf-idf statistical features scored a precision of 75.78% and a recall of 70%, which is 11.39% and 13.72% greater than char CNN and word CNN in the case of dataset 1. A similar pattern has been observed in dataset 2 as well, which signifies that random initialization of the network may mislead the meaning of the sentence. In this case, statistical features performed better than random embedding based neural networks.

  • Subword LSTM performs slightly better than char CNN and word CNN, by 1.57% and 0.70% in f1-score, because it extracts sequential features using an LSTM on top of the CNN layer. It achieved the highest precision of 76.63% with a low recall of 47.89% on dataset 2, since it assigned few samples to the minority class and most of those were correctly classified, which results in high precision and low recall.

  • The transformer-based model, finetuned MBert, performed best among the baseline models in terms of recall, accuracy, and f1-score. It achieved f1-scores of 72% and 71% on datasets 1 and 2, respectively. Pretrained MBert is based on extracting cross-lingual features from parallel monolingual corpora. However, its performance suffers when two languages lack similar word sequencing [33]. This disruption of word sequencing occurs in Hi-En code-mixed language due to repeated switching between Hindi and English, resulting in a decrease in MBert's performance.

  • In CMHE-AN (without reducing OOV), every OOV word is assigned a random vector during the concatenation of awe and ace, and the model achieved f1-scores of 75.72% and 71.87%. This is an increase of 3.72% and 0.87% over MBert for dataset 1 and dataset 2. The gain is smaller for dataset 2 because it contains more spelling variation among words of similar meaning, which results in more OOV words. Hence, the model failed to capture relevant words.

  • CMHE-AN (proposed) is regularized using code-mixed hybrid embedding and initialized with meaningful weight vectors, resulting in comparatively better performance than the other models. It achieved the highest weighted average f1-scores of 77.09% and 73.34%, and accuracies of 77.54% and 75.23%, for datasets 1 and 2, respectively. It improved the f1-score by 5.09% and 2.34% over the state-of-the-art MBert. From Fig. 5, it is observed that CMHE reduced the OOV words to 33% and 35.7% of their original count for dataset 1 and dataset 2. This reduction in OOV increased the f1-score by 1.37% and 1.47%, which shows the impact of CMHE over other embeddings.

Table 9 Comparison results of the proposed approach with baseline models (in %)

Although CMHE-AN performs better than statistical features, finetuned random features, and MBert, there is much scope left for improvement. In the current work, the network is initialized with discriminative features, and an attention-based BiLSTM is employed to finetune it and extract sequential patterns. It can be further enhanced by associating part of speech information with each word while constructing the embedding, both to disambiguate hate words and for feature enrichment. For instance,

  1. the way she is eating, kutte khate h hmare yaha (offensive)

  2. kutte india m har place par milenge, no entry restriction on them (hate inducing)

In sentence 1, the word kutte is used as a profane word against someone, while in sentence 2 it is used to induce hate against a region. Moreover, assigning part of speech tags to code-mixed data is itself a challenge. Hence, it has been included in the future scope of the work.

Further, an ablation study was conducted to quantify the contribution of each component of CMHE-AN on dataset 1 and dataset 2. The following variants are considered.

W/O Attention layer: In this model, weight is initialized using CMHE and finetuned with Bi-LSTM and fully connected layer only. In other words, we are extracting features after omitting the attention layer from CMHE-AN.

W/O Code Mixed Hybrid Embedding: CMHE-AN is initialized with random weights and finetuned with an Attention-based framework.

Only code-mixed aggressive word embedding: CMHE-AN is initialized with only the code-mixed aggressive word embedding (awe). The OOV size is not reduced, and all OOV words are initialized randomly. The model is then finetuned by the attention-based framework.

Only code-mixed aggressive character embedding: CMHE-AN is initialized with only the code-mixed aggressive character embedding (ace). In this model, words are formed using subword embedding; hence, no OOV words are found.

Table 10 reports the weighted average f1 score of each ablative model; its gap to the full CMHE-AN model is denoted by δ (delta), as defined in (12).

$$ \delta = \text{CMHE-AN} - \text{ablative model}_{i} $$
(12)

Table 10 Ablation studies on a different component of the proposed model (CMHE-AN) in terms of weighted average f1 score

The ablative study of the W/O Attention layer model reveals that adding the attention layer increases the weights of the most contributing features and significantly improves the f1 score, by 2.2% and 2% for dataset 1 and dataset 2, respectively.

W/O Code Mixed Hybrid Embedding: This model drops the result by 6.09% and 2.89%. It shows that meaningful initialization of the network weights is a critical step compared to random initialization. While analysing the influence of code-mixed aggressive word embedding and character embedding separately, it is observed that there is a drop of 0.42% and 0.78% when using only word embedding, since it suffers from the OOV issue. 3167 out of 12915 words of training dataset 1 and 4034 out of 12015 in training dataset 2 were found to be OOV; therefore, these were initialized randomly.

Only code-mixed aggressive character embedding: This model captured words with morphological variation, but failed to capture words with related context. Thus, each component of CMHE-AN has its own significance in classifying offensive, aggressive text in Hi-En code-mixed text.
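
As a worked reading of (12), the f1 score of each ablative model can be recovered as the CMHE-AN score minus its δ. The short sketch below does this arithmetic with the scores and gaps quoted in this section; the dictionary layout and the helper name are assumptions, and the resulting values are implied by (12) rather than quoted from Table 10.

```python
# Weighted average f1 of the full model, as reported in this section (in %).
CMHE_AN_F1 = {"dataset 1": 77.09, "dataset 2": 73.34}

# Gaps (delta) reported in the ablation discussion (in %).
DELTA = {
    "W/O Attention layer": {"dataset 1": 2.2, "dataset 2": 2.0},
    "W/O Code Mixed Hybrid Embedding": {"dataset 1": 6.09, "dataset 2": 2.89},
    "Only code-mixed aggressive word embedding": {"dataset 1": 0.42, "dataset 2": 0.78},
}


def ablative_f1(full_f1, delta):
    """Rearranged (12): ablative score = full model score - delta."""
    return round(full_f1 - delta, 2)


implied = {
    model: {ds: ablative_f1(CMHE_AN_F1[ds], gap) for ds, gap in gaps.items()}
    for model, gaps in DELTA.items()
}
for model, scores in implied.items():
    print(model, scores)
```

Read this way, removing the hybrid embedding costs the most on both datasets, which matches the observation that initialization matters more than the attention layer.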

4.7 Comparison with the existing state of the art

In this section, the performance of the proposed approach is compared with the existing state of the art, as shown in Table 11. It is observed that CMHE-AN performed better than the other models. Since dataset 1 is imbalanced, it is evaluated using the weighted average f1-score, while dataset 2 is evaluated using the accuracy metric, as in the literature. Datta et al. [11] use a combination of linguistic features, frequency of aggressive words, part of speech, lexicon features, and statistical features, and achieved an f1 score of 59.45%. Mathur et al. [28] is based on an aggregation of pretrained word2vec, linguistic features, and character n-grams, and attained an f1 score of 62.92% using a gradient boosted decision tree. In order to increase the size of the dataset, [20] augmented the TRAC 2 datasets using random insertion and random deletion of words and initialized the training model with random weights. Kumari et al. [22] proposed an architecture based on finetuning random embedding using 3 layers of LSTM autoencoder and reported an f1 score of 74%. Inspired by the subword LSTM proposed in [16], [35] used phonic subword embedding, finetuned it with hierarchical LSTM attention, and attained an accuracy of 66.6%. Bohra et al. [7] concatenated linguistic features and character n-grams and attained their best accuracy using SVM. Mathur et al. [28] transliterated code-mixed text to English without considering the order of words, used supervised features from a prominent English dataset, and attained an accuracy of 71%. In [32], pretrained Hindi and English word2vec embeddings are concatenated, and a regression-based ensemble is applied on CNN, LSTM, MLP, and BERT. It improved the f1-score by 1.87% in comparison to MBert. However, concatenated pretrained embedding is not sufficient for code-mixed language understanding. Sharma et al. [36] focused on text normalization to convert Hinglish to Devanagari text and used muril embedding for text representation.
We retrained the model on the given datasets and attained a comparable result. However, language identification at the word level is challenging and still an open research problem for code-mixed data. From Table 11, it is observed that the presented work has shown better results in comparison to existing work.

Table 11 Comparison of the proposed approach with the state of the art

5 Conclusion and future work

This paper proposed a code-mixed hybrid embedding based attention network (CMHE-AN) to classify aggressive and offensive text in code-mixed social media content. It has drawn attention to issues in understanding offensive content in Hindi-English code-mixed language, such as nonstandard text, spelling variation, and OOV words. To overcome these issues, this paper proposed a code-mixed hybrid embedding (CMHE). CMHE is trained on a large, unsupervised text corpus at the word and character n-gram level. Training at the word level captures contextual relationships between words, while char n-grams effectively capture similar words with various spellings. Thus, the proposed algorithm used word and char n-gram features to reduce OOV words and initialized the network with discriminative features based on word polarity. Following this, the weights generated from CMHE are fine-tuned using a Bi-LSTM and a self-attention layer to extract relevant sequential features for classification. The results section shows that the proposed embedding has reduced OOV words to a large extent and successfully captured related words and words with spelling variation. In addition, an ablative study was done to investigate the influence of each component of the proposed architecture. It showed that the addition of the attention layer and CMHE significantly improved the performance. Finally, the effectiveness of the proposed approach was tested on two benchmark datasets, TRAC 2-2020 and Hi-En code-mixed Hate Speech. Extensive experiments show that CMHE-AN outperforms recent state-of-the-art models. The proposed model maps Hi-En code-mixed text to a dense, low-dimensional space and encodes knowledge of emotional, insulting, and hate-related text; therefore, it can substantially contribute to emotion classification, sentiment analysis, and toxicity analysis from the perspective of deep semantics. In addition, future study will concentrate on the following points.

  • Addition of user meta information, user pattern and behaviour in posting offensive content can be incorporated as a feature into the model.

  • Integration of word sense disambiguation information while building word embedding can enhance the knowledge to classify close data.

  • So far, the problem is modelled as a supervised task, but unsupervised learning can be explored to leverage the large amount of Hi-En code-mixed social media content.

  • Attention mechanism can be enhanced based on local and global mutual information, as explained by [26].