1 Introduction

Online social networks (OSNs) are platforms where people interact by posting texts, images, videos, etc. Communication on social media takes place through messages, comments, and chats Ahmed et al. (2020b). The impact of social media on both people’s personal lives and their professional circumstances is increasing. Comments are the most common and straightforward way for individuals to share their points of view with one another. Commenters can easily express their sentiments, opinions, and responses Ahmed et al. (2021b). Besides easing communication among people, social media has some drawbacks too. Among these, abusive comments and cyberbullying have become an alarming threat to social media users.

Bangla is the world’s sixth-largest language, spoken by more than 260 million people in daily life. The language also has historical importance: UNESCO declared International Mother Language Day in recognition of the Bangla Language Movement of 1952. Recently, Facebook has become the most influential social media platform for Bengali users. According to one statistic (footnote 1), 3.3% of Facebook users use the Bangla language, which amounts to 71 million people. A project (footnote 2) to predict hate crimes in Germany found a relationship between hate crimes and social media such as Facebook. In Bangladesh, cyberbullying has become a burning problem. In several past events in Bangladesh, communal crimes such as rioting erupted in a community due to Facebook. Besides, defamation, bullying, and harassment are major crimes that take place on social media platforms. In recent years, Bangladeshi celebrities and influencers have become victims of abusive comments after posting about different topics (footnote 3). It is therefore essential to monitor Facebook posts, comments, and shares to prevent such cybercrimes.

Research on these topics across different languages is in such demand that many recent works have been conducted. Nobata et al. (2016) developed a machine learning-based method for detecting abusive comments. Another work on this topic was proposed by Janardhana et al. (2021), in which the authors classified comments as abusive or non-abusive using a convolutional neural network (CNN).

Bidirectional Encoder Representations from Transformers (BERT) has gained popularity since it was introduced in natural language processing. This transformer-based model performs strongly in many NLP tasks such as text classification, entity recognition, and question answering. Adhikari et al. (2019), Yu et al. (2019), Ostendorff et al. (2019), and Chia et al. (2019) used BERT for text/document classification and proposed different modified BERT architectures for better performance. This transformer-based pre-trained language model also brings higher accuracy in other NLP areas such as sentiment analysis (e.g., Yu and Jiang (2019), Li et al. (2019), and Su et al. (2020)), question answering (e.g., Yuan (2019)), and entity extraction and recognition (e.g., Xue et al. (2019), Souza et al. (2019), and Ashrafi et al. (2020)).

Once BERT was introduced to the NLP world, it dominated. Nevertheless, in 2020 another pre-trained language model (PLM), ELECTRA, was proposed, and it overcame limitations of masked language models (MLM). ELECTRA trains a generator, which is trained like an MLM, and a discriminator, which is responsible for identifying the tokens replaced by the generator. ELECTRA has proved efficient in many NLP domains such as sentiment/emotion analysis (e.g., Xu et al. (2020), Al-Twairesh (2021)), fake news analysis (e.g., Das et al. (2020)), and text mining (e.g., Ozyurt (2020)). It also shows good performance in cyberbullying-related work in Pericherla and Ilavarasan (20218), the domain we focus on in this paper.

Automated systems for the Bangla language can be helpful for a large number of users around the world. Classifying Bangla abusive comments and taking proper steps is therefore as crucial as classifying English ones. However, little work has been done on this issue, and what exists targets other online media (e.g., YouTube) rather than Facebook. Existing approaches also do not use the latest technologies to achieve higher accuracy. In this paper, we contribute to this issue using the latest NLP technologies and propose a solution that classifies Bangla abusive comments more accurately. The contributions of this work are summarized below:

  • Propose a framework that classifies different types of abusive comments using the latest pre-trained language models (PLMs), BERT and ELECTRA.

  • Justify the model’s efficiency on real-world Bangla abusive comments.

  • Determine significant evaluation methods for analyzing the performance of our proposed model.

2 Related work

Sentiment analysis on social media is an emerging research topic. Different word embedding techniques combined with ML models have achieved good results in analyzing sentiments. Samad et al. (2020) used word embedding to classify sentiments from Twitter posts. Salur and Aydin (2020) combined different word embedding techniques with different deep learning techniques: combined features from Word2Vec, FastText, and character-level embedding were used in LSTM, GRU, BiLSTM, and CNN models to classify Turkish tweets. Moreover, a hybrid model built by combining CNN and BiLSTM outperformed the other models. Classification of three types of sentiments from tweets was studied by Alzamzami et al. (2020), who used the light gradient boosting machine (LGBM) framework. The proposed model was compared with six other conventional models such as linear regression, support vector machine, random forest, and gradient boosting, and achieved the best classification results. Sentiment classification using the interaction between tasks was proposed by Zhang et al. (2020) for Chinese blog posts. BiLSTM with attention and CRF were combined to extract features from the text, and the ERNIE model was used to classify texts.

Researchers are now focusing on developing methods to monitor social media platforms such as Facebook, Twitter, and Snapchat. General machine learning approaches are used to classify and detect abusive or toxic comments on social media. For the classification of hateful comments, Salminen et al. (2020) used a machine learning approach. Different machine learning (ML) algorithms such as logistic regression, naïve Bayes, support vector machines, XGBoost, and neural networks were used for the classification task. A dataset was constructed by collecting comments from social media such as YouTube, Twitter, Wikipedia, and Reddit. Different methods were used for feature representation, such as BOW, TF-IDF, Word2Vec, BERT, and their combination. The XGBoost classifier with combined features performed best, reaching an F1 score of 92%. Kurnia et al. (2020) proposed a model for classifying comments that used SVM as the classifier with Word2Vec embedding. Pre-processing comprised tokenizing, cleaning, and removal of stop words. An F1 score of 79% was reported in this experiment.

Nowadays, deep learning is widely used in natural language processing tasks. Park and Fung (2017) implemented CNN models for detecting abusive tweets on Twitter. The proposed models were based on character-level CNN, word-level CNN, and a hybrid of the two. The hybrid CNN model achieved the highest F1 measure when classifying tweets based on racism and sexism. To detect cyberbullying in English, Iwendi et al. (2020) used deep learning algorithms, namely RNN, GRU, LSTM, and BiLSTM. The pre-processing stage consisted of cleaning, tokenizing, stemming, and lemmatization. BiLSTM achieved a test accuracy of 82.18% and outperformed the other models.

Little research has been done on Bengali sentiment analysis on social media due to insufficient data. Some works collected a small amount of data manually and applied different ML approaches. To detect abusive comments in social media, Awal et al. (2018) proposed a classifier using the naïve Bayes algorithm. In this approach, English comments were collected from YouTube and then translated into Bangla. During pre-processing, comments were first tokenized; then, selected words were used to prepare a bag-of-words (BOW) vector. The classifier reported an accuracy of 80.57%. Emon et al. (2019) proposed an approach for detecting abusive texts based on a deep learning algorithm and compared the results with several machine learning algorithms. Different social media and news sites, namely YouTube, Prothom Alo, and Facebook, were used as data sources in this study. The pre-processing step included stemming and the removal of unwanted digits, punctuation, and whitespace. For feature extraction, a count vectorizer, a TF-IDF vectorizer, and word embedding were used. The experimental results showed that the deep learning-based approach, RNN (LSTM), achieved the highest accuracy of 82.2% and outperformed other ML algorithms such as naïve Bayes, logistic regression, random forest, and ANN. To classify sentiment and emotions in the Bangla language, Tripto and Ali (2018) built a deep learning model. This model was designed to identify Bangla sentences that belong to multi-labeled emotion and sentiment classes. The dataset used to train the model consists of comments in Bangla, English, and romanized Bangla from YouTube videos. The Word2Vec algorithm was used for vector representation, and two models, LSTM and CNN, were used to analyze both sentiments and emotions. LSTM reported an accuracy of 53% for five-class sentiment and 59% for emotion classes, while CNN achieved 52% and 54%, respectively.

Transformer-based models like BERT and ELECTRA are steadily gaining popularity for NLP-related work. The BERT model was applied by Yadav et al. (2020) to detect cyberbullying. Two different datasets were used to train and test the model. Reported accuracies were 98% and 96% for the Formspring and Wikipedia datasets, respectively. An approach for detecting hostile posts on social media was presented by Shukla et al. (2021) using a relational graph convolutional network (RGCN) with BERT embedding. Tweets in the Hindi language were collected and translated into English for training and validation. Furthermore, posts were classified into different categories such as offensive, fake, and hate. The proposed model achieved an F1 score of 97%. Logistic regression with TF-IDF and DBOW was used by Bauer et al. (2019) to build a chatbot for detecting sexual harassment and its types. Furthermore, a fine-tuned BERT model was used for the named entity recognition task. The proposed methods identified harassing comments with over 80% accuracy and location and date with over 90% accuracy. For biomedical text analysis, Ozyurt (2020) used the pre-trained ELECTRA model and showed that the proposed model performs better than the BERT model. To detect sentiment and sarcasm in Arabic, Farha and Magdy (2021) used the ELECTRA model. To identify fake news spreaders on Twitter, Das et al. (2020) used ensembled ELECTRA models on Spanish and English.

3 Preliminary and proposed framework

3.1 Transformer-based learning

Transformer-based learning has brought revolutionary changes to the field of natural language processing. This architecture processes sequential inputs using an attention mechanism. Like RNN-based sequence models, it has an encoder–decoder structure. The encoder maps an input sequence (\(x_1\),...,\(x_n\)) to a continuous representation z = (\(z_1\),...,\(z_n\)). Given z, the decoder generates an output sequence (\(y_1\),...,\(y_m\)) one element at a time in auto-regressive steps. The encoder and decoder stacks use self-attention and position-wise fully connected layers.

Encoder and Decoder Stacks Each stack contains N = 6 identical layers. Each encoder layer has two sublayers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. For a sublayer function Sublayer(x), the output of each sublayer is LayerNorm(x + Sublayer(x)), and the output dimension is \(d_{\mathrm{model}}=512\). Unlike the encoder, each decoder layer has a third sublayer that applies multi-head attention over the output of the encoder stack.
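The residual connection followed by layer normalization, LayerNorm(x + Sublayer(x)), can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the learned gain and bias of layer normalization are omitted, the vector is a plain Python list, and `sublayer` is a hypothetical stand-in for the attention or feed-forward function.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

# Hypothetical sublayer used only for illustration: doubles every component.
out = residual_sublayer([1.0, 2.0, 3.0], lambda v: [2.0 * a for a in v])
```

The residual term keeps the original input available to later layers, and the normalization keeps the scale of the activations stable from layer to layer.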

Attention Transformers utilize the multi-head self-attention mechanism. Three different uses of this attention mechanism have been implemented here. They are:

  • In encoder–decoder attention layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

  • In encoder self-attention layers, all queries, keys, and values come from the same place, which is the output of the previous layer in the encoder.

  • In decoder self-attention layers, the auto-regressive property is maintained by preventing leftward information flow. This is implemented inside the scaled dot-product attention by masking out (setting to \(-\infty\)) all values in the input of the softmax that correspond to illegal connections.
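The masking described in the last item can be sketched as follows. This is a single-head illustration in plain Python, assuming the queries, keys, and values are given as lists of equal-length vectors; the function names are ours, and the multi-head projections are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product self-attention in which position i may only
    attend to positions j <= i (illegal connections are set to -inf)."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # masked-out connection
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        width = len(values[0])
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]
        )
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(x, x, x)
```

Because the first position can attend only to itself, its output equals its own value vector regardless of what follows it in the sequence.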

Fig. 1 Transformer model architecture (the left and right halves of this figure sketch how the encoder and decoder of the transformer, respectively, work using stacked self-attention and point-wise fully connected layers)

Figure 1 shows the visual representation of transformer-based model architecture.

3.1.1 Bidirectional encoder representations from transformers (BERT)

BERT is a powerful transformer-based architecture that provides state-of-the-art results in various NLP tasks. It is a multilayered bidirectional transformer encoder Devlin et al. (2018). The input to BERT can unambiguously represent either a single sentence or a pair of sentences as one token sequence. In this sequence, the first token is the classification token [CLS]. When a pair of sentences is packed together as input, BERT distinguishes the two sentences in two ways. First, a special token [SEP] separates them. Second, a learned embedding is added to each token to indicate whether it belongs to the first or the second sentence of the pair.
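The input packing described above can be illustrated with a short sketch. It assumes the sentences are already tokenized, and it follows the common convention of also closing the sequence with [SEP]; the function name `pack_pair` and the 0/1 segment ids are illustrative choices rather than part of our pipeline.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Build a BERT-style token sequence and matching segment ids.

    Segment id 0 marks [CLS], the first sentence, and its [SEP];
    segment id 1 marks the second sentence and its closing [SEP].
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair(["a", "b"], ["c"])
```

The segment ids select which learned segment embedding is added to each token, which is how the model tells the two sentences apart.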

BERT Framework has two steps, and they are pre-training and fine-tuning. These two steps are explained below.

Pre-training BERT Unlike left-to-right or right-to-left models, BERT is pre-trained as a masked language model (MLM). It is pre-trained on unlabeled data using unsupervised learning. During this process, some input tokens are randomly masked and then predicted. It is also pre-trained to capture the relationship between paired sentences. BERT is pre-trained on BooksCorpus (800M words) Zhu et al. (2015) and the text passages of English Wikipedia (2500M words), excluding lists, headers, and tables.

Fine-tuning BERT BERT can be applied to many downstream tasks, whether they involve single sentences or sentence pairs, by swapping in the appropriate inputs and outputs. During fine-tuning, BERT is first initialized with the pre-trained parameters, and then all parameters are fine-tuned using labeled data from the downstream task. We sketch the BERT architecture in Fig. 2.

Fig. 2 BERT architecture (BERT uses the pre-trained model parameters to initialize the model and then fine-tunes these parameters during fine-tuning; [CLS] is added to the front of the input sequence)

3.1.2 Multilingual BERT (mBERT)

Multilingual BERT (mBERT) Libovickỳ et al. (2019) is a BERT model pre-trained on 104 languages. Its representations can be split into language-neutral components and language-specific components. The probing tasks evaluated on mBERT are:

Language Identification A linear classifier is trained on top of the sentence representation and tries to identify the language of the sentence.

Language Similarity Languages that are more similar tend to behave similarly in POS tagging Pires et al. (2019). These similarities are quantified with the V-measure Rosenberg and Hirschberg (2007) on language clusters grouped by language family.

Parallel Sentence Retrieval For each sentence in a parallel corpus, the cosine distance between its representation and the representations of all sentences on the parallel side of the corpus is computed. The sentence with the smallest distance is selected.

Word Alignment Word alignment is determined as a minimum weighted edge cover of a bipartite graph.

Machine Translation (MT) Quality Estimation The cosine distance between the representation of the source sentence and the representation of the MT output is used to estimate translation quality.
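Two of the probing tasks above, parallel sentence retrieval and MT quality estimation, rely on the cosine distance between sentence representations. The following minimal sketch shows the retrieval step; it assumes the sentence vectors have already been computed and are non-zero, and the function names are illustrative.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def retrieve(source_vec, target_vecs):
    """Return the index of the target vector closest to source_vec."""
    distances = [cosine_distance(source_vec, t) for t in target_vecs]
    return min(range(len(target_vecs)), key=distances.__getitem__)

# Toy 2-dimensional "sentence representations", for illustration only.
best = retrieve([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])
```

Cosine distance ignores vector length, so the second candidate is selected even though its norm differs from that of the query.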

Fig. 3 Pre-training procedure of ELECTRA (the figure depicts how replaced tokens are detected; the generator is trained with maximum likelihood and produces an output distribution over tokens, and the discriminator is then fine-tuned for downstream tasks)

3.2 ELECTRA

To detect and classify abusive comments, we also utilize another transformer-based architecture, ELECTRA Clark et al. (2020). ELECTRA is a comparatively small transformer that nevertheless achieves high performance. ELECTRA uses two neural networks, a generator G and a discriminator D. Each consists of an encoder that maps a sequence of input tokens \(x = [x_1,\ldots ,x_n ]\) into a sequence of contextualized vector representations \(h(x) = [h_1,\ldots ,h_n ]\). For a given position t, the generator outputs the probability of generating a specific token \(x_t\) with a softmax layer:

$$\begin{aligned} p_G(x_t\mid x) = \frac{\exp \left( e(x_t)^T h_G(x)_t\right) }{\sum _{x'}\exp \left( e(x')^T h_G(x)_t\right) } \end{aligned}$$
(1)

Here e represents the token embeddings. The discriminator predicts whether the token \(x_t\) at position t is real, that is, whether it comes from the data rather than from the generator distribution. For this prediction, it uses a sigmoid output layer:

$$\begin{aligned} D(x,t)=\mathrm {sigmoid}(w^Th_D(x)_t) \end{aligned}$$
(2)

The generator is trained to perform MLM. For an input \(x = [x_1, x_2,\ldots ,x_n ]\), MLM selects a random set of positions \(m = [m_1,\ldots ,m_k ]\) to mask out. The tokens at those positions are replaced with a [MASK] token:

$$\begin{aligned} x^{\mathrm {masked}}=\mathrm {REPLACE}(x, m, \text {[MASK]}) \end{aligned}$$
(3)

The generator then learns to predict the original identities of the masked-out tokens, and the discriminator learns to distinguish tokens from the data from tokens that have been replaced by the generator. More specifically, a corrupted example \(x^{\mathrm{corrupt}}\) is created by replacing the masked-out tokens with samples from the generator, and the discriminator is trained to predict which tokens in \(x^{\mathrm{corrupt}}\) match the original input x. The model inputs are constructed as follows:

$$\begin{aligned} \begin{aligned} m_i\sim \mathrm {unif}\{1, n\}\ \mathrm {for}\ \textit{i}=1\ \mathrm {to}\ k \\ x^{\mathrm {masked}}=\mathrm {REPLACE}(x, m, \text {[MASK]}) \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} \begin{aligned} \hat{x}_i \sim p_G (x_i\mid x^{\mathrm {masked}})\ \mathrm {for}\ i \in m \quad \\ x^{\mathrm {corrupt}} =\mathrm {REPLACE}(x, m,\hat{x}) \end{aligned} \end{aligned}$$
(5)
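The construction in Eqs. 4 and 5 can be sketched in a few lines. The sketch makes several assumptions: positions are 0-based and drawn without repetition, `sample_token` is a hypothetical stand-in for sampling from \(p_G\), and the returned labels mark which tokens of the corrupted sequence still match the original input.

```python
import random

MASK = "[MASK]"

def replace(tokens, positions, replacements):
    """REPLACE(x, m, r): copy of x with the tokens at positions m replaced.
    `replacements` is either one token (used everywhere) or one per position."""
    out = list(tokens)
    for idx, pos in enumerate(positions):
        out[pos] = replacements if isinstance(replacements, str) else replacements[idx]
    return out

def build_inputs(tokens, k, sample_token, rng):
    """Return (x_masked, x_corrupt, labels) for one input sequence."""
    positions = rng.sample(range(len(tokens)), k)     # positions m (0-based)
    masked = replace(tokens, positions, MASK)         # Eq. 4
    samples = [sample_token(masked, pos) for pos in positions]
    corrupt = replace(tokens, positions, samples)     # Eq. 5
    labels = [int(c == t) for c, t in zip(corrupt, tokens)]  # 1 = original
    return masked, corrupt, labels

sentence = ["a", "b", "c", "d", "e"]
# Hypothetical generator that always proposes the token "z".
masked, corrupt, labels = build_inputs(
    sentence, 2, lambda masked_seq, pos: "z", random.Random(0)
)
```

If the sampled token happens to equal the original token, its label is 1, so the discriminator treats it as an original token rather than a replaced one.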

The loss functions are:

$$\begin{aligned} {\mathcal {L}}_{\mathrm {MLM}} (x,\theta _{G})={\mathbb {E}} \left( \sum _{i \in m}-\mathrm {log} p_G(x_i\mid x^{\mathrm {masked}})\right) \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\mathrm {Disc}}(x,\theta _{D})= {\mathbb {E}}\Bigg ( \sum _{t=1}^{n}-\mathbb {1} (x_t^{\mathrm {corrupt}}=x_t) \\ \mathrm {log} D(x^{\mathrm {corrupt}},t) - \mathbb {1} (x_t^{\mathrm {corrupt}}\ne x_t) \\ \mathrm {log} (1-D(x^{\mathrm {corrupt}},t)) \Bigg ) \end{aligned} \end{aligned}$$
(7)

The generator is trained jointly with the discriminator, but this setup differs from adversarial training in some key ways. If the generator happens to produce the correct token, that token is labeled ‘real’ rather than ‘fake.’ In addition, the generator is trained with maximum likelihood rather than adversarially.

The combined loss is minimized over a large corpus \({\mathcal {X}}\):

$$\begin{aligned} \underset{\theta _G,\theta _D}{\mathrm {min}}\sum _{x\in {\mathcal {X}}}{\mathcal {L}}_{\mathrm {MLM}}(x,\theta _G)+\lambda {\mathcal {L}}_{\mathrm {Disc}}(x,\theta _D) \end{aligned}$$
(8)
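For a single example, the losses in Eqs. 6 to 8 can be evaluated as in the sketch below. It assumes the generator probabilities of the original tokens at the masked positions and the discriminator outputs \(D(x^{\mathrm{corrupt}},t)\) are already available as plain floats; the expectation over masks and the sum over the corpus are left out, and the weight `lam` must be supplied by the caller.

```python
import math

def mlm_loss(gen_probs_of_original):
    """Eq. 6 for one example: sum over masked positions of -log p_G(x_i | x_masked)."""
    return sum(-math.log(p) for p in gen_probs_of_original)

def disc_loss(disc_outputs, is_original):
    """Eq. 7 for one example. disc_outputs[t] = D(x_corrupt, t);
    is_original[t] is True when x_corrupt_t equals x_t."""
    total = 0.0
    for d, original in zip(disc_outputs, is_original):
        total += -math.log(d) if original else -math.log(1.0 - d)
    return total

def combined_loss(gen_probs_of_original, disc_outputs, is_original, lam):
    """Per-example term of Eq. 8: L_MLM + lambda * L_Disc."""
    return mlm_loss(gen_probs_of_original) + lam * disc_loss(disc_outputs, is_original)

# Illustrative numbers only; they are not taken from our experiments.
loss = combined_loss([0.5, 0.5], [0.9, 0.2], [True, False], lam=2.0)
```

The discriminator term covers every position in the sequence, whereas the MLM term covers only the masked positions.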

Figure 3 shows the pre-training procedure of ELECTRA.

Fig. 4 Our proposed framework for classifying Bangla abusive comments (the figure depicts the working procedure of our classifier, from text retrieval to the classification of abusive comments using transformer-based architectures)

BERT and ELECTRA, both transformer-based architectures, differ in their number of layers, hidden sizes, and number of parameters. We list these in Table 1.

Table 1 Layers, hidden size and parameters of different models

3.3 Proposed framework

Figure 4 presents our proposed framework. We preprocess the comment texts from the dataset before training our model. We then train the model on the preprocessed comments using transformer-based learning, which enables it to classify abusive comments. Finally, we fine-tune the model using different hyperparameter values. This step helps the model to predict classes more accurately.

4 Experiments and results

4.1 Environment specifications

Training deep learning models requires very high computing power for the parallel processing of tasks. We therefore used Google Colab, a cloud-based Jupyter notebook platform that offers the required hardware options such as GPU and TPU Carneiro et al. (2018). Google Colab provides a Python runtime with pre-configured libraries and packages for deep learning-based tasks. It runs on Ubuntu with an NVIDIA Tesla K80 GPU with 12 GB of GPU memory.

4.2 Experimental dataset

Datasets that contain Bangla abusive comments in a categorized manner are scarce. Recently, Ahmed et al. (2021a) published a dataset that contains labeled comments from Facebook to aid NLP researchers. This dataset focuses on detecting comments involving bullying or harassment. It covers five classes that are common on social media: sexual, troll, religious, threat, and not bully. The dataset includes a total of 44001 comments. We show some example comments with their respective classes in Table 2.

Table 2 Examples of different comments from the dataset with their respective classes

Table 3 shows the overall dataset splitting and some statistical information about our dataset with five classes. Here, the Average Length and Max Length columns give the average and maximum non-tokenized sequence lengths, respectively. Similarly, the Average Token Length and Max Token Length columns give the average and maximum tokenized lengths.

Table 3 Overall dataset splitting

4.3 Data preprocessing

The dataset contains raw comments with special characters such as #, @, and - along with emoji, white spaces, HTML tags, URLs, and punctuation. These are unnecessary and are removed from the texts. Since we focus only on the Bangla language, any comments or texts in which more than 20% of the content is in other languages have been removed. For feature extraction, we applied the tri-gram model with word tokenization. Tokenization is done in a way that works efficiently with the models: basic tokenization is applied first, followed by WordPiece tokenization. Along with tokenization, we also applied lemmatization, stemming, and sentence segmentation Ahmed et al. (2020a).
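The cleaning and language-filtering steps can be sketched with regular expressions. The patterns below are illustrative assumptions rather than the exact rules we used: Bangla text is identified by the Unicode block U+0980 to U+09FF, the 20% threshold is applied to the share of whitespace-separated words that contain no Bangla character, and the function names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Everything except Bangla characters, zero-width joiners, Latin letters,
# digits, and whitespace is dropped (punctuation, symbols, emoji).
DROP_RE = re.compile(r"[^\u0980-\u09FF\u200c\u200dA-Za-z0-9\s]")
BANGLA_RE = re.compile(r"[\u0980-\u09FF]")

def clean_comment(text):
    """Remove URLs, HTML tags, special characters, and extra whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = DROP_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def other_language_ratio(cleaned):
    """Share of words that contain no Bangla character."""
    words = cleaned.split()
    if not words:
        return 0.0
    foreign = sum(1 for w in words if not BANGLA_RE.search(w))
    return foreign / len(words)

def keep_comment(text, max_other=0.20):
    """Keep a comment only if it is non-empty and mostly Bangla after cleaning."""
    cleaned = clean_comment(text)
    return bool(cleaned) and other_language_ratio(cleaned) <= max_other

cleaned = clean_comment("আমি <b>ভাত</b> খাই http://example.com #@ 😀")
```

Measuring the ratio on words rather than characters is one possible design choice; a character-level ratio would treat long romanized words differently.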

4.3.1 Lemmatization based on Levenshtein distance

Bangla is a shallow orthographic language with many regional dialect differences. Bangla words are diverse; sometimes a single word can appear in several different forms. For example, the word can be used as , , , , , , , , etc. in different situations. To obtain effective results, we need to lemmatize these words to their root. To do so, we compute the Levenshtein distance between words and lemmatize them into root words. The Levenshtein distance quantifies how dissimilar two words are: it is the number of operations (insertion, deletion, substitution) needed to transform one string into the other. A higher value indicates higher dissimilarity.

Equation 9 gives the function used to obtain the Levenshtein distance between two words \(w_1\) and \(w_2\) of lengths \(|w_1|\) and \(|w_2|\). We also add examples in Tables 4 and 5. The root word for is . has fewer edit distances from than word .

$$\begin{aligned} \text {lev}_{w_1,w_2}(i,j)={\left\{ \begin{array}{ll} \max (i,j) & \text { if } \min (i,j)=0, \\ \min {\left\{ \begin{array}{l} \text {lev}_{w_1,w_2}(i-1,j)+1 \\ \text {lev}_{w_1,w_2}(i,j-1)+1 \\ \text {lev}_{w_1,w_2}(i-1,j-1)+1_{(w_{1,i}\ne w_{2,j})} \end{array}\right. } & \text { otherwise} \end{array}\right. } \end{aligned}$$
(9)
Table 4 Example of edit distance
Table 5 Example of edit distance
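The recurrence in Eq. 9 can be computed row by row with dynamic programming, and a word can then be mapped to the candidate root with the smallest distance. The sketch below is a minimal illustration: it compares Unicode code points (so Bangla vowel signs count as separate characters), and the English words and the root list are hypothetical placeholders for a Bangla root dictionary.

```python
def levenshtein(w1, w2):
    """Levenshtein distance between w1 and w2, following Eq. 9."""
    previous = list(range(len(w2) + 1))        # lev(0, j) = j
    for i, c1 in enumerate(w1, start=1):
        current = [i]                          # lev(i, 0) = i
        for j, c2 in enumerate(w2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

def lemmatize(word, roots):
    """Return the candidate root with the smallest edit distance to word."""
    return min(roots, key=lambda root: levenshtein(word, root))

root = lemmatize("walking", ["walk", "talk", "wall"])
```

Keeping only the previous row makes the memory cost proportional to the length of the second word instead of the product of both lengths.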

4.4 Performance evaluation metrics

We have used different metrics to evaluate the performance of our work. The confusion matrix visualizes correct and incorrect predictions, and evaluation metrics verify the model’s performance. True positives (TP) and true negatives (TN) are correct predictions of a model. On the other hand, false positives (FP) and false negatives (FN) are wrong predictions Ahmed et al. (2021c). From these four confusion matrix counts, we can derive numerous evaluation metrics. To verify our model’s performance, we determine the accuracy, precision, recall, and F1 score of the model. The formulas of these evaluation metrics are given here:

$$\begin{aligned} \mathrm{Accuracy}= & {} \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \end{aligned}$$
(10)
$$\begin{aligned} \mathrm{Precision}= & {} \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \end{aligned}$$
(11)
$$\begin{aligned} \mathrm{Recall}= & {} \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \end{aligned}$$
(12)
$$\begin{aligned} F1 \, \mathrm{score}= & {} \frac{2 * \mathrm{Precision} * \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \end{aligned}$$
(13)
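For a multi-class problem such as ours, Eqs. 10 to 13 can be applied to each class in a one-vs-rest manner. The sketch below assumes that `confusion[i][j]` counts the samples of true class i predicted as class j; the function name and the toy matrix are illustrative only.

```python
def per_class_metrics(confusion, cls):
    """Accuracy, precision, recall, and F1 score for one class (one-vs-rest)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # true cls, predicted other
    fp = sum(confusion[r][cls] for r in range(n)) - tp   # other, predicted cls
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy two-class confusion matrix, not taken from our experiments.
metrics = per_class_metrics([[8, 2], [1, 9]], cls=0)
```

Averaging the per-class values (for example, with a macro average) gives a single score per metric for the whole model.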

4.5 Results and discussions

Table 6 Average classification results for BERT-Base model with different learning rate
Table 7 Class-wise classification result of ELECTRA-Base with learning rate

The results of our experiments are given in this section. First, we present the classification results for the BERT and ELECTRA models in Tables 6 and 7. We also report the results for different learning rates. We find that, among five learning rates, both models perform best with a learning rate of 2e−04. The highest scores for the metrics (precision, recall, and F1 score) are reported for this learning rate.

We also present the normalized confusion matrix for the BERT model in Fig. 5a. It can be seen from the confusion matrix that the highest true positive (TP) rate is 0.92, which is observed for the Troll class. Among the other classes, the Sexual and Threat classes reported higher TP rates. However, the other two classes, Not Bully and Religious, showed lower TP rates. In the normalized confusion matrix of the ELECTRA model in Fig. 5b, the Troll and Threat classes achieved the highest TP rates of 0.95 and 0.90, respectively. A lower TP rate is reported for the Sexual class. Comparing the two models, the Troll class has the highest TP rate in both. However, the class with the lowest TP rate is Not Bully for the BERT model and Sexual for the ELECTRA model.
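The true positive rates discussed above are the diagonal entries of a confusion matrix whose rows have been normalized. A minimal sketch of this normalization follows, assuming that rows correspond to true classes and columns to predicted classes; the counts are illustrative and not taken from Fig. 5.

```python
def normalize_rows(confusion):
    """Divide each row by its sum so that every row adds up to 1."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

normalized = normalize_rows([[8, 2], [1, 9]])
tp_rates = [normalized[i][i] for i in range(len(normalized))]
```

With this convention, the diagonal entry of a class is its recall, which is why classes of different sizes can be compared directly.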

In Fig. 6, we show accuracy and loss for both BERT and ELECTRA models over each epoch. From Fig. 6a, it can be seen that the training accuracy of the ELECTRA model increases over epochs together with the test accuracy. After some iterations, the curves become stable. Both train and test loss decrease gradually. When the model reaches its stable state, the train and test losses are close to zero, which suggests that the model generalizes well. The highest training accuracy is 97.87%, while the test accuracy is 84.92%. In Fig. 6b, the train and test progression of the BERT model is presented for each epoch. Here we also reach a stable state for both accuracy and loss. The reported maximum training accuracy is 98.09%, and the maximum test accuracy is 85.00%.

Fig. 5 Confusion matrix for BERT-Multilingual and ELECTRA-Base models in a normalized form

There are other variants of both BERT and ELECTRA models. To compare these variants, Table 8 presents their performance in terms of precision, recall, and F1 score. We tested the models with different learning rates and obtained good results with a learning rate of 2e−04. From this table, it can be seen that the BERT-Base and ELECTRA-Base models outperform the others. The ELECTRA-Large model also performed well with a learning rate of 2e−04.

Table 8 Comparison between different variants of BERT and ELECTRA models with different learning rates
Fig. 6 Accuracy and loss for the ELECTRA and BERT models over epochs (we run our models for 30 epochs; for each epoch, the figure shows the training progression of our models in terms of accuracy and loss)

We have also compared the results of the BERT and ELECTRA models on our dataset with other deep learning-based approaches in Fig. 7. LSTM, Bi-LSTM, LSTM-GRU, and Graph CN are implemented alongside the BERT and ELECTRA model variants. This comparison shows that the LSTM model has the lowest test accuracy, 76.9%. The highest test accuracy is 85%, which is observed for the BERT-Base model. The ELECTRA-Base model reached the second-highest test accuracy of 84.92%. We present the results for the learning rate of 2e−04 since we find that all the models performed better with this learning rate than with the others.

Fig. 7 Test accuracy comparison between different models.

4.6 Observation of BERT and ELECTRA architecture

In this subsection, we investigate the performance of the BERT and ELECTRA models on datasets containing different classes of texts, comments, or sentences. We used three open-source datasets for this purpose, named ProthomAlo (footnote 4), BARD Alam and Islam (2018), and OSBC (footnote 5). These datasets include sentences of different classes such as sports, entertainment, politics, economy, technology, crime, art, opinion, and education.

In Tables 9 and 10, we show the results obtained with our BERT-base and ELECTRA-base models. Both models performed well in classifying different classes of Bangla texts. For the ProthomAlo dataset, the BERT and ELECTRA models achieved classification accuracies of 97.23% and 95.82%, respectively. On the other hand, for the BARD and OSBC datasets, the ELECTRA model performed slightly better than the BERT model. Other metrics such as precision, recall, and F1 score are also significant for NLP tasks like text classification. These results indicate that both BERT and ELECTRA models have excellent generalization capabilities when classifying Bengali texts.

Table 9 Performance of BERT-base model on datasets containing different classes of Bengali texts
Table 10 Performance of ELECTRA-base model on datasets containing different classes of Bengali texts

Transformer-based models avoid recurrence and process the text as a whole. They automatically extract the relationships between words by employing techniques like multi-head attention and positional embeddings. The BERT model we use is multilingual and is trained on a Wikipedia corpus. In the data sampling phase, weights are adjusted exponentially, so words from low-resource languages like Bangla are represented better than in other models like CNN or RNN. On the other hand, ELECTRA utilizes the replaced token detection (RTD) method, which is more computationally efficient than the pre-training of BERT. A generator network is employed to replace tokens with alternative samples. Unlike the BERT model, which uses masked language modeling, the ELECTRA model replaces tokens with plausible but fake samples. This strategy helps the network to learn a better representation of words.

To the best of our knowledge, the research effort presented in this paper is unique: no prior works focus on Bangla comments on Facebook. BERT and ELECTRA models were used previously for languages like English, German, and Arabic. However, our study emphasizes Facebook data specifically for the Bangla language, which is novel. Besides, our experiment is conducted on a more structured dataset with a large collection of data. We also presented the effects of different learning rates on the fine-tuning of transformer-based models. Taken together, our findings and results give us confidence that our proposed approach can accurately detect Bangla abusive comments on Facebook and other social media platforms.

5 Conclusion and future work

We attempt to provide an automated intelligent solution to the growing cyberbullying problem in Bangladesh. We proposed a transformer-based system that is capable of classifying abusive comments written in the Bangla language. Two recent transformer-based architectures, BERT and ELECTRA, are implemented for the Bangla language here. These two efficient architectures achieve remarkable accuracy in our experiment. Furthermore, we conducted our experiments on real-world abusive comments taken from social media (Facebook). To justify our classifier, we describe several evaluation procedures. We also determine the confusion matrix and evaluation metrics based on our classifier’s predictions. The values of precision, recall, and F1 score for the different classes indicate how correctly our model classifies the abusive comments. We also show the loss of BERT and ELECTRA over each epoch for our experiment. Finally, we compare the accuracy of different deep learning architectures to put the performance of our proposed models in context.

In the future, to increase the efficiency of our classifier, we want to train it with other regional forms of Bangla. We also aim to identify abusive comments at an early stage in any application by exposing the classifier through a REST API and GraphQL. We further plan to develop an automated tool to detect spam users and block or report them.