1 Introduction

Social media is a very popular way for people to express their opinions publicly and to interact with others online. In aggregate, social media can provide a reflection of public sentiment on various events. Unfortunately, any user engaging online, whether on social media, forums or blogs, runs the risk of being targeted or harassed through abusive language expressing hate in the form of racism or sexism, with possible impact on his/her online experience and on the community in general. The existence of social networking services creates the need for detecting user-generated hateful messages prior to publication. Any published text that is used to express hatred towards a particular group with the intention to humiliate its members is considered a hateful message.

Although hate speech is protected under the free speech provisions of some countries, e.g. the United States, other countries, such as Canada, France, the United Kingdom, and Germany, have laws prohibiting it when it promotes violence or social disorder. Social media services such as Facebook and Twitter have been criticized for not doing enough to prohibit the use of their services for attacking people belonging to a specific race, minority etc. [15]. They have, however, announced that they would seek to battle racism and xenophobia [5]. Nevertheless, the solutions deployed so far by, e.g., Facebook and Twitter address the problem with manual effort, relying on users to report offensive comments [3]. This not only requires a huge effort by human annotators, but it also carries the risk of discrimination under subjective judgment. Moreover, manual annotation has a strong impact on response time, since a computer-based solution can accomplish the task much faster than humans. The massive rise in user-generated content on these social media services, combined with the fact that manual filtering does not scale, highlights the need for automating the process of online hate-speech detection.

Although the majority of solutions for automated detection of offensive text rely on Natural Language Processing (NLP) approaches, there has lately been a tendency towards employing pure machine learning techniques, such as neural networks, for the task. NLP approaches have the drawback of being complex and, to a large extent, dependent on the language used in the text. This provides a strong motivation for employing alternative machine learning models for the classification task. Moreover, the majority of existing automated approaches depend on pre-trained vectors (e.g. GloVe, Word2Vec) as word embeddings to achieve good classification performance. This makes the detection of hateful content infeasible in cases where users have deliberately obfuscated their offensive terms with short slang words.

There is a plethora of unsupervised learning models in the existing literature dealing with hate-speech [21], as well as with detecting the sentiment polarity of tweets [2]. At the same time, supervised learning approaches have still not been explored adequately. While the task of sentence classification seems similar to that of sentiment analysis, in hate-speech even negative sentiment can still provide useful insight. Our intuition is that the task of hate-speech detection can further benefit from incorporating other sources of information as features in a supervised learning model. A simple statistical analysis of an existing annotated dataset of tweets [24] easily reveals a significant correlation between a user's tendency to express opinions belonging to an offensive class (Racism or Sexism) and the annotation labels associated with that class. More precisely, the correlation coefficient describing this user tendency was found to be 0.71 for racism in the above dataset, while it reached as high as 0.76 for sexism. In our opinion, utilizing such user-oriented behavioral data to reinforce an existing solution is feasible, because this information is retrievable in real-world use-case scenarios such as Twitter. This highlights the need to explore user features more systematically, to further improve the classification accuracy of a supervised learning system.
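As an illustration, the following sketch shows one way such a correlation could be computed from a labeled tweet collection; the dataframe layout and column names are our assumptions, not part of the original study.

```python
import pandas as pd

def tendency_correlation(tweets: pd.DataFrame, target: str) -> float:
    """Pearson correlation between a user's tendency towards `target`
    (fraction of their tweets labeled with it) and the per-tweet labels."""
    is_target = (tweets["label"] == target).astype(float)
    tendency = is_target.groupby(tweets["user_id"]).transform("mean")
    return tendency.corr(is_target)

# e.g. tendency_correlation(df, "racism") on the dataset of [24]
# would be expected to yield a value around 0.71
```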

Our approach employs a neural network solution composed of multiple classifiers based on Long Short-Term Memory (LSTM) networks and utilizes user behavioral characteristics, such as the tendency towards racism or sexism, to boost performance. Although our technique is not necessarily revolutionary in terms of the deep learning models used, we show in this paper that it is quite effective.

Our main contributions are: i) a deep learning architecture for classifying text in terms of hateful content, which incorporates features derived from the users’ behavioral data, ii) a language-agnostic solution for detecting hate-speech, owing to the fact that no pre-trained word embeddings are used, and iii) an experimental evaluation of the model on a Twitter dataset, demonstrating top performance on the classification task. We put special focus on investigating how additional features concerning the users’ tendency to utter hate-speech, as expressed by their posting history, can improve performance. To the best of our knowledge, no previous study has explored features related to users’ tendency towards hateful content in a deep learning model.

The rest of the paper is organized as follows. In Section 2, we describe the problem of hate speech in more detail. In Section 3, we discuss existing related work. In Section 4, we present our proposed model. In Section 5, after presenting the dataset, we describe our experimental evaluation and discuss the results from the experiments. Finally, in Section 6, we conclude the paper and outline possible future work.

2 Problem statement

The problem we address in this work can be described as follows: We are given a set of postings written by a number of online users. Each posted short text is associated with a class label, where we consider the classes “Neutral” (N), “Racist” (R) and “Sexist” (S). From a training set of labeled short texts, we set out to train a classifier that, when receiving a new posting from a given user, can extract and combine information in the training data about short-text messages in general, and about the posting history of the active user in particular, to successfully classify the new posting as either “N”, “S” or “R”. The research question we address in this work is thus:

How to effectively identify the class of a new posting, given the identity of the posting user and the history of postings related to that user?

To answer this question, our main goals can be summarized as follows:

  • To develop a novel method that can improve the state-of-the-art approaches within hate-speech classification, in terms of classification performance/accuracy.

  • To investigate the impact of incorporating information about existing personalized labeled postings from users’ past history on the classification performance/accuracy.

Note that existing solutions for automatic detection still fall short of effectively detecting abusive messages. There is therefore a need for new algorithms that classify such content more effectively and efficiently. Our work is a step in that direction.

3 Related work

Simple word-based approaches, if used for blocking the posting of text or blacklisting users, not only fail to identify subtle offensive content, but they also affect freedom of speech and expression. The word ambiguity problem – that is, a word can have different meanings in different contexts – is mainly responsible for the high false-positive rate of such approaches. Ordinary NLP approaches, on the other hand, despite their popularity [21], are ineffective at detecting the unusual spelling encountered in user-generated comment text. This is best known as the spelling variation problem, and it is caused by unintentional or intentional replacement of single characters in a token, aiming to obfuscate the detectors. In general, the complexity of natural language constructs renders the task quite challenging. Irrespective of whether NLP approaches are used, we can distinguish two major categories among the existing solutions to the hate-speech problem: unsupervised learning and supervised learning.

Unsupervised learning approaches are quite common for detecting offensive messages in text; they essentially apply concepts from NLP to exploit lexical and syntactic features of sentences [4], or use AI solutions with bag-of-words text representations [23]. The latter is known to be less effective for automatic detection, since hateful users apply various obfuscation tricks, such as replacing a single character in offensive words. For instance, applying a binary classifier to a paragraph2vec representation of words has already been attempted on Amazon data [7], but it only performed well on a binary classification problem. Another unsupervised learning solution is the work in [25], in which the authors proposed a set of criteria that a tweet should exhibit in order to be classified as offensive. They also showed that differences in the geographic distribution of users have only a marginal effect on detection performance. Despite the above observation, we explore other features that might improve detection accuracy in the solution outlined below. The work in [24] applies a crowd-sourced solution to tackle hate-speech, creating an additional dataset of annotations to extend the existing corpus. The authors also investigated the impact of annotator experience on classification performance.

As for supervised learning classification methods, their employment in the detection of hate-speech is not new. The work in [6] describes a way of distinguishing hate-speech from offensive language in tweets, based on a classifier model that involves Naive Bayes, Decision Trees and SVM. Similarly, [16] attempts to discern abusive content with an NLP-based supervised model combining various linguistic and syntactic features of the text, considered at the character uni-gram and bi-gram level, and tested on Amazon data. Jha and Mamidi [11] dealt with the classification of tweets, but their interest was in sexism alone, which they distinguished into ‘Hostile’, ‘Benevolent’ or ‘Other’. While the authors used the dataset of tweets from [25], they treated the existing ‘Sexism’ tweets as being of class ‘Hostile’, and collected their own tweets for the ‘Benevolent’ class, to which they finally applied the FastText classifier [12] and SVM.

Supervised learning models also include Deep Neural Networks (DNNs). Their power comes from their ability to find data representations that are useful for classification, and they are widely explored for NLP tasks. Convolutional Neural Networks (CNNs) [14] and Recurrent Neural Networks (RNNs) [8] are the two main DNN architectures from which NLP has benefited. CNNs are suited to multi-dimensional input data sampled periodically, where a number of adjacent inputs are convolved into the next layer of the network. RNNs add recurrent connections (loops) to the architecture, with back-propagation used during training to update the network weights in every layer. LSTMs are special RNNs that allow signals to propagate over arbitrarily long sequences, making them sensitive to the order of input values. The work in [22] reports that a simple LSTM classifier performs no better than an ordinary SVM when evaluated on a small sample of Facebook data with only two classes (Hate, No-Hate) and three levels of hatred strength. The authors of [1] approached the issue with a neural network model that uses an LSTM with features extracted from character n-grams, assisted by Gradient Boosted Decision Trees. Their method achieved a higher score on the same dataset of tweets than any unsupervised learning solution known so far. CNNs have also been explored as a potential solution to the hate-speech problem in tweets, with character n-grams and word2vec pre-trained vectors being the main tools. For example, in [19] classification is transformed into a two-step problem, where abusive text is first distinguished from non-abusive text, and then the class of abuse (Sexism or Racism) is determined. In [9], a CNN with pre-trained vectors is employed in an effort to predict the four classes, finally achieving a slightly higher F-score than character n-grams. A summary of the existing approaches to the problem of hate speech, along with their characteristics, is presented in Table 1.

Table 1 Cartography of existing research in hate-speech detection

In general, the main weaknesses of NLP-based models are their non-language-agnostic nature and their low detection scores. Despite the high popularity of such models [21], whether used in supervised or unsupervised settings, we believe there is still high potential for DNNs to contribute further to this problem. At this point it is also worth noting the inherent difficulty of the hate-speech challenge itself, illustrated by the fact that no solution thus far has been able to obtain an F-score above 0.93.

4 Description of our recurrent neural network-based approach

In our experimentation we use a powerful type of RNN known as the Long Short-Term Memory (LSTM) network. Inspired by the work in [1], we experiment with combining various LSTM models, enhanced with a number of novel features, in an ensemble. More specifically, we introduce:

  • A number of additional features concerned with the users’ tendency towards hateful behavior.

  • An architecture which combines the outputs of various LSTM classifiers to improve classification ability.

4.1 Features

We first elaborate on the details of the features derived to describe each user’s tendency towards each class (Neutral, Racism or Sexism), as captured in their tweeting history. In total, we define the three features \(t_{N,a}\), \(t_{R,a}\), \(t_{S,a}\), representing user a’s tendency towards posting Neutral, Racist and Sexist content, respectively. We let \(m_{a}\) denote the set of tweets by user a, and use \(m_{N,a}\), \(m_{R,a}\) and \(m_{S,a}\) to denote the subsets of those tweets that have been labeled as Neutral, Racist and Sexist, respectively. The features are then calculated as \(t_{N,a} = |m_{N,a}|/|m_{a}|\), \(t_{R,a} = |m_{R,a}|/|m_{a}|\), and \(t_{S,a} = |m_{S,a}|/|m_{a}|\).
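A minimal sketch of how these tendency features could be computed from a user’s labeled tweet history (the function name and label encoding are our assumptions):

```python
from collections import Counter

def user_tendencies(labels):
    """labels: class labels ('N', 'R', 'S') of one user's past tweets."""
    counts = Counter(labels)
    total = max(len(labels), 1)          # guard against an empty history
    return (counts["N"] / total,         # t_{N,a}
            counts["R"] / total,         # t_{R,a}
            counts["S"] / total)         # t_{S,a}

# Example: user_tendencies(['N', 'N', 'S', 'R']) -> (0.5, 0.25, 0.25)
```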

Furthermore, we choose to model the input tweets as vectors using word-based frequency vectorization. That is, the words in the corpus are indexed by their frequency of appearance in the corpus, and the index value of each word in a tweet is used as one of the vector elements describing that tweet. We note that this modelling choice provides a significant advantage: the model is independent of the language used for posting the message.
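This frequency-based indexing is what, for instance, the Keras Tokenizer provides out of the box; a small sketch under that assumption (the toy corpus is ours):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["you are awesome", "you are awful", "you rock"]   # toy corpus for illustration
tokenizer = Tokenizer(num_words=25000)                      # cap matching the vocabulary size used later
tokenizer.fit_on_texts(corpus)                              # indices assigned by descending word frequency
vectors = tokenizer.texts_to_sequences(corpus)              # e.g. [[1, 2, 3], [1, 2, 4], [1, 5]]
```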

4.2 Classification

To improve classification ability, we employ an ensemble of LSTM-based classifiers. The use of ensembles is a known technique for improving the classification performance of a single model [17]. In this work, we apply the ensemble paradigm to our proposed solution to the hate-speech problem. In total, the scheme comprises a number of classifiers (3 or 5), each receiving the vectorized tweets together with the behavioral features (see Section 4.1) as input.

The various characteristics were chosen with the purpose of training the neural network on any associations that exist between the attributes of a tweet and the class label given to that tweet. In each case, the chosen characteristic features are attached to the already computed vectorized content of a tweet, thereby providing the input vector for one LSTM classifier. A high-level view of the architecture, with the multiple classifiers, is shown in Fig. 1. The ensemble has two mechanisms for aggregating the classifications of the base classifiers, namely Voting and Confidence. Majority voting is a known method for maximizing the performance gain with the lowest number of classifiers [10, 18, 20]. In our work, we used a simpler rule adapted to our specific needs: majority voting is the preferred method and is employed whenever at least two of the base classifiers agree on the classification of a given tweet. When all classifiers disagree, the classifier with the strongest confidence in its prediction is given preference. The conflict resolution logic is implemented in the Combined Decision component.

Fig. 1: High-level view of the system with multiple classifiers

We present the above process in Algorithm 1; a minimal Python sketch of this logic is also given below. Here, mode denotes a function that returns the dominant value among the input classes \(id_{1},id_{2},id_{3}\), or NIL if there is a tie, while classifier is a function that returns the classification output in the form of a tuple (Neutral, Racism, Sexism).

Algorithm 1: Combined decision of the ensemble
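The sketch below illustrates the combined-decision logic, assuming each base classifier returns a probability tuple over the three classes; function and variable names are ours, not the original pseudocode’s.

```python
import numpy as np

CLASSES = ("Neutral", "Racism", "Sexism")

def combined_decision(predictions):
    """predictions: one (p_Neutral, p_Racism, p_Sexism) tuple per base classifier."""
    votes = [int(np.argmax(p)) for p in predictions]
    counts = np.bincount(votes, minlength=len(CLASSES))
    if counts.max() > 1:                              # at least two classifiers agree
        return CLASSES[int(np.argmax(counts))]
    best = max(predictions, key=max)                  # all disagree: most confident wins
    return CLASSES[int(np.argmax(best))]

# Example with three classifiers:
# combined_decision([(0.6, 0.3, 0.1), (0.2, 0.5, 0.3), (0.1, 0.2, 0.7)]) -> 'Sexism'
```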

5 Evaluation setup - results

5.1 Data preprocessing

Before training the neural network with the labeled tweets, it is necessary to apply proper tokenization to every tweet. In this way, the text corpus is split into word elements, taking white spaces and the various punctuation symbols of the language into account. This was done using the Moses package for machine translation.

We chose to limit the maximum size of each tweet considered during training to 30 words, and padded shorter tweets with zeros. Next, tweets are converted into vectors using word-based frequency indexing, as described in Section 4.1. To feed the various classifiers in our evaluation, we attach the feature values to every tweet vector.
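A sketch of this preprocessing step, assuming the Keras padding utility and that the tendency features of Section 4.1 are simply concatenated to each padded index vector (the exact attachment scheme and padding side are our assumptions):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 30

def build_inputs(index_sequences, tendencies):
    """index_sequences: word-frequency indices per tweet;
    tendencies: the (t_N, t_R, t_S) triple of each tweet's author."""
    padded = pad_sequences(index_sequences, maxlen=MAX_WORDS)   # zero-pad to 30 words
    return np.hstack([padded, np.asarray(tendencies)])          # attach the user features

# Example: build_inputs([[4, 9, 2]], [(0.8, 0.1, 0.1)]).shape -> (1, 33)
```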

In this work we experimented with various combinations of the attached features \(t_{N,a}\), \(t_{R,a}\), and \(t_{S,a}\) that express the user’s tendency. The details of each experiment, including the resulting size of each embedding (denoted ‘input dimension’), can be found in Table 2.

Table 2 Combined features in the proposed schemes

5.2 Deep learning model

In our evaluation of the proposed scheme, each classifier is implemented as a deep learning model with four layers, as illustrated in Fig. 2 and described below (a minimal Keras sketch follows the list):

  • The input (a.k.a. embedding) layer. The input layer’s size is defined by the number of inputs for that classifier. This number equals the size of the word vector plus the number of additional features. The word vector dimension was set to 30, so as to be able to encode every word in the vocabulary used.

  • The hidden layer. The sigmoid activation was selected for the hidden LSTM layer. Based on preliminary experiments, the dimensionality of the output space for this layer was set to 200. This layer is fully connected to both the input and the subsequent layer.

  • The dense layer. The output of the LSTM was run through an additional layer to improve the learning and obtain a more stable output. The ReLU activation function was used. Its size was selected to be equal to the size of the input layer.

  • The output layer. This layer has 3 neurons to provide output in the form of probabilities for each of the three classes Neutral, Racism, and Sexism. The softmax activation function was used for this layer.
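The following is a minimal Keras sketch consistent with the four-layer description above. Settings not stated in the text, in particular how the real-valued tendency features are fed through the embedding layer, are our assumptions; the sketch only illustrates the layer stack.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 25000   # vocabulary size (Section 5.4)
SEQ_LEN = 33         # 30-word tweet vector + 3 user-tendency features (NRS scheme assumed)

inputs = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(input_dim=VOCAB_SIZE, output_dim=30)(inputs)  # input/embedding layer
x = layers.LSTM(200, activation="sigmoid")(x)                      # hidden LSTM layer
x = layers.Dense(SEQ_LEN, activation="relu")(x)                    # dense layer, sized as the input
outputs = layers.Dense(3, activation="softmax")(x)                 # Neutral / Racism / Sexism
model = models.Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```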

In total we experimented with 11 different setups of the proposed scheme, each with a different ensemble of classifiers, as shown in Table 3.

Fig. 2: Our deep learning model

Table 3 Evaluated ensemble schemes

5.3 Dataset

We experimented with an existing dataset of approximately 16k short messages from Twitter [25]. The dataset contains 1943 tweets labeled as Racism, 3166 tweets labeled as Sexism and 10889 tweets labeled as Neutral (i.e., tweets that contain neither sexism nor racism). There are also a number of dual-labeled tweets in the dataset. More specifically, we found 42 tweets labeled as both ‘Neutral’ and ‘Sexism’, while six tweets were labeled as both ‘Racism’ and ‘Neutral’. According to the dataset providers, the labeling was performed manually.

The relatively small number of tweets in the dataset makes the task more challenging. As reported by several authors, the dataset is imbalanced, with a majority of neutral tweets. Nevertheless, the smallest class (Racism) being roughly five times smaller than the largest class (Neutral) does not impose a severe level of imbalance on the dataset. We therefore chose not to apply any adjustments to the data. Additionally, we used the public Twitter API to retrieve additional data associated with the user identity of each tweet in the original dataset.

5.4 Experimental setting

To produce results in a setup comparable with the current state of the art [1], we performed 10-fold cross validation and calculated the Precision, Recall and F-score for every evaluated scheme. We randomly split each training fold into 15% validation and 85% training data, while performance is evaluated over the remaining fold of unseen data. The model was implemented using Keras. We used categorical cross-entropy as the learning objective and selected the ADAM optimization algorithm [13]. Furthermore, the vocabulary size was set to 25000, and the batch size during training was set to 500.

To avoid over-fitting, model training was allowed to run for a maximum of 100 epochs, out of which the optimally trained state was chosen for the model evaluation. The optimal epoch was identified as the one that maximized the validation accuracy, while keeping the error within \(\pm 1\%\) of the lowest value observed within the current fold. Throughout the experiment we observed that the optimal epochs typically occurred between epochs 30 and 40.
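A sketch of the training and epoch-selection rule described above, under the assumption that the \(\pm 1\%\) criterion refers to the validation loss and that x_train/y_train hold the vectorized fold data:

```python
# Note: Keras's validation_split takes the last 15% of the data, so the
# training arrays should be shuffled beforehand to mimic a random split.
history = model.fit(x_train, y_train, validation_split=0.15,
                    epochs=100, batch_size=500, verbose=0)

val_acc = history.history["val_accuracy"]
val_loss = history.history["val_loss"]
# Candidate epochs: validation loss within 1% of the best value in this fold.
candidates = [e for e, loss in enumerate(val_loss) if loss <= 1.01 * min(val_loss)]
best_epoch = max(candidates, key=lambda e: val_acc[e])   # maximize validation accuracy
```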

To achieve stability in the results produced, we ran every single classifier 15 times and aggregated the output values. In addition, the output of each single-classifier run was combined with the output of another two single-classifier runs to build the input of an ensemble, producing \(15^{3}\) combinations. For the ensemble that incorporates all five classifiers, we restricted ourselves to using the input from only the first five runs of the single classifiers (\(5^{5}\) combinations), due to the prohibitively large number of combinations otherwise required.

5.5 Results

We now present the most interesting results from our experiments.

For the evaluation, we used standard metrics for classification accuracy, suitable for studying problems such as sentiment analysis. In particular, we used Precision and Recall, the former being the ratio of the number of tweets correctly classified to a given class over the total number of tweets classified to that class, and the latter the ratio of messages correctly classified to a given class over the number of messages belonging to that class. Additionally, the F-score is the harmonic mean of precision and recall, expressed as \( F = \frac {2 \cdot P \cdot R}{P + R}\). For our particular case with three classes, P, R and F are computed for each class separately, with the final F value derived as the weighted mean of the separate F-scores: \(F=\frac {F_{N} \cdot N + F_{R} \cdot R + F_{S} \cdot S}{N+R+S}\); recall that \(N = 10889\), \(S = 3166\) and \(R = 1943\).

The results are shown in Table 4, along with the reported results from state-of-the-art approaches proposed by other researchers in the field. Note that the performance numbers P, R and F of the other state-of-the-art approaches are based on the data reported in the cited works. Additionally, we report the performance of each individual LSTM classifier as if used alone over the same data (that is, without the ensemble logic). The F-score for our proposed approaches, shown in the last column, is the weighted average over the three classes (Neutral, Sexism, Racism). Moreover, all reported values are averages over a number of runs of the same tested scheme on the same data. Figure 3 shows the F-score as a function of the number of combined experiment runs for each ensemble of classifiers. We clearly see that the models converge; for the final run, the standard deviation of the F-score is no larger than 0.001 for all classifiers.
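For reference, the weighted F-score defined above can be computed directly with scikit-learn; the toy labels below are only for illustration.

```python
from sklearn.metrics import f1_score

y_true = ["N", "N", "N", "R", "S", "S"]   # toy ground-truth labels
y_pred = ["N", "N", "R", "R", "S", "N"]   # toy predictions
weighted_f = f1_score(y_true, y_pred, average="weighted")   # support-weighted mean of per-class F
```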

Table 4 Evaluation results. (The values highlighted in bold indicate the best performance)
Fig. 3: Aggregated value of F-score vs. the number of combined experiment runs

As can be seen in Table 4, the work in [25], in which character n-grams and gender information were used as features, obtained a rather low F-score of 0.7391. Later work by the same authors [24] investigated the impact of annotator experience on performance, but still obtained a lower F-score than ours. Furthermore, while the second step of the two-step classification in [19] performs quite well in detecting the particular class an abusive text belongs to (a reported F-score of 0.9520), it nevertheless falls short in distinguishing hateful from non-hateful content in general. Finally, we observe that applying a simple LSTM classifier in our approach, with no additional features (denoted ‘single classifier (i)’ in Table 4), achieves an F-score below 0.93, which is in line with other research in the field [1]. Very interestingly, the incorporation of features related to user behavior into the classification provides a significant increase in performance compared with using the textual content alone (\(F = 0.9295\) vs. \(F = 0.9089\)).

Another interesting finding is the performance improvement obtained by using an ensemble instead of a single classifier; some ensembles outperform the best single classifier. Furthermore, the NRS classifier, which produces the best score among the single classifiers, is the one included in the best performing ensemble.

In comparison with the approach in [11], which focuses on various classes of Sexism, the results show that our deep learning model does better at detecting Sexism in general, outperforming the FastText algorithm included in their experimental models (F = 0.87). The inferiority of FastText to LSTM is also reported in [1], and to CNN in [19]. In general, our ensemble schemes confirm that deep learning can outperform the NLP-based approaches known so far in the task of abusive language detection.

We also present the performance of each of the tested models per class label in Table 5. Results by other researchers have not been included, as these figures are not reported in the existing literature. As can be seen, sexism is quite easy to classify in hate-speech, while racism seems to be harder; similar results have been reported in the literature [6]. This result is consistent across all ensembles.

Table 5 Detailed results for every class label. (The values highlighted in bold indicate the best performance)

For completeness, the confusion matrices of the best performing approach that employs 3 classifiers (ensemble viii), as well as of the ensemble of all 5 classifiers (xi), are provided in Table 6. The presented values are sums over multiple runs.

Table 6 Confusion Matrices of Results for the best performing ensembles with 3 and 5 classifiers

To study the effect of the user’s tendency towards hate-speech on the F-score, we provide a break-down of the computed values over five classes of users. To this end, we divided the complete set of users into five subsets of equal size, with respect to their tendency towards sexism or racism, and computed the F-score independently for each user class. We present the results for each individual classifier as well as for all the ensembles of classifiers we tested.
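A sketch of this user split, assuming a mapping from user identifiers to their tendency scores (e.g. \(t_{S,a}\) or \(t_{R,a}\)); users are sorted and cut into five equal-sized groups:

```python
import numpy as np

def split_into_user_classes(user_tendency, n_classes=5):
    """user_tendency: dict mapping user id -> tendency score."""
    ordered = sorted(user_tendency, key=user_tendency.get)   # lowest tendency first
    return np.array_split(ordered, n_classes)                # class 1 .. class 5

# groups = split_into_user_classes({"u1": 0.0, "u2": 0.4, "u3": 0.9, "u4": 0.2, "u5": 0.7})
# groups[-1] then holds the users with the highest tendency (class 5).
```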

In Fig. 4 we present the F-score achieved by each classifier over the five classes of users. Class 1 contains the users with the lowest tendency, while class 5 contains the users with the highest tendency. As can be seen in the figure, tweets by users who are more inclined towards hate speech are easier for our algorithm to detect than those by less inclined users. Very interestingly, this characteristic works better for sexism than for racism. Quite impressive is the fact that the F-score for the most inclined users can reach as high as 0.995, no matter which classifier is used. In addition, classifier (O), which does not make use of user features, performs slightly worse over the full range of tendency classes.

Fig. 4: F-score for various classes of users, for single classifiers

From the output shown in Fig. 4, we observe that the classification works quite effectively, detecting almost all cases of abusive content originating from the most inclined users, which is in line with our primary objective. Overall, the above observations confirm the original hypothesis that classification accuracy is improved by employing additional user-based features in the prediction mechanism.

For completeness, we also report the F-score of all the ensembles of classifiers for each particular user class in Figs. 5 and 6. As can be seen, in the case of ensembles our approach achieves similar, and equally good, performance to that achieved by the individual classifiers. We also observe that, for the classes of users who are less inclined towards sexism or racism, the 5-classifier ensemble achieves the best performance compared with the other schemes.

Fig. 5: F-score for various classes of users with a tendency towards racism, for ensemble classifiers

Fig. 6: F-score for various classes of users with a tendency towards sexism, for ensemble classifiers

Another interesting result is presented in Fig. 7, which shows the Receiver Operating Characteristic (ROC) curves of all the single classifiers we introduced. ROC curves make it possible to assess a classifier’s performance over its entire operating range of thresholds used for separating one class from another. They also visualize the trade-off between sensitivity and specificity, so that an optimal model can finally be selected. To compute the ROC curves for the 3-class output, we applied the following rationale: for each classifier scheme, we first take each prediction, which is essentially the output of the softmax activation function, and then, separately for each class label (Neutral, Sexism, Racism), we apply a threshold to classify a tweet as belonging to that class. Next, we compute the True Positive Rate and False Positive Rate as \(tpr=\frac {tp}{tp+fn}\) and \(fpr=\frac {fp}{fp+tn}\), respectively; finally, the resulting values are averaged over the three classes Neutral, Sexism and Racism. The above steps are repeated for a range of threshold values between 0.0 and 1.0 to produce the ROC curve for that classifier.
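A sketch of this averaging procedure over a grid of thresholds, given the softmax outputs and one-hot encoded true labels (array names and the threshold grid are our choices):

```python
import numpy as np

def averaged_roc(probs, y_onehot, thresholds=np.linspace(0.0, 1.0, 101)):
    """probs, y_onehot: arrays of shape (n_tweets, 3). Returns one (fpr, tpr) point per threshold."""
    points = []
    for t in thresholds:
        tprs, fprs = [], []
        for c in range(probs.shape[1]):          # treat each class one-vs-rest
            pred = probs[:, c] >= t
            true = y_onehot[:, c].astype(bool)
            tp = np.sum(pred & true)
            fp = np.sum(pred & ~true)
            fn = np.sum(~pred & true)
            tn = np.sum(~pred & ~true)
            tprs.append(tp / max(tp + fn, 1))
            fprs.append(fp / max(fp + tn, 1))
        points.append((np.mean(fprs), np.mean(tprs)))  # average over the three classes
    return points
```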

Fig. 7: ROC comparison for all single classifiers

To express the resulting performance of each classifier as a numerical score, we compute its Area Under the Curve (AUC) value (see Table 7). The figures show that NS is the best performing classifier, achieving an AUC value of 0.8406. While all the other single classifiers performed slightly worse, they still achieved high scores within the range of 0.8 to 0.9, which is characteristic of a well-performing model. We also computed the AUC values for each of the five classes of users with regard to their tendency towards sexism or racism (see Table 7). These results also confirm the strong performance achieved by the model in separating hateful from non-hateful content when the posting originates from users belonging to a class with a high tendency towards sexism or racism.

Table 7 Area under the curve (AUC) of ROC for single classifiers. (The values highlighted in bold indicate the best performance)

Finally, we point out that our approach does not rely on pre-trained vectors, which provides an important advantage when dealing with short messages of this kind. More specifically, users will often obfuscate their offensive terms using shorter slang words, or create new words by ‘inventive’ spelling and word concatenation. For instance, the word ‘Islamolunatic’ is not available in the popular pre-trained word embeddings (Word2Vec or GloVe), even though it appears with rather high frequency in racist postings. Hence, word frequency vectorization is preferable to the pre-trained word embeddings used in prior works for building a language-agnostic solution.

6 Conclusions and future work

Automated detection of abusive language in online media has in recent years become a key challenge. In this paper, we have presented an ensemble classifier to detect hate-speech in short text, such as tweets. Our classifier uses deep learning and incorporates, as input, a series of features associated with users’ behavioral characteristics, such as the tendency to post abusive messages. In summary, this paper has made several contributions to advancing the state of the art. First, we have developed a deep learning architecture that uses word frequency vectorization for implementing the above features. Second, we have proposed a method that is language independent, because it does not use pre-trained word embeddings. Third, we have thoroughly evaluated our model using a public dataset of labeled tweets and an open-source implementation built on top of Keras. This evaluation also includes an analysis of the performance of the proposed scheme for various classes of users. The experimental results have shown that our approach outperforms the current state-of-the-art approaches and, to the best of our knowledge, no other model has achieved better performance in classifying such short messages. The results have also confirmed the original hypothesis that the classifier’s performance is improved by employing additional user-based features in the prediction mechanism.

In this section, we also discuss possible threats to validity and limitations of our approach, and give our perspectives on addressing them. The stochastic behavior of the deep learning processes is the most important threat to construct validity, resulting in fluctuations of the F-score over multiple runs (see Section 5.4). To overcome this, and thus ensure that our findings are valid, we ran every experiment multiple times and averaged the results.

Concerning the generalizability of the results, i.e. external validity, the experiment was performed on a single dataset with a fixed mixture of labels and user profiles tending towards specific types of hateful language. Although we do not claim that this dataset is representative of all real-world data, our analysis has shown that its size and heterogeneity are sufficient to test our method. Nevertheless, in future work we will further evaluate our approach on other datasets, including texts written in different languages.

Further, we assumed that users’ behavior does not change over time, which can be considered a threat to internal validity. In a real use-case, users would normally be given access to the classifier output, so that upon submitting a tweet they would become aware of the labels given to their previous ones. This means they would likely adapt their behavior to the classification criteria and avoid including hateful content in their future postings. This is a general challenge in many applications of classification techniques, which could be addressed through longer-lasting user studies or the inclusion of other sources of information. For this reason, in our future work we plan to investigate other sources of information that can be utilized to detect hateful messages.