1 Introduction

Sentiment analysis and opinion mining [1] is one of the most well-studied fields in text mining and natural language processing (NLP). It aims to detect and analyze human opinions, attitudes, and emotions. Application scenarios of sentiment analysis range from product reviews [2], advertisement distribution, and the stock market [3, 4] to social networks [5, 6] and even government intelligence [7].

Much research on sentiment analysis, along with well-known datasets such as IMDB [8] and Yelp [9], addresses the positive or negative opinion (PNO) [10] that user comments express about a certain object. In contrast, we focus on the behavior of users by capturing verbal offense, which potentially arouses negative feelings among other users. Recently, we built a de facto network comment dataset with an ‘aggressivity’ label and adopted predictive models to detect verbal offenses [11]. The dataset included manually collected paragraphs and paragraphs from ‘Sentiment140’ [12] with the labels reassigned. Combined with the WordNet [13] lemmatizer and Porter’s stemmer [14], support vector machine [15] and logistic regression [16] achieve decent performance, with F1-scores [17] greater than 0.80 on the 783 aggressive and unaggressive comments without any extensive hyper-parameter tuning. Although those two methods achieve good results on our verbal offense dataset, we looked for models that can outperform them for verbal aggression detection.

Convolutional neural networks (CNN) [18] were originally designed to process and learn information from image features by applying convolution kernels and pooling techniques, which are widely adopted for extracting stationary features. CNN has also shown its adaptability to text mining and NLP tasks. For instance, Kim reported a series of experiments with CNNs [19] that achieve good results on sentence classification and sentiment analysis tasks, and Lee et al. proposed a weakly supervised CNN architecture [20] to identify discriminating keywords in PNO tasks. Inspired by these successful applications of CNN to text classification, we introduce a CNN model to detect verbal offenses in the aggression dataset collected in our previous research, looking for a performance enhancement.

The contribution of this work is to further improve the performance on the sentiment analysis task we previously proposed by introducing an efficient CNN-based deep learning model. In addition, by testing different kinds of models and methods, we identified some CNN architectures that outperform the others.

2 System modeling

2.1 Model architecture

The CNN architecture in the present paper is derived from Kim [19] and Lee et al. [20]. Motivated by the successful results attained in those works, we design the network architecture with reference to their experience. According to Lee et al., applying a large number of filters rather than a deep architecture is good for text classification. We therefore set 128 filters on the convolution layer, each of them a 10 × 2 rectangle matching our 100 × 20 inputs. Also, after comparison experiments which will be discussed later, we decide to use mean pooling rather than max pooling as an optional choice in the pooling layer. Finally, a two-layer multilayer perceptron [21] is introduced as the classification model following the pooling layer. In summary, our model structure is shown in Fig. 1, and the detailed settings are given in the experiments. Further justification of the architecture is given in the Discussion section.

Fig. 1 Model structure of the proposed convolutional neural networks
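To make the layer sizes concrete, the following minimal sketch works out the feature-map dimensions implied by the numbers stated above (100 × 20 input, 128 filters of size 10 × 2). The stride of 1, the absence of padding, and the 2 × 2 pooling window are assumptions made only for illustration; they are not taken from the paper, and the helper names are hypothetical.

```python
def conv2d_output_shape(in_h, in_w, k_h, k_w, stride=1, padding=0):
    """Spatial size of a 2-D convolution output (standard formula)."""
    out_h = (in_h + 2 * padding - k_h) // stride + 1
    out_w = (in_w + 2 * padding - k_w) // stride + 1
    return out_h, out_w


def pool2d_output_shape(in_h, in_w, p_h, p_w):
    """Spatial size after non-overlapping pooling (stride equals pool size)."""
    return in_h // p_h, in_w // p_w


# Values stated in the text.
INPUT_SHAPE = (100, 20)   # 2-D TF-IDF matrix
KERNEL_SHAPE = (10, 2)    # size of each convolution filter
N_FILTERS = 128

# Assumption (not from the paper): stride 1, no padding, 2 x 2 pooling.
POOL_SHAPE = (2, 2)

conv_h, conv_w = conv2d_output_shape(*INPUT_SHAPE, *KERNEL_SHAPE)
pool_h, pool_w = pool2d_output_shape(conv_h, conv_w, *POOL_SHAPE)
flat_size = N_FILTERS * pool_h * pool_w

print("conv output per filter:", (conv_h, conv_w))
print("pooled output per filter:", (pool_h, pool_w))
print("flattened input to the MLP:", flat_size)
```

Under these assumed settings each filter produces a 91 × 19 map, which pooling reduces to 45 × 9, so the multilayer perceptron would receive 51,840 values. Different stride, padding, or pooling choices would change these numbers.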

2.2 Word features

The models mentioned above use word embedding techniques such as Word2Vec [22] or GloVe [23] to represent word features. Trained by a self-supervised model, words are transformed into similar vectors if they have similar meanings. However, these approaches rely heavily on the context within an article. Our dataset contains short passages from Twitter, and each document usually consists of short sentences. Sentiment analysis on short texts is a difficult task [24]. This organization of the data makes it hard for Word2Vec and GloVe to learn information about each word from its ‘context.’ That our task does not favor context-based pre-trained embedding methods will be shown later in the case study. Another word embedding method, presented in Gal et al. [25], learns the word embedding on the fly. One-hot encoded word vectors are linearly combined by an embedding layer whose weights are updated by backpropagation in the network. This is the embedding method adopted in our experiment.
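The embedding layer described above can be illustrated with a small sketch: multiplying a one-hot vector by the embedding weight matrix is equivalent to looking up one row of that matrix. The vocabulary size, embedding dimension, random initial weights, and zero-vector padding below are toy assumptions, and the function names are hypothetical; the weight updates by backpropagation are not shown.

```python
import random


def one_hot(index, vocab_size):
    """One-hot row vector for a word id."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def vec_times_matrix(vec, matrix):
    """Row vector times matrix."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(n_cols)]


def embed_document(word_ids, embedding, max_len):
    """Stack word vectors and pad with zero vectors up to max_len rows."""
    dim = len(embedding[0])
    rows = [embedding[i] for i in word_ids[:max_len]]
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


VOCAB_SIZE = 5   # toy value
EMBED_DIM = 3    # toy value

rng = random.Random(0)
# In a real network these weights would be trained by backpropagation.
embedding = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]

word_id = 2
via_product = vec_times_matrix(one_hot(word_id, VOCAB_SIZE), embedding)
via_lookup = embedding[word_id]

document = embed_document([2, 0, 4], embedding, max_len=5)
print(via_product)
print(via_lookup)
print(len(document), "rows of", len(document[0]), "values")
```

Because the two computations agree, an embedding layer is typically implemented as a table lookup rather than an explicit matrix product.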

Alternatively, we apply the TF-IDF [26] matrix as document-level features in the present work. For such a short-paragraph text classification problem, we think that word occurrence statistics are an appropriate solution that contains enough information, although TF-IDF is a traditional method. Furthermore, to better utilize the properties of CNN, the 1-dimensional vector is transformed into a 2-dimensional matrix, so that CNN filters can convolve word features over a larger field, as shown in Fig. 2. In the 1-dimensional case, a feature only shares weights with other features within the window size, while in the 2-dimensional case it can also share weights with features in the next row, which are unreachable in the 1-dimensional case. Theoretically, 2-dimensional features with a 2-dimensional kernel can capture features from a larger space than 1-dimensional ones.

Fig. 2 Comparison between 1-D feature and 2-D feature. Each element in the vector/matrix is an attribute from the TF-IDF document representation
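The difference between the two layouts can be sketched as follows. A 2000-element feature vector is reshaped into a 100 × 20 matrix, and the set of original feature indices covered by one kernel position is compared for a 1-D window and a 10 × 2 window holding the same number of weights. The row-major reshape order and the 1-D window width of 20 are assumptions made for illustration, and the function names are hypothetical.

```python
def reshape_to_matrix(vector, n_rows, n_cols):
    """Row-major reshape of a flat feature vector into n_rows x n_cols."""
    if len(vector) != n_rows * n_cols:
        raise ValueError("vector length must equal n_rows * n_cols")
    return [vector[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def window_1d(start, width):
    """Flat feature indices covered by a 1-D kernel of the given width."""
    return list(range(start, start + width))


def window_2d(top, left, k_h, k_w, n_cols):
    """Flat feature indices covered by a k_h x k_w kernel on the matrix."""
    return [(top + dr) * n_cols + (left + dc)
            for dr in range(k_h) for dc in range(k_w)]


N_ROWS, N_COLS = 100, 20
features = [float(i) for i in range(N_ROWS * N_COLS)]  # placeholder values
matrix = reshape_to_matrix(features, N_ROWS, N_COLS)

idx_1d = window_1d(0, 20)                 # 20 weights in one dimension
idx_2d = window_2d(0, 0, 10, 2, N_COLS)   # 20 weights in a 10 x 2 kernel

print("1-D window spans features", min(idx_1d), "to", max(idx_1d))
print("2-D window spans features", min(idx_2d), "to", max(idx_2d))
```

With the same 20 weights, the 1-D window reaches only features 0 to 19, whereas the 10 × 2 window touches features as far apart as 0 and 181, which is the larger field referred to in the text.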

3 Experiment settings

3.1 Dataset and preprocessing

As shown in our previous paper [11], the number of features after document encoding (regardless of the lemmatizing method) is around 2000, as tabulated in Table 1. Given this dictionary size, we limit the maximum number of features to 2000, sorted by their TF-IDF weights (ascending order). To obtain the optimal solution, different word tokenizing methods (Porter and WordNet) are compared.

Table 1 Feature numbers tokenized by non-lemmatized, WordNet lemmatizer and Porter’s stemmer
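As a rough illustration of the document encoding, the sketch below computes TF-IDF vectors for a toy corpus and zero-pads each vector to a fixed length of 2000. The weighting tf × log(N / df) is one common TF-IDF variant and is assumed here; the exact variant, the tokenization, and the rule used to select the 2000 retained features are not reproduced. The corpus and function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
            if doc else 0.0
            for tok in vocabulary
        ])
    return vocabulary, vectors


def pad_to_length(vector, length):
    """Zero-pad (or truncate) a vector to a fixed length."""
    return (vector + [0.0] * length)[:length]


# Hypothetical toy corpus, not taken from the dataset.
corpus = ["you are great", "you are awful", "great game today"]
vocabulary, vectors = tfidf_vectors(corpus)
padded = [pad_to_length(vec, 2000) for vec in vectors]

for token, weight in zip(vocabulary, vectors[1]):
    print(token, round(weight, 3))
```

In the second toy document, the word that occurs in only one document receives a higher weight than the words shared with other documents, which is the behavior that makes TF-IDF useful as a document-level feature.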

Apart from the dictionary size, word count is another important statistic of the dataset. Figure 3 is a histogram showing the relation between word count and document count. Under the 140-character limit of Twitter, most comments in the dataset contain fewer than 40 words. A few of them contain 60–80 words, while even fewer contain around 100 words. This property affects the training result of Word2Vec, which will be discussed in the case studies.

Fig. 3 Word count distribution of documents in the dataset

3.2 Learning algorithms and models

To verify the effectiveness of the presented model, baseline models from previous work and other typical frameworks are included in the experiment. The settings of the models examined are listed in Table 2.

Table 2 Hyper-parameters settings of different models

First, the CNN models described in the previous sections, which can take either the TF-IDF matrix or word embeddings as input, are included in the experiment. Second, in addition to the proposed CNN architecture, two algorithms that achieved desirable results in the previous research [11], support vector machine and logistic regression with stochastic gradient descent, are included in the experiment. These two methods are the baselines for comparison in this study. Third, a modification of the recurrent neural network (RNN) [27] called long short-term memory (LSTM) [28] is also tested, given the widely successful applications of recurrent networks. Mikolov et al. [29] introduced recurrent networks to language modeling. Yang et al. [30] designed an architecture that captures sentence-level and word-level attention using gated recurrent units [31], which are also a modification of LSTM. Based on these successful attempts, LSTM is tested in our experiment using the settings indicated in Table 2.

3.3 Performance benchmarking

Performance benchmarking is based on the holdout validation method [32], in which the original dataset is randomly divided into a training set and a test set. In our experiment, 60% of the original data is used to train the model, while the remaining 40% is used for testing. The performance metrics are accuracy and the area under the curve (AUC) of the receiver operating characteristic (ROC) [33].
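The evaluation protocol can be sketched in a few lines. The snippet below performs a random 60/40 holdout split and computes accuracy and binary ROC AUC, the latter as the probability that a positive example is scored above a negative one. The sample count, labels, scores, random seed, and 0.5 decision threshold are toy assumptions, and the function names are hypothetical; this is not the benchmarking code used for the reported results.

```python
import random


def holdout_split(samples, train_fraction=0.6, seed=0):
    """Shuffle the samples and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1000 sample ids (hypothetical size).
sample_ids = list(range(1000))
train_ids, test_ids = holdout_split(sample_ids)

# Toy labels and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(len(train_ids), len(test_ids))
print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Unlike accuracy, AUC does not depend on a decision threshold, which is why the two metrics can rank models differently.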

4 Results

The results of our present approaches compared with the other experiments are as follows. Table 3 and Fig. 4 show the accuracy and AUC values of the different methods adopted in the experiment. Deep learning algorithms are expected to gain better performance, as evidenced by many similar studies. Compared with the baseline methods from our previous research, the LSTM model improves performance, with an accuracy of 0.91 and a Macro-AUC of 0.96. The proposed CNN model with the 2-dimensional TF-IDF matrix yields a further improvement, with an accuracy of 0.92 and a Macro-AUC of 0.98.

Table 3 Performances of different models (SVM, logistic, LSTM, CNN)

Fig. 4 ROC curves of the convolutional neural network (CNN + 2D TF-IDF, max pooling), long short-term memory (LSTM), support vector machine (SVM) and logistic regression (logistic)

Apart from the numerical performance benchmarks, Fig. 4 illustrates the ROC curves of the different methods. The magenta dashed line is the micro-average curve, the navy blue dashed line is the macro-average [34] curve, and the cyan and yellow lines are the curves for each class. Curves that stay near the upper-left corner imply that the corresponding model achieves good performance. Both the accuracy and the ROC curves show that the CNN with the 2D TF-IDF matrix is a superior solution to our problem.

4.1 Embedding layer and TF-IDF

Another alternative for feature construction is to introduce the embedding layer proposed in [25] by adding a lookup table of one-hot word vectors and an embedding weights layer. Document matrices are represented by the concatenation of word vectors, with padding. The embedding approach enables us to learn at the word level rather than the document level. Table 4 and Fig. 5 show some statistics and analysis of the embedding experiments.

Table 4 Performances of introducing embedding layer to CNN and LSTM

Fig. 5 ROC curves of introducing embedding layer to CNN (up) and LSTM (down)

As shown in the table and figures above, the accuracy of CNN drops from 0.92 to 0.83, while the accuracy of LSTM drops from 0.91 to 0.72. This indicates that, although the embedding method preserves word-level and ordinal information, it compromises the prediction performance. Both CNN and LSTM with an embedding layer perform worse than with TF-IDF features at the document level.

4.2 WordNet and Porter

‘WordNet’ and ‘Porter’ are two methods to group words by their lexical roots: ‘WordNet’ groups words by meaning, while ‘Porter’ groups words by word root. In the previous paper, we tested the two methods on support vector machine and logistic regression and concluded that, on these two models, ‘WordNet’ is an optimal solution of stemming. To our surprise, the present CNN model reveals an opposite conclusion shown in Fig. 6 and Table 5. According to the figure and table, Porter’s stemmer yields 0.85 accuracy and 0.94 Macro-AUC, which are 0.07 and 0.04 lower than the 0.92 and 0.98, respectively, obtained using WordNet. Porter’s stemming is a more aggressive scheme than WordNet and loses some information during word tokenizing.

Fig. 6 ROC curves of WordNet lemmatizing (up) and Porter’s stemmer (down)

Table 5 Performances of WordNet lemmatizing and Porter’s stemmer

4.3 Model generalization analysis

In this section, we discuss the generalization of the model using learning curves. The learning curves are generated by increasing the size of the training set. If the testing score decreases as the training set grows, the model tends to suffer from overfitting. According to Fig. 7, the testing accuracy keeps increasing, and both accuracies converge to a value larger than 0.9. Hence, the learning curve demonstrates that the network achieves decent testing performance as well as good generalization ability.

Fig. 7 The learning curves of the proposed CNN model. The lower (upper) bound of the filled area is the minimum (maximum) score within 15 epochs
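The procedure behind a learning curve can be sketched generically: the model is refitted on training subsets of increasing size, and the training and testing scores are recorded each time. To keep the example self-contained, a one-dimensional threshold classifier on synthetic Gaussian data stands in for the CNN; the data, the subset sizes, and all function names are hypothetical, and the minimum/maximum band over epochs is not reproduced.

```python
import random


def learning_curve(train, test, sizes, fit, score):
    """Fit on growing prefixes of `train`; return (size, train, test) scores."""
    rows = []
    for size in sizes:
        subset = train[:size]
        model = fit(subset)
        rows.append((size, score(model, subset), score(model, test)))
    return rows


def fit_threshold(samples):
    """Stand-in model: threshold halfway between the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def score_threshold(threshold, samples):
    """Accuracy of predicting class 1 when x exceeds the threshold."""
    correct = sum((x > threshold) == (y == 1) for x, y in samples)
    return correct / len(samples)


# Synthetic data: class 1 centered at +1, class 0 centered at -1.
rng = random.Random(0)
data = [(rng.gauss(1.0 if i % 2 else -1.0, 1.0), i % 2) for i in range(500)]
train, test = data[:300], data[300:]

rows = learning_curve(train, test, [10, 50, 100, 300],
                      fit_threshold, score_threshold)
for size, train_acc, test_acc in rows:
    print(size, round(train_acc, 3), round(test_acc, 3))
```

A testing score that rises and approaches the training score as the subset grows is the pattern interpreted in the text as good generalization.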

5 Discussion

5.1 Limitations of pre-trained embedding

In this section, we present a case study on the Word2Vec method presented by Mikolov et al. [22]. Before training the proposed models for the experiments, we completed the word vector pre-training procedure and inspected the generated word vectors. We examined a set of word vectors and found that most of the pre-trained word vectors did not preserve the desired property of Word2Vec: consistency between vector similarity and word similarity.

Some sample words and their nearest neighbors ordered by similarity are shown in Table 6. We select words which are positive (successful, happy), negative (worst, lose) and neutral (everyone). Many of the words are not ‘well trained’ and fail to preserve word meanings. Short paragraphs and a small dataset (shown in Fig. 3) hinder Word2Vec training, which relies heavily on context whether it uses the continuous bag of words (predict a word from its context) or the skip-gram (predict the context from a word) model [22]. This shows the limitation of the pre-trained model on our dataset.

Table 6 Similar word samples
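A nearest-neighbor inspection of this kind can be sketched as follows. Cosine similarity is assumed as the similarity measure, since it is the usual choice for word vectors, but the paper does not state which measure was used. The three-dimensional vectors are hand-written toy values, not vectors trained on the dataset, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k=3):
    """The k other words most similar to `word`, most similar first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-written toy vectors (assumption), chosen so that related words align.
toy_vectors = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "worst": [-0.7, 0.6, 0.1],
    "everyone": [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors("happy", toy_vectors, k=2)
for other, similarity in neighbors:
    print(other, round(similarity, 3))
```

In well-trained embeddings the top neighbors of a word share its meaning, as in this toy case; the case study reports that this consistency was largely absent for vectors trained on the short-text dataset.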

5.2 Visual analysis of convolution layer and pooling layer

After benchmarking the model, we carry out some case studies on our CNN model. One property of CNN is that it applies convolution operations with trained kernels to the original feature matrix to filter out new feature sets. Our 2D TF-IDF matrix can be regarded as a single-channel picture with 100 × 20 pixels. To better observe the behavior of CNN, we construct a dictionary document (a document that contains every word in the dictionary exactly once, excluding stop words) and observe how the network transforms it. The TF-IDF-transformed dictionary is shown in Fig. 8. Attributes with higher TF-IDF scores are highlighted in yellow in the picture, whereas attributes with low TF-IDF scores are purple pixels. As the dictionary figure shows, features in the TF-IDF matrix are ranked by word count within the picture. Intuitively, the dictionary is a color gradient bar that changes from purple to yellow with a small amount of noise, because a word with a larger count is not guaranteed to gain a higher TF-IDF score. The picture of the processed dictionary convolved by the 128 trained filters is shown in Fig. 8, while some samples before and after going through the pooling layer are also shown.

Fig. 8 Dictionary as input feature (left), some convolved features (upper right) and some pooled features (lower right)

We conduct a case study to observe the effect of pooling on text classification. Figure 8 shows the features before pooling in the first row and the features after pooling in the second row. Visually, the picture of a feature after pooling is ‘brighter,’ which may imply that feature values are amplified by pooling. We also summarize some patterns and pick out some examples of the processed features. The second leftmost column has sparse active features only in some regions of the matrix, while most of the values are close to zero. This pattern implies that the convolution layer ‘filters out’ crucial features. Another pattern, shown in the third column, is that the convolution layer ‘wipes out’ the original features in a certain region (the bottom of the picture). The third pattern, found in the fourth column, is that the output feature is almost identical to the original one except for ‘diluting’ the original values. Finally, the pattern in the rightmost column ‘highlights’ attributes from a certain region while minimizing feature values outside the region. From the convolved dictionary, we can infer that features are enriched by the convolution layer; all these new implicit features are then flattened and passed through the training process of the two fully connected layers.

To further explore the functionality of the pooling layer, we perform experiments in which the average pooling layer is replaced by max pooling. Figure 9 plots the average pooling output and the max pooling output; note that these outputs come from different training processes. Compared with average pooling, max pooling outputs more features with zero values. We therefore assumed that, with richer features, average pooling should produce better results than max pooling.

Fig. 9 Overview of the 128 trained filters: averaged pooling layer (up) contains more nonzero values, while max pooling layer (down) contains more zero values

Nonetheless, the performance benchmarking in Table 7 shows a contrary result. Max pooling yields 0.92 accuracy, 0.98 Micro-AUC and 0.97 Macro-AUC, while average pooling yields 0.92, 0.96 and 0.95, respectively. The max pooling layer in our experiment shows a stronger ability than the average pooling layer to filter out necessary features and avoid overfitting.

Table 7 Performances of max pooling and average pooling
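For reference, the two pooling operations compared above can be sketched on a small feature map. The 4 × 4 input values and the non-overlapping 2 × 2 window are toy assumptions, and the function names are hypothetical; the sketch only illustrates the operations themselves and does not reproduce the trained outputs of Fig. 9 or the scores in Table 7.

```python
def pool2d(matrix, size, reduce_fn):
    """Non-overlapping size x size pooling; incomplete edge windows are dropped."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, n_rows - size + 1, size):
        row = []
        for c in range(0, n_cols - size + 1, size):
            window = [matrix[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Toy feature map (assumption): sparse, non-negative activations.
feature_map = [
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.0, 0.0],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)

print("max pooling:", max_pooled)
print("average pooling:", [[round(v, 3) for v in row] for row in avg_pooled])
```

Max pooling keeps only the strongest activation in each window, whereas average pooling spreads an isolated strong activation over the whole window and therefore reports a smaller value for it.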

Using the analysis method above, we sample one aggressive sentence and one non-aggressive sentence from the dataset, with some measurements shown in Table 8, and plot the visualized aggressive sentence in Fig. 10. Since the example sentence covers only a small set of words in the whole dictionary, the input features are sparse, and only a few areas in the plots are colored.

Table 8 Examples of verbal aggression detections and two most informative words

Fig. 10 Aggressive sentence (sentence 1 in Table 8) as input feature (left), some convolved features (upper right) and some max pooled features (lower right)

According to Table 8, bold words are the most informative ones, while italic words are the second most informative. Words that imply the aggressive sentiment in each sentence obtain high scores under our feature construction approach. The convolutional layer and the pooling layer then further extract and filter these features. These functions of the two layers are reflected by the change of the colored regions in the plots of Fig. 10.

5.3 Architecture design experiments

The proposed network architecture is rooted in Kim [19], which trains a convolutional neural network with one layer of convolution. To examine this design principle, we test different architectures with alternative numbers of convolutional layers (each followed by a max pooling layer) and dense layers. The architecture design is based on the statistics shown in Table 9: our proposed neural network with 1 convolutional layer and 2 dense layers achieves the best accuracy among all experiment settings.

Table 9 Accuracies of models with different architecture

6 Conclusion

In this paper, we present a new solution to the verbal aggression detection task raised in our previous research, based on convolutional neural networks (CNN) with 2-dimensional TF-IDF features, and observe significant improvement. First, experimental results indicate that the CNN model achieves a significant improvement over the baseline SVM and logistic regression methods of the previous study as well as the newly tested LSTM model. Moreover, we carried out experiments on the dataset to explain the selection of the word lemmatizing method. Finally, the problem that the pre-trained word vector method encountered on our dataset is noted, and the preference between pooling strategies is studied by conducting visual analysis of the neural layers.