1 Introduction

Satirical news is a type of entertainment that employs satire to criticize and ridicule, in a humorous way, key public figures, socio-political topics, or notable events [27, 38]. Although it does not aim to misinform, it mimics the style of regular news. Therefore, it has sizeable deceptive potential, amplified by the current increase in social media consumption and the higher rates of distrust in official news streams [20].

Sentiment analysis, in turn, has proven successful in determining people's opinions and feelings, especially for online stores, where customer feedback analysis can lead to better customer service [37]. Limited resources in languages such as Romanian make it challenging to develop large-scale machine learning systems, since the largest datasets offer at most tens of thousands of examples [27]. Therefore, various techniques should be proposed and investigated to address these challenges on such datasets.

Adversarial training is an effective defense strategy that intrinsically increases the robustness and generalization of models. Introduced by Szegedy et al. [33] and analyzed by Goodfellow et al. [8], adversarial examples are augmented data points generated by applying a small perturbation to the input samples. Adversarial training was initially employed in computer vision, where input images were altered with small perturbations [8, 18, 36]. More recently, it gained popularity in NLP. Since text input is a discrete signal, the perturbation is applied to the word embeddings in a continuous space [22]. The application of adversarial training in our experiments is motivated by its potential to improve the robustness and generalization of models with limited training resources.
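To make the embedding-space perturbation concrete, the following is a minimal PyTorch sketch in the spirit of Miyato et al. [22], where the gradient of the loss defines the perturbation direction; `model`, `loss_fn`, and the value of `epsilon` are placeholders, not the exact setups of the cited works.

```python
import torch

def adversarial_loss(model, embeddings, labels, loss_fn, epsilon=1e-2):
    """Sketch: combine the clean loss with a loss on perturbed embeddings."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), labels)
    # Gradient of the loss w.r.t. the embeddings gives the perturbation direction.
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    # Re-run the model on the perturbed (adversarial) embeddings.
    adv_loss = loss_fn(model(embeddings + delta.detach()), labels)
    return clean_loss + adv_loss
```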

This paper aims to introduce robust, high-performing networks employing adversarial training and capsule layers [28] for satire detection on a Romanian corpus of news articles [27] and sentiment analysis on a Romanian dataset of reviews [34]. Our experiments include training models suitable for NLP tasks as follows: Convolutional Neural Networks (CNNs) [12], Gated Recurrent Units (GRUs) [3], Bidirectional GRUs (BiGRUs), CNN-BiGRU, Long Short-Term Memory (LSTM) [10], Bidirectional LSTM (BiLSTM), and CNN-BiLSTM. Starting from Zhao et al. [41], we compare the networks against their adversarial capsule counterparts. Next, we subject the best-performing network to an in-depth analysis of how the capsule layers and the adversarial training affect performance. Thus, we test the effect of the capsule hyperparameters by varying the numbers of primary and condensed capsules [41]. Also, we assess the performance of our model when employing Romanian GPT-2 (RoGPT-2) [24] for data augmentation with up to 10,000 text continuation examples. Finally, we discuss several misclassified test inputs for the sentiment analysis task.

The main contributions of this work are as follows: (i) we thoroughly experiment with various configurations to assess the performance of the investigated approaches, namely adversarial augmentations and capsule layers; (ii) we show that the best-performing model uses BiGRU with capsule networks, while the largest improvements come from incorporating RoGPT-2-based augmentations; (iii) we investigate the effects of the analyzed components through t-SNE plots [17] and ablation studies; and (iv) we achieve state-of-the-art results on the two Romanian datasets.

2 Related Work

2.1 Capsule Networks in NLP

First presented by Sabour et al. [28], capsule neural networks are machine learning systems that model hierarchical relationships between object properties (such as pose, size, or texture) in an attempt to resemble the biological structure of neurons. Among other limitations, capsule networks address the max pooling problem of CNNs: max pooling enforces translation invariance but discards positional information, leaving CNNs vulnerable to adversarial attacks [15]. While capsule networks have been shown to be successful in image classification [28], there is also growing interest in exploring their potential for NLP tasks, especially text classification. Several works [11, 42] took the lead on this topic, showing that, using different approaches such as static and dynamic routing, capsule models provide competitive results on popular benchmarks.

Several studies have applied capsule networks to topic classification and sentiment analysis. Srivastava et al. [30] addressed the identification of aggression and related behaviors, such as hate speech and trolling, using a model based on the dynamic routing algorithm [42] that involves an LSTM as a feature extractor, two capsule layers (namely, a primary capsule layer and a convolutional capsule layer), and the focal loss [16] to handle class imbalance. The resulting model outperformed several strong baselines in terms of accuracy, although the authors expected more complex data preprocessing to further improve the results.

For the sentiment analysis task, Zhang et al. [40] proposed CapsuleDAR, a capsule model successfully combined with the domain adaptation technique via correlation alignment [32] and semantic rules. The model architecture consisted of a base and a rule network. The base network employed a capsule network for sentiment prediction, consisting of several layers: embedding, convolutional, capsule, and classification. The rule network involved a rule capsule layer before the classification layer. Extensive experiments were conducted on review datasets from four product domains, which showed that the model achieved state-of-the-art results. Additionally, their ablation study showed that the accuracy decreased sharply when the capsule layers were removed.

Su et al. [31] tackled limitations of Bidirectional Encoder Representations from Transformers (BERT) [4] and XLNet [39], such as local context awareness constraints, by incorporating capsule networks. Their model considered an XLNet layer with 12 Transformer-XL blocks on top of which the capsule layer extracted space- and hierarchy-related features from the text sequence. Experiments illustrated that capsule layers provided improved results compared with XLNet, BERT, and other classical feature-based approaches.

Moreover, Saha et al. [29] introduced a speech act classifier for microblog text posts based on capsule layers on top of BERT. The model took advantage of the joint optimization features of the BERT embeddings and the capsule layers to learn cumulative features related to speech acts. The proposed model outperformed the baseline models and showed the ability to understand subtle differences among tweets.

2.2 Romanian NLP Tasks

In recent years, several datasets have emerged aiming to improve the performance of learning algorithms on Romanian NLP tasks. Apart from the two datasets used in this work, researchers have also introduced the Romanian Named Entity Corpus (RONEC) [6] for named entity recognition, the Moldavian and Romanian Dialectal Corpus (MOROCO) [2] for dialect and topic classification, the Legal Named Entity Recognition corpus (LegalNERo) [26] for legal named entity recognition, and the Romanian Semantic Textual Similarity dataset (RoSTS) for finding the semantic similarity between two sentences.

Lately, the language model space for Romanian was also improved with the introduction of Romanian BERT (BERT-ro) [5], RoGPT-2, ALR-BERT [23], and DistilMulti-BERT [1]. In addition, all the results for these systems have been centralized in the Romanian Language Leaderboard (LiRo) [7], a leaderboard similar to the General Language Understanding Evaluation (GLUE) benchmark [35] that tracks over ten Romanian NLP tasks.

3 Datasets

In this work, we rely on two of the most recent Romanian language text datasets: a corpus of news articles, henceforth called SaRoCo [27], and one composed of positive and negative reviews crawled from a Romanian website, henceforth called LaRoSeDa [34].

3.1 Satirical News

SaRoCo is one of the most comprehensive public corpora for satirical news detection, eclipsed only by an English corpus [38] with 185,029 news articles and a German one [20] with 329,862 news articles. SaRoCo includes 55,608 samples, of which 27,628 are satirical and 27,980 are non-satirical (or regular). Each sample consists of a title, a body, and a label. On average, an entire news article has 515.24 tokens for the body and 24.97 tokens for the title. The average numbers of sentences and of words per sentence are 17 and 305, respectively. The labeling process is automated, as each news source publishes only satirical or only regular content.

3.2 Product Reviews

LaRoSeDa is one of the largest corpora for sentiment analysis in the Romanian language. It was created based on the observation that the freely available Romanian language datasets were quite limited in size. The dataset totals 15,000 online store product reviews, either positive or negative, for which the star ratings were also collected for labeling purposes. Thus, assuming that the ratings reflect the polarity of the text, each review rated with one or two stars was considered negative, whereas reviews rated with four or five stars were considered positive. The labeling process resulted in 7,500 positive reviews (235,474 words) and 7,500 negative reviews (304,813 words). The average numbers of sentences and words per review are 4 and 36, respectively.
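As an illustration, the rating-to-label rule described above can be written as a small helper. This is our reading of the dataset construction, not the authors' code; the treatment of 3-star reviews is not specified here (the discussion in Sect. 6 suggests some were assumed negative).

```python
# Hypothetical rating-to-label rule implied by the LaRoSeDa construction.
def label_from_rating(stars: int):
    if stars in (1, 2):
        return "negative"   # one- and two-star reviews are labeled negative
    if stars in (4, 5):
        return "positive"   # four- and five-star reviews are labeled positive
    return None  # 3-star handling unspecified here; Sect. 6 suggests some were assumed negative
```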

4 Methodology

The generic adversarial capsule network we employ is presented in Fig. 1. It consists of a sub-module that can represent any widely-used NLP model, followed by capsule layers. Concretely, we use primary capsules and capsule flattening layers to facilitate the projection into condensed capsules passed as input for a routing mechanism to obtain the class probabilities. To increase robustness, we feed regular and adversarial samples into the model. In what follows, we detail the employed components.

Fig. 1. Our generic adversarial capsule architecture, where \(E_d\) denotes the embedding size, \(N_s\) is the number of sentences, \(N_w\) is the number of words per sentence, \(N_{pc}\) is the number of primary capsules, \(N_{cc}\) is the number of condensed capsules, and \(N_{cls}\) is the number of classes to which the routing algorithm will converge.

Word Embeddings. Each word is associated with a fixed-length numerical vector, allowing us to express semantic and syntactic relations, such as context, synonymy, and antonymy. Depending on the model, the embedding representation has various sizes.

To use a continuous representation of the input data, we employ two types of embeddings: BERT-based and non-BERT-based. For the RoBERT model [19], we rely on the embeddings delivered by the model, with dimension \(E_d=768\). For the non-BERT models, we follow Onose et al. [25] in terms of distributed word representations and choose Contemporary Romanian Language (CoRoLa) [21] with an embedding dimension of \(E_d=300\), Nordic Language Processing Laboratory (NLPL) [14] with \(E_d=100\), and Common Crawl (CC) [9] with \(E_d=300\).
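For illustration, pre-trained vectors of this kind are commonly distributed in word2vec text format and can be loaded as sketched below; the file names are hypothetical placeholders, not the official distribution paths.

```python
# A minimal sketch of loading the non-BERT word embeddings with gensim;
# the file names are placeholders for the CoRoLa, NLPL, and CC releases.
from gensim.models import KeyedVectors

corola = KeyedVectors.load_word2vec_format("corola.300.vec")  # E_d = 300
nlpl = KeyedVectors.load_word2vec_format("nlpl_ro.100.vec")   # E_d = 100
cc = KeyedVectors.load_word2vec_format("cc.ro.300.vec")       # E_d = 300

vec = corola.get_vector("carte")  # 300-dimensional vector for a Romanian word
```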

Adversarial Examples. To increase the robustness of our networks, we create adversarial examples by replacing characters in words. Using the letters of the Romanian alphabet, we randomly substitute one character in each of a number of words that depends on the sentence length: one replacement for sentences with fewer than five words, two replacements for sentences with 5 to 20 words, and three replacements for sentences with more than 20 words.
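A minimal sketch of this character-replacement scheme follows; the word-selection and letter-sampling details are our assumptions based on the description above, not the authors' exact procedure.

```python
# Character-level adversarial examples: one substituted character per
# selected word, with uniform letter sampling (assumed details).
import random

ROMANIAN_LETTERS = "aăâbcdefghiîjklmnopqrsștțuvwxyz"

def perturb_sentence(words):
    n = len(words)
    num_replacements = 1 if n < 5 else (2 if n <= 20 else 3)
    perturbed = list(words)
    for idx in random.sample(range(n), min(num_replacements, n)):
        word = perturbed[idx]
        pos = random.randrange(len(word))
        perturbed[idx] = word[:pos] + random.choice(ROMANIAN_LETTERS) + word[pos + 1:]
    return perturbed
```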

Primary Capsule Layer. This layer transforms the feature maps obtained by passing the input through the sub-module into groups of neurons that represent each element of the current layer, preserving more information. Using \(1 \times 1 \) filters, we determine the capsule \(\boldsymbol{p}_i\) from the projections \(p_{ij}\) of the feature maps [41]:

$$\begin{aligned} \boldsymbol{p}_i = squash(p_{i1} \oplus p_{i2} \oplus \cdots \oplus p_{id}) \in \mathbb {R}^d \end{aligned}$$
(1)

where d is the primary capsule dimension, \(\oplus \) is the concatenation operator, and \(squash(\cdot )\) adds non-linearity in the model:

$$\begin{aligned} squash(\boldsymbol{x}) = \frac{ \Vert \boldsymbol{x}\Vert ^2 }{1 + \Vert \boldsymbol{x}\Vert ^2 }\frac{\boldsymbol{x}}{\Vert \boldsymbol{x}\Vert } \end{aligned}$$
(2)
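A PyTorch sketch of Eqs. (1) and (2) is given below, assuming a 1D convolutional projection and illustrative shapes; this is a minimal reading of the layer, not the authors' implementation.

```python
# Sketch of the primary capsule layer: 1x1 convolutions project the feature
# maps into groups of d neurons, which are then passed through squash.
import torch
import torch.nn as nn

def squash(x, dim=-1, eps=1e-8):
    # Eq. (2): scales vector length into (0, 1) while keeping direction.
    sq_norm = (x ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * x / torch.sqrt(sq_norm + eps)

class PrimaryCapsules(nn.Module):
    def __init__(self, in_channels, num_capsules, capsule_dim):
        super().__init__()
        # One 1x1 convolution emits all capsule components at once.
        self.proj = nn.Conv1d(in_channels, num_capsules * capsule_dim, kernel_size=1)
        self.capsule_dim = capsule_dim

    def forward(self, feature_maps):            # (batch, in_channels, length)
        p = self.proj(feature_maps)             # (batch, num_caps * d, length)
        p = p.permute(0, 2, 1).reshape(p.size(0), -1, self.capsule_dim)
        return squash(p)                        # capsules p_i in R^d, Eq. (1)
```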

Compression Layer. Because the routing process (i.e., the fully connected part of the capsule framework) requires extensive computational resources, we need to reduce the number of primary capsules. We follow the approach proposed by Zhao et al. [41], which uses capsule compression to determine the input of the routing layer. Each condensed capsule \(\hat{\boldsymbol{u}}_j\) represents a weighted sum over all the primary capsules:

$$\begin{aligned} \hat{\boldsymbol{u}}_j = \sum _{i} b_i \boldsymbol{p}_i \in \mathbb {R}^d \end{aligned}$$
(3)
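A sketch of the compression step in Eq. (3) follows; using a separate weight vector per condensed capsule is our assumption (the equation's notation leaves this implicit), consistent with the intent of [41].

```python
# Capsule compression (Eq. 3): each condensed capsule u_hat_j is a learned
# weighted sum over all primary capsules p_i.
import torch
import torch.nn as nn

class CapsuleCompression(nn.Module):
    def __init__(self, num_primary, num_condensed):
        super().__init__()
        # One learnable weight per (condensed, primary) capsule pair (assumed).
        self.b = nn.Parameter(torch.randn(num_condensed, num_primary) * 0.01)

    def forward(self, primary):
        # primary: (batch, num_primary, d) -> condensed: (batch, num_condensed, d)
        return torch.einsum("ji,bid->bjd", self.b, primary)
```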

Routing Layer. This layer performs the transition from the condensed capsules to the representation layer. It relies on a routing method to overcome the loss of information caused by the usual pooling operations. In our capsule framework, we choose Dynamic Routing with three iterations [28].
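For completeness, here is a compact sketch of dynamic routing with three iterations, following Sabour et al. [28]; the variable names and tensor layout are ours, not the authors' code.

```python
# Dynamic routing: iteratively refine coupling coefficients by agreement.
import torch
import torch.nn.functional as F

def squash(x, dim=-1, eps=1e-8):
    n2 = (x ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * x / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iterations=3):
    # u_hat: prediction vectors of shape (batch, num_in, num_classes, d_out)
    batch, num_in, num_classes, _ = u_hat.shape
    b = torch.zeros(batch, num_in, num_classes, device=u_hat.device)
    for _ in range(num_iterations):
        c = F.softmax(b, dim=2)                       # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum over inputs
        v = squash(s)                                 # output capsules
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement update
    return v  # class score = capsule length ||v||
```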

Representation Layer. In our binary classification tasks, the last part of the generic architecture outputs the probability of a text being satirical or regular for SaRoCo, and of carrying positive or negative sentiment for LaRoSeDa.

5 Experimental Setup

5.1 Model Parameters

First, at the embedding level, we use CoRoLa and CC with 300-dimensional vectors and NLPL with 100-dimensional vectors. For the CNN sub-module, we choose n-gram kernels of three sizes (i.e., 3, 4, and 5) with 300 filters each. For the capsule layers, we use \(N_{pc}=8\) primary capsules and \(N_{cc}=128\) condensed capsules, which we fully connect through Dynamic Routing to obtain \(N_t\) lists with \(N_{cls}\) elements. For each element in the list, the argument of the maximum value represents the predicted label, where "1" denotes a satirical text or a positive review, whereas "0" denotes a non-satirical text or a negative review. Second, for the GRU and LSTM sub-modules, we employ one layer and a hidden state dimension of 300 for both the unidirectional and bidirectional versions. Finally, for the RoBERT model, we choose the base version of the Transformer with vector dimension 768, followed by a fully connected layer of size 64 with \(\tanh \) activation, and a fully connected layer with \(N_{cls}\) output neurons.

5.2 Training Parameters

The number of texts chosen from SaRoCo is \(N_t=30,000\) (15,000 satirical and 15,000 non-satirical), with a maximum of \(N_s=5\) sentences per document and \(N_w=60\) words per sentence. For LaRoSeDa, we use 6,810 positive and 6,810 negative reviews for training, with \(N_s=3\) sentences per document and \(N_w=60\) words per sentence. The optimizer is Adam [13], and the loss function is binary cross-entropy. We set the learning rate to \(5e-5\) with linear decay and train for 20 epochs. The batch size is 32, and the train/validation/test split is 70%/20%/10%.
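The training configuration can be expressed as a short PyTorch sketch, assuming a per-epoch linear decay schedule; `model` is a placeholder, and `BCEWithLogitsLoss` stands in for the binary cross-entropy mentioned above.

```python
# Sketch of the training setup: Adam, lr 5e-5 with linear decay, 20 epochs.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, BATCH_SIZE, BASE_LR = 20, 32, 5e-5

optimizer = Adam(model.parameters(), lr=BASE_LR)  # `model` is a placeholder
# Linear decay of the learning rate across the 20 epochs.
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 1.0 - epoch / EPOCHS)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on logits
```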

6 Results

This section presents the performance analysis of our models from quantitative and qualitative perspectives, as well as a comparison with previous works for the chosen datasets.

Initial Results. Table 1 shows our results on the SaRoCo and LaRoSeDa datasets. The experiments with embeddings other than RoBERT (i.e., CC, CoRoLa, and NLPL) show that NLPL yields better performance overall. This is unexpected because CoRoLa covers over one billion Romanian tokens, while CC and NLPL contain considerably fewer. For the SaRoCo dataset, the best model on the CC embeddings uses the BiGRU sub-module, achieving a 95.80% test accuracy. For the CoRoLa corpus, the GRU and BiGRU sub-modules perform equally, resulting in a 95.77% test accuracy. Also, the best NLPL embedding model uses the BiGRU sub-module, scoring a 96.15% test accuracy. On the LaRoSeDa dataset, the best model obtains a 96.06% test accuracy with GRU on NLPL embeddings. Moreover, training on the RoBERT embeddings brings the highest performance when combined with the BiGRU sub-module, achieving a test accuracy of 98.32% on SaRoCo and 98.60% on LaRoSeDa.

The score differences between our results on the two datasets are below 0.5%. Some performance gap is expected given the considerably larger amount of data in SaRoCo. Thus, there is no concrete evidence on whether the satire detection task is more complex than the sentiment analysis one, especially in the binary classification setup. Still, since the training set of LaRoSeDa is considerably smaller than that of SaRoCo, the slight performance difference suggests that the polarized nature of the reviews aids sentiment analysis.

We further assess the feature representation quality of each sub-module using two-dimensional t-SNE visualizations of the best-performing training results. Figure 2 shows distinct cluster representations in most cases. For the SaRoCo dataset, the best delimitation is observed for the BiGRU sub-module, which is consistent with the best performance achieved on the NLPL embeddings, as shown in Table 1. A similar effect applies to the BiGRU sub-module trained and evaluated on LaRoSeDa. Considering these results, the next set of experiments is based on the best performance achieved with and without BERT embeddings, namely, the BiGRU sub-module with RoBERT and NLPL embeddings, respectively.
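The visualization step is standard and can be sketched as follows; `features` and `labels` are placeholders for the learned text representations and the gold labels on the test set.

```python
# Sketch of the t-SNE projection [17] used to inspect the learned features.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

points = TSNE(n_components=2, random_state=0).fit_transform(features)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="coolwarm", s=4)
plt.title("t-SNE of learned text representations")
plt.show()
```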

Fig. 2. t-SNE plots for each sub-module from the best-performing adversarial capsule network. The first row depicts the evaluation on SaRoCo, where blue indicates non-satirical text and orange indicates satirical text. The second row is for LaRoSeDa, where blue indicates negative sentiment and orange indicates positive sentiment. The higher density on SaRoCo is due to its larger test set.

Table 1. Accuracy (Acc) of the generic adversarial capsule network with different word embeddings and sub-modules.

Comparison to Existing Methods. Compared with the results of Rogoz et al. [27] on the SaRoCo dataset, our models show a gain of more than 25% over the BERT-ro approach and outperform the character-level CNN by more than 29%. Human performance is another notable reference: Rogoz et al. [27] asked ten human annotators to decide whether each of 200 news articles extracted from the dataset is satirical, indicating a human accuracy of 87.35%. Our approach surpasses this result by more than 11%. In addition, the results reported by Tache et al. [34] on the LaRoSeDa dataset confirm the competitive performance of our proposed approach: our results are 7–8% higher than their best model, HISK+BOWE-BERT+SOMs, which comprises histogram intersection string kernels, bag-of-words with BERT embeddings, and self-organizing maps.

Table 2. Accuracy for various capsule hyperparameters.

Capsule Hyperparameter Variation. Figure 1 depicts the hyperparameters of the capsule layers of our generic network, namely \(N_{pc}\) (the number of primary capsules) and \(N_{cc}\) (the number of condensed capsules). We test the impact of these hyperparameters on the BiGRU sub-module with NLPL embeddings, reporting the average over three runs per experiment. The chosen values are \(N_{pc} \in \{2, 8, 32\}\) and \(N_{cc} \in \{32, 128, 256\}\) (see Table 2), and the search loop is sketched after the next paragraph.

During the experiments, we observed that large values of \(N_{pc}\) considerably increase the training time. This is mainly due to the operations over high-dimensional matrices in the \(squash(\cdot )\) function of the iterative Dynamic Routing algorithm (see Eq. 2). The results in Table 2 support the intuition that a larger \(N_{pc}\) brings better results. The model trained on SaRoCo with \(N_{pc}=32\) achieves the highest accuracy of 96.17%; nevertheless, the difference between choosing 8 and 32 is minimal. For SaRoCo and LaRoSeDa, the best overall performance is achieved with \(N_{cc}=128\), attaining accuracy scores of 96.02% and 95.46%, respectively. Based on both sets of results, we note that, for better performance, hyperparameter search should be extended to the capsule hyperparameters.
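The grid over capsule hyperparameters amounts to a simple loop, sketched below; `train_and_eval` is a hypothetical helper that trains the NLPL-BiGRU capsule model with the given capsule sizes and returns the test accuracy.

```python
# Sketch of the capsule hyperparameter grid from Table 2.
from itertools import product
from statistics import mean

for n_pc, n_cc in product([2, 8, 32], [32, 128, 256]):
    # Average over three runs, as reported in the paper.
    accs = [train_and_eval(num_primary=n_pc, num_condensed=n_cc, seed=s)
            for s in range(3)]
    print(f"N_pc={n_pc:2d}  N_cc={n_cc:3d}  acc={mean(accs):.2%}")
```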

Ablation Study. Motivated by the close performance of the BiGRU-based models with NLPL and RoBERT embeddings, we perform an ablation study, slicing the generic model into four categories: baselines (i.e., NLPL-BiGRU and RoBERT-BiGRU), adversarial (Adv), Capsule, and Adv+Capsule. The best results on the test datasets are obtained by the most complex models in terms of training and architecture, with a 96.02% test accuracy on SaRoCo and a 95.82% test accuracy on LaRoSeDa using the NLPL embeddings, as well as a 98.30% test accuracy on SaRoCo and a 98.61% test accuracy on LaRoSeDa using the RoBERT embeddings (see Table 3).

Fig. 3. t-SNE plots on the embedding space for each model from the ablation study.

Table 3. Ablation study.

Regarding model complexity, we find that adding capsule layers on top of the baseline BiGRU improves performance, irrespective of whether the perturbed data is included in training; the only exception is adversarial training alone on the baseline model. The performance increase on the SaRoCo dataset with our model is 0.45% for the NLPL embeddings and 0.10% for the RoBERT embeddings. We observe a gap of 2.73% between the smallest model (i.e., NLPL-BiGRU) and the most complex one (i.e., RoBERT-BiGRU+Adv+Capsule). For the LaRoSeDa dataset, we gain 1.18% using the NLPL embeddings and 0.45% with the RoBERT embeddings. Also, the test accuracy difference between the most complex and the smallest models is 3.97%, indicating that the network brings more value to the sentiment analysis task.

The two-dimensional t-SNE embeddings depicted in Fig. 3 show the contrast between the capsule- and non-capsule-based models. The embeddings obtained with the BiGRU alone exhibit a characteristic chained distribution, with clusters defined by halving the sequence. The RoBERT embeddings convey a similar partition. In contrast, the capsule networks mostly feature well-separated embedding clusters. No significant change in the embeddings occurs when adversarial training is included.

Table 4. Results for RoBERT-BiGRU augmented with RoGPT-2 data in terms of precision (P), recall (R), and accuracy (Acc).

Table 5. Examples from LaRoSeDa predicted with RoBERT-BiGRU. Ground truth (GT), predicted (Pred), and human labels are shown. P stands for positive, N for negative, and I for indecisive.

Data Augmentation. Next, we incorporate RoGPT-2 text continuation examples generated from a set of samples, using two decoding strategies (i.e., greedy and beam-search-2). We perform experiments with the RoBERT-BiGRU model and show that the generative effort increases the overall performance on both tasks (see Table 4). In most cases, the RoBERT embeddings bring increased performance on the LaRoSeDa dataset as a consequence of the polarized nature of the product reviews, which are strongly positive or negative. This polarization effect also applies to the models trained on augmented data. Data augmentation using the greedy decoder achieves the best performance on SaRoCo, with a 99.08% test accuracy when employing 10,000 expanded texts, compared with the best accuracy of 98.68% obtained with beam-search-2. Furthermore, on LaRoSeDa, we observe similar performance with the greedy search algorithm, with a best accuracy of 98.94% for 10,000 augmented texts. However, for the second dataset, more generated data does not necessarily yield the best performance: in the beam-search-2 scenario, using 10,000 augmented texts slightly underperforms compared with 5,000 examples.
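The two decoding strategies can be sketched with the Hugging Face transformers API; the checkpoint name "readerbench/RoGPT2-base" is our assumption for the released RoGPT-2 model, and the generation length is illustrative.

```python
# Sketch of RoGPT-2 text continuation for data augmentation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoGPT2-base")  # assumed name
model = AutoModelForCausalLM.from_pretrained("readerbench/RoGPT2-base")

def continue_text(prompt, beams=1, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt")
    # num_beams=1 with do_sample=False is greedy decoding; num_beams=2
    # reproduces the beam-search-2 strategy mentioned above.
    out = model.generate(**inputs, num_beams=beams,
                         max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```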

Discussions. RoBERT-BiGRU, augmented with RoGPT-2 samples, correctly classifies 1,344 out of 1,362 examples from the LaRoSeDa test dataset. Due to space constraints, Table 5 depicts only the shortest eight misclassified texts out of 18, for which the ground truth, predicted, and human-annotated labels are shown. On these examples, two human annotators produced three indecisions and five classifications contradicting the expected labels. The indecisive cases and the negative misclassifications are likely to have been 3-out-of-5-star ratings, which were assumed negative when the dataset was created. Furthermore, we observe that strongly positive texts such as "I like it. A feminine bracelet that does its job well", "I was very satisfied with it", "happy about the product", "I recommend it", and "pleased! it is a very good clear sound!" have negative ground truth in the dataset. However, these are positive examples for both the model and the human annotators. Thus, we identify noise in the LaRoSeDa dataset, which is expected for datasets gathered from online sources, as noise can be introduced by the page users or by automated data extractors.

7 Conclusions

Satire detection and sentiment analysis are important NLP tasks for which the literature provides an ample palette of models and applications. Despite the greater polarization expected in the product review task, in contrast with the more subdued tone of satirical texts, our models properly capture the meaning conveyed by the relevant features. In the syntactic and semantic context of our tasks, there is only a slight difference in performance between the CC, CoRoLa, and NLPL embeddings, whereas fine-tuning the pre-trained RoBERT model brings up to 3% performance improvement. We showed across many experiments that our parameterized capsule framework can be adapted to specific problems. Moreover, we can improve the capsule network by employing data augmentation with generative models such as RoGPT-2, achieving a maximum gain of 0.6%. Based on our results, such an architecture holds significant potential, enabling further work in this direction.