
1 Introduction

The increased usage of social media and the consumption of news over the internet have accelerated the spread of fake news, which has already had effects on political processes [17]. Even though there is no generally accepted definition of the term fake news, automatic fake news detection with machine learning techniques can help users to identify signs of deception more easily [22]. Expert-based fact-checking, in contrast, requires many resources and is time-consuming, which makes the development of automatic machine learning approaches an important goal [11]. For content-based fake news detection, the Transformer models introduced by Vaswani et al. [40] seem to be a promising approach. Research using transfer learning has already outperformed previous state-of-the-art methods in numerous NLP downstream tasks [8, 18, 21, 42]. Due to insufficient comparative results, the goal of this work is to show to which extent pre-trained language models are useful for content-based fake news detection and whether they yield promising results in classifying the body texts and titles of news articles.

The paper is structured as follows. In Sect. 2, we give a brief overview of the definition of fake news. In Sect. 3, we discuss previous work and state-of-the-art language models, followed by related work on content-based fake news detection with Transformers. Section 4 describes the methodology, data and preprocessing steps. We present the conducted experiments, results and evaluations in Sects. 5 and 6. We conclude the paper with a summary of the main contributions and give suggestions for future work.

2 Fake News

Scientific publications usually differ in their definitions of the term fake news [43]. The intention behind creating such false news pieces has various reasons. On the one hand, there is a financial motive, where people and companies gain revenue by spreading false articles and generating clicks [15]. Intentions can also be malicious, if a news article is created only to hurt one or more individuals, manipulate public opinion, or spread an ideology [33]. Rubin et al. [29] state that fake articles “[...] may be misleading or even harmful, especially when they are disconnected from their original sources and context.” Mahid et al. [22], however, defined it more narrowly: “Fake news is a news articles that is intentionally and verifiable false.” This definition is used by several other publications [7, 32]. Some studies use broader definitions of fake news, such as Sharma et al. [33]: “A news article or message published and propagated through media, carrying false information regardless the means and motives behind it.” This definition covers fabricated as well as misleading content. Depending on intention and factuality, many related concepts fall under the fake news definition: misinformation (unintentional) [3], disinformation (intentional) [5], satire [17], fabrications [15], clickbait [5], hoaxes [29], rumors [24], and propaganda [5]. In this work we define fake news as follows: fake news is an article which propagates a distorted view of the real world, regardless of the intention behind it.

3 State-of-the-Art

Many promising approaches to detect fake news have been proposed in recent years. The methods vary from simple ones (e.g. Naïve Bayes) to more complex ones (e.g. CNN, RNN, and LSTM), resulting in a wide range of prediction outcomes. Several surveys have been published that give an overview of social-context-based, content-based, knowledge-based, and hybrid detection approaches [24, 26, 33, 43]. For content-based classification, the recently introduced Transformer-based models match or outperform previous methods in a wide range of research tasks [39]. The pre-trained models can be fine-tuned with a dataset of a specific NLP task, where the available corpora are often small [39]. Additionally, word embeddings are a significant improvement for language modeling [16]. Embeddings create a numeric representation of the input, with additional positional embeddings representing the position of the tokens in a sentence [12]. The standard Transformer architecture consists of an encoder and a decoder with self-attention to capture the context of a word in a sentence [39].
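As a minimal illustration of the embedding step described above (not the implementation of any of the cited models), the following PyTorch sketch sums learned token and positional embeddings; the toy vocabulary size, dimensions and input ids are our own assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes; real models use e.g. ~30,000 tokens and 512 positions.
vocab_size, max_len, d_model = 100, 16, 32

token_emb = nn.Embedding(vocab_size, d_model)  # numeric representation of each token
pos_emb = nn.Embedding(max_len, d_model)       # positional embedding for each position

# A toy batch of token ids (batch size 1, sequence length 5).
token_ids = torch.tensor([[5, 17, 42, 8, 3]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

# The Transformer input is the sum of token and positional embeddings.
x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 5, 32])
```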

3.1 Transformer and Language Models

Several language models have already been made publicly available. ELMo (Embeddings from Language Models) is a deep, bidirectional contextualized word representation, developed to improve word embeddings [25] and to predict the next word in a sentence [10]. According to [13], ELMo uses both the encoder and the decoder of the Transformer architecture. ULMFiT (Universal Language Model Fine-Tuning), in contrast, uses a multi-layered BiLSTM without the attention mechanism [10]. Howard and Ruder [14] pre-trained ULMFiT on general data and fine-tuned it on a downstream task, which works well with limited labeled data in multiple languages [14]. GPT (Generative Pre-Training Transformer), on the other hand, is a multi-layered Transformer decoder [10], which extends the ideas of ELMo and ULMFiT without the LSTM model [27]. The second GPT model (GPT-2) has over 1.5 billion parameters, considerably more than the original, and was initially only released to the public in a smaller version [12]. Recently, the third version (GPT-3) was released [4]. GROVER is a semi-supervised left-to-right decoder, which is trained on human-written text.

BERT, developed by Google in 2019, is one of the most recent innovations in machine learning techniques for NLP [8]. The model “[...] is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers” [8]. For pre-training, Devlin et al. [8] constructed a dataset of over 800 million words. BERT only uses the encoder of the Transformer architecture [16], together with the WordPiece embedding model, which has around 30,000 tokens in its vocabulary [8]. An embedding can be a combination of multiple sub-word tokens, so that fewer vocabulary errors occur [28]. Devlin et al. [8] used two pre-training tasks. The first one is called Masked Language Modeling (MLM): during training, 15% of the tokens in a sequence are not represented by their original tokens but are replaced with a “[MASK]” token, so that the model can learn the whole context of the sequence [13]. Masking is necessary because, due to the bidirectionality of the model, each word would otherwise see itself during pre-training [8]. The second pre-training task is called Next Sentence Prediction (NSP), where the model takes a sentence pair as input [13].
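As a rough, hedged sketch of the MLM masking scheme (not the authors' or Google's implementation), the function below applies BERT-style masking to a token list: about 15% of positions are selected, of which 80% become “[MASK]”, 10% are replaced by a random token, and 10% are left unchanged. The token list and vocabulary are invented for the example.

```python
import random

def mask_tokens(tokens, vocab, mlm_prob=0.15, seed=0):
    """BERT-style MLM masking sketch: select ~15% of positions; replace 80%
    of them with [MASK], 10% with a random vocabulary token, and keep 10%
    unchanged. Returns the masked tokens and the prediction targets
    (-100 marks positions that are not predicted)."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token
    return masked, labels

# Toy example input.
vocab = ["the", "news", "article", "is", "fake", "real"]
print(mask_tokens(["the", "article", "is", "fake"], vocab))
```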

Liu et al. [21] stated that the BERT model is undertrained and therefore created RoBERTa (A Robustly Optimized BERT). Their model was trained on additional data, for a longer period of time, and with dynamic masking during MLM. They achieved state-of-the-art results on GLUE, RACE, and SQuAD and improved the results of the original BERT [21]. After the release of RoBERTa, the authors of [18] published “A Lite BERT” (ALBERT) version of BERT. They criticized that the original BERT has limitations regarding GPU and TPU memory and a long training time, and therefore set out to reduce the number of parameters in BERT. ALBERT achieved state-of-the-art results on the GLUE, RACE and SQuAD natural language processing benchmarks. Its results were even better than those of the aforementioned RoBERTa, despite it having fewer parameters than the original BERT version [18]. The distilled version of BERT (DistilBERT) is another recently developed model that reduces the original model size by 40%. The model is 60% faster than the original BERT, which makes it cheaper to use, while still achieving similar results [30]. XLNet, in turn, uses autoregressive language modeling and outperforms BERT on 20 NLP tasks, such as question answering, natural language inference, and sentiment analysis. Yang et al. [42] stated that BERT suffers from a discrepancy between the masking used in pre-training and its absence in fine-tuning, and therefore chose a different approach to gain better results. They also used two attention streams instead of only one [42].

3.2 Related Work

A few studies have applied stance detection rather than classifying the fakeness of an article in order to provide new information about false articles. Jwa et al. [16] focused on the stance between headlines and texts of articles with the FakeNewsChallenge (FNC-1) dataset. Stance detection determines whether a text is in favor of or against a given object. Jwa et al. [16] tested two approaches with BERT. For the first model they only changed the loss function during fine-tuning, whereas for the second model additional news data was gathered for pre-training. Dulhanty et al. [9] also used the FNC-1 dataset, but tested it with RoBERTa. Slovikovskaya [37] used the same dataset but added additional data for stance detection. The author used BERT, XLNet and RoBERTa, of which the latter gained the best result. Similarly to Jwa et al. [16], Soleimani et al. [38] created two BERT models for evidence retrieval and claim verification based on the data of the FEVER challenge. Another approach, on the relation between two titles of fake news, was proposed by Yang et al. [41]. They used the data of the WSDM 2019 Classification Challenge on Kaggle with titles in Mandarin.

Regarding binary classification, Mao and Liu [23] presented an approach for the 2019 FACT challenge with Spanish data, labeled as fact and counterfact. The authors reported that their model was overfitting, which is why they only reached an accuracy of 0.622. Levi et al. [19] studied the differences between titles and body texts of fake news and satire with BERT as the model. Rodriguez and Iglesias [28] compared BERT to two other neural networks on a binary fake news classification task. They used the Getting Real About Fake News dataset with additional real news articles. Aggarwal et al. [1] tested XGBoost, CNN, and BERT with the NewsFN dataset, which is well balanced between fake and real articles. Their best result was an accuracy of 97.021% with the BERT-base-uncased version.

Liu et al. [20] performed a multi-class classification on short statements with BERT and reached an accuracy of 41.58% with additional metadata and 34.51% with the statements alone. Antoun et al. [2] used XLNet, RoBERTa and BERT with a dataset from the QICC competition for a binary classification of fake news. Their best model (XLNet) reached an F1-score of 98%. The second task was news domain detection, split into six classes: Politics, Business, Sports, Entertainment, Technology, and Education. For this task they used several models beyond the Transformers. RoBERTa reached 94% accuracy, whereas a Bi-LSTM with attention, based on ELMo word embeddings, achieved the same result with an overall better performance. It has to be mentioned, though, that the dataset used contained only 432 articles in total. Cruz et al. [6] created a dataset for binary fake news classification in the Filipino language. Additionally, they looked into generalizability across different domains, the influence of pre-training on language models, and the effect of attention heads on the prediction output. They used ULMFiT, BERT, and GPT-2 for their experiments, of which GPT-2 with multi-tasking attention heads gained the best results (96.28% accuracy). The study by Schwarz et al. [31] explored embeddings of multi-lingual Transformers as a framework to detect fake news.

4 Methodology

For this work we used the FakeNewsNet [34,35,36] dataset, which provides news articles with a binary classification (fake or real) and is automatically updated. Since this work presents a content-based approach, only the body texts and titles from the dataset were used. As ground truth, Shu et al. [34] used the fact-checking websites PolitiFact and GossipCop. The following Transformer models were used for the experiments: BERT, RoBERTa, ALBERT, DistilBERT, and XLNet.

4.1 Data Distribution

At the time of downloading, the dataset contained 21,658 news articles. Since this work needs both the title and the body text, all rows where one of these features was missing were deleted. After this step the dataset contained 5,053 fake and 15,998 real articles, 21,041 in total. The mean body text length was 3,408.728 characters, whereas the mean title length was 59.106 characters. The longest body text contained 100,000 characters and the longest title 200 characters; the shortest were 14 characters (text) and 2 characters (title). Comparing fake and real articles, the mean length of real body texts is about 300 characters longer, whereas fake titles are about 7 characters longer than real ones. The cleaned dataset was used for the following preprocessing steps and the creation of the different files for the experiments.
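Statistics of this kind can be computed, for instance, with pandas; the file name and the column names ("title", "text", "label") in the sketch below are assumptions and may differ from the actual FakeNewsNet export.

```python
import pandas as pd

# Hypothetical export of the downloaded dataset.
df = pd.read_csv("fakenewsnet_articles.csv")

# Drop rows where title or body text is missing.
df = df.dropna(subset=["title", "text"])

# Character-length statistics per feature and per class.
for col in ["title", "text"]:
    lengths = df[col].str.len()
    print(col, lengths.mean(), lengths.min(), lengths.max())
print(df.groupby("label")["text"].apply(lambda s: s.str.len().mean()))
```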

4.2 Preprocessing

Different preprocessing steps were carried out to test whether the models produce different prediction outcomes depending on article length and other factors. The first step was to delete all titles shorter than 20 or longer than 120 characters. Most of the short titles were merely the names of the websites the articles were published on, and the longer titles were often error messages, on which the model should not learn to distinguish fake from real articles. This was discovered by manually inspecting a sub-sample of titles. The same process was applied to the body texts, since many short texts were extracted error messages instead of actual content. Therefore, all body texts with more than 10,000 or fewer than 1,000 characters were deleted. When going through the dataset manually, it stood out that many of the articles labeled as real were transcripts, i.e. conversations or interviews, often of politicians, consisting mostly of spoken word. Since the dataset contains more real articles than fake ones, the model might learn to distinguish spoken language from written articles instead of fake from real. Based on this observation, the second preprocessing step was to remove all articles with more than nineteen colons, since the transcripts usually contained 20 or more colons per body text. All articles contained HTML strings, because the dataset was retrieved by a crawler. It stood out that many fake articles contained [edit], which was the only such string deleted from the dataset, since there are fewer fake articles and the models should not learn the difference between fake and real based on this alone. The last preprocessing step consisted of deleting all non-ASCII characters and digits, to see whether this makes any difference when evaluating the experiments. Additionally, the newline tags were deleted in all preprocessed files.
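The following sketch shows how these filters could be implemented with pandas and regular expressions; it is an illustration under assumed column names ("title", "text") and a hypothetical file name, not the authors' actual code.

```python
import re
import pandas as pd

df = pd.read_csv("fakenewsnet_cleaned.csv")  # hypothetical export of the cleaned dataset

# Keep titles of 20-120 characters and body texts of 1,000-10,000 characters.
df = df[df["title"].str.len().between(20, 120)]
df = df[df["text"].str.len().between(1000, 10000)]

# Remove transcripts: articles whose body text contains more than 19 colons.
df = df[df["text"].str.count(":") <= 19]

def clean(text):
    text = text.replace("[edit]", "")         # crawler residue frequent in fake articles
    text = re.sub(r"[^\x00-\x7F]", "", text)  # drop non-ASCII characters
    text = re.sub(r"\d", "", text)            # drop digits
    return text.replace("\n", " ")            # drop newline tags

df["text"] = df["text"].apply(clean)
df["title"] = df["title"].apply(clean)
```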

Table 1 shows all files with the various preprocessing steps. They were split into text only, titles only and the concatenation of titles and text. Depending on the preprocessing steps, the smallest dataset has a more balanced distribution than the original data: 3,358 fake and 8,586 real articles, 11,944 in total. The longest text is 9,919 and the shortest 926 characters long; the titles range from 20 up to 120 characters.

Table 1. Preprocessed files.

The dataset was split into a training set (80%) and a test set (20%) using a stratified split to balance the classes in both sets. During the implementation of the models, the training set was additionally split into training and validation data (10% of the training set). Depending on the file size and the preprocessing steps, the classes are more or less balanced (less so for the largest dataset). Other standard preprocessing methods, such as removing stop words and punctuation, lemmatization and stemming, were not carried out, because the Transformer models need all tokens to understand the context of a sentence. Valuable information would go missing if words were cut or deleted or the sentence structure were altered.
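Such a stratified split can be realized, for example, with scikit-learn; the sketch below is a minimal illustration on a toy dataframe with assumed column names, not the authors' exact code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for one of the preprocessed files; columns are assumptions.
df = pd.DataFrame({"text": ["some body text"] * 20,
                   "title": ["some title"] * 20,
                   "label": [0, 1] * 10})  # 0 = real, 1 = fake (assumed encoding)

# 80/20 stratified split so that both sets keep the fake/real ratio.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42)

# Hold out 10% of the training data for validation, again stratified.
train_df, val_df = train_test_split(
    train_df, test_size=0.1, stratify=train_df["label"], random_state=42)
```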

Table 2. Transformer models used for the experiments.

5 Experiments

The experiments in this work were carried out with five different Transformer models, using the PyTorch version of the HuggingFace Transformers library on a GeForce GTX TITAN X GPU. The models used are shown in Table 2. They all have the same number of layers, except DistilBERT, which is the distilled version of the original BERT model and therefore only has 6 layers instead of 12.
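With the HuggingFace Transformers library, the five models can be loaded for binary sequence classification roughly as sketched below; the checkpoint identifiers are the common cased base versions and are our assumption, since the paper does not list exact model names.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint names for the five models (base versions).
checkpoints = [
    "bert-base-cased",
    "roberta-base",
    "albert-base-v2",
    "distilbert-base-cased",
    "xlnet-base-cased",
]

models = {}
for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    models[name] = (tokenizer, model)
```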

The first experiments were conducted with file no. 1, which is completely preprocessed and contains only body text. This file was also used to determine suitable hyperparameters for the following experiments. The batch size and the maximum sequence length were used as recommended by Devlin et al. [8]. After testing different batch sizes, learning rates, warm-up steps, numbers of epochs and sequence lengths, the best hyperparameters were used for the other experiments. We ran the experiments with more than the usual maximum of 5 epochs to gain insight into whether the loss curves change with more epochs and influence the prediction outcomes. First, the different preprocessed body text files were run through the BERT-base-cased model, then the files containing only titles, and finally the combination of titles and body text. After evaluating the results of the BERT model, the same hyperparameters were used for the other Transformer models.
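The fine-tuning itself can be set up, for instance, with the library's Trainer API as in the following sketch; the placeholder data, the dataset wrapper and the concrete hyperparameter values are illustrative assumptions, not the values reported in the paper.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class NewsDataset(torch.utils.data.Dataset):
    """Minimal dataset wrapper returning the dict format the Trainer expects."""
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Placeholder data; in the experiments these come from the stratified split of
# the preprocessed files (labels encoded as 0 = real, 1 = fake).
train_ds = NewsDataset(["example body text ..."] * 8, [0, 1] * 4, tokenizer, max_length=128)
val_ds = NewsDataset(["example body text ..."] * 2, [0, 1], tokenizer, max_length=128)

# Illustrative hyperparameters; the paper tuned batch size, learning rate,
# warm-up steps, number of epochs and sequence length on file no. 1.
args = TrainingArguments(output_dir="out", num_train_epochs=5,
                         per_device_train_batch_size=8, learning_rate=2e-5,
                         warmup_steps=100)
Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()
```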

6 Results

As mentioned before, the experiments were split into body text only, titles only and the concatenation of titles and body text. The cased models were used, with no lower-casing during tokenization. For body text only, documented in Table 3, the highest accuracy achieved was 0.87. For each experiment the best model is highlighted in bold. The first experiment, however, shows that the models do not work well with a high learning rate when predicting the labels on this dataset. The best results were achieved with RoBERTa, although the accuracy values of XLNet are similar. Overall, the results show that all models achieve good predictions with different hyperparameters.

For comparability, the maximum sequence length was not increased beyond 512 tokens, even when a model supported longer sequences. Additionally, the results (Table 3) show that the different preprocessing steps have no major impact on the prediction. Although file 5, which is not preprocessed at all, gains the best results with all models, its accuracy and loss do not differ significantly from the other experiments. This shows that deleting transcripts, which could otherwise become a learned bias during training, has no further impact on the outcome of the models.

Table 3. Body text only - experiment results.

The results for the titles (Table 4) show lower accuracy and higher loss than those for the body texts. The highest accuracy was 0.85. Again, RoBERTa and XLNet gained the best results and show the same behavior with regard to preprocessing as with the body texts.

Table 4. Title only - experiment results.

Lastly, Table 5 shows the results for the concatenation of titles and body texts. Again, the highest accuracy value is 0.87, but for this type of experiment the best models were DistilBERT and XLNet. The results differ only slightly between the models, and the preprocessing did not change the predictions significantly. It is notable, though, that these experiments achieve the overall best results of the three experiment types.

Table 5. Title and body text - experiment results.

To put these results into perspective, we applied some of the methods used in the original paper. The authors of the dataset [34] split the data into PolitiFact and GossipCop articles. Their best results were an accuracy of 0.723 with a CNN for GossipCop articles and 0.642 for PolitiFact with Logistic Regression. For our evaluation we used Gaussian Naive Bayes, Support Vector Machine and Logistic Regression with one-hot encoding and the default parameters of scikit-learn, as in the original paper. We used our previously preprocessed files for both types; additionally, we applied standard preprocessing such as stemming, lemmatization, and the removal of stop words and punctuation.
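A baseline of this kind can be reproduced roughly as below with scikit-learn; the binary count vectorizer stands in for one-hot encoding of the vocabulary, and the placeholder data and classifier choices are illustrative assumptions rather than the exact setup of [34].

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder data; in the evaluation these are the stemmed/lemmatized
# article texts and their labels (0 = real, 1 = fake) from the splits.
train_texts, train_labels = ["real news text", "fake news text"] * 10, [0, 1] * 10
test_texts, test_labels = ["another real text", "another fake text"], [0, 1]

# Binary bag-of-words as a stand-in for one-hot encoding of the vocabulary.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts).toarray()  # GaussianNB needs dense input
X_test = vectorizer.transform(test_texts).toarray()

# Default scikit-learn parameters, as in the baseline comparison.
for clf in (GaussianNB(), SVC(), LogisticRegression()):
    clf.fit(X_train, train_labels)
    pred = clf.predict(X_test)
    print(type(clf).__name__, accuracy_score(test_labels, pred))
    print(confusion_matrix(test_labels, pred))
```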

Table 6. Comparison of Transformer models against a baseline.

As shown in Table 6, the standard supervised methods seem to have problems with either false positive (FP) or false negative (FN) classifications. The only models whose results come close to the Transformer models are the SVM and LR, but they seem to learn mainly one class. In contrast, two experiments with the Transformer models have a much more balanced confusion matrix, even though one class contains more articles than the other. This shows that these models gain better results overall.

Table 7. Sensitivity-specific metrics for all models.

Lastly, Table 7 compares sensitivity-related metrics for one experiment on the body text. For each metric the macro-average was chosen because of the class imbalance; the comparison shows that the Transformers are the better solution for this dataset.
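Macro-averaged metrics of this kind can be computed, for example, with scikit-learn; the labels and predictions below are placeholders, not results from the paper.

```python
from sklearn.metrics import precision_recall_fscore_support, classification_report

# Placeholder predictions; in the paper these come from the body-text experiment.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 0, 0]

# Macro-averaging weights both classes equally despite the class imbalance.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
print(precision, recall, f1)
print(classification_report(y_true, y_pred, target_names=["real", "fake"], digits=3))
```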

7 Conclusion and Future Work

The results of this work show that a content-based approach can achieve promising results for detecting fake news, even without hand-engineered features and even when using only the titles. Nevertheless, the literature has shown that some approaches still achieve better results than the Transformer models. The results of this work are comparable with current state-of-the-art fake news detection approaches, especially those based on the recently introduced Transformer architectures. Almost all experiments, after fine-tuning of the hyperparameters, reached results of over 80% accuracy on the validation and test sets without overfitting the data. This work therefore shows that Transformer models can detect fake news based on short statements as well as on complete articles. Fake news detection is still underrepresented in research, and automatic detection without human intervention in particular remains an open research issue. An important direction for further research is to explore methods of explainable artificial intelligence, in order to better understand the different fake news concepts, to gain insights into the models and which words have the highest impact on predicting the fake and real classes, and to explain the high accuracy for short titles as well as the influence of removing spoken language.