1 Introduction

Detecting fake news is an interdisciplinary problem, as it requires us to examine which methods were used to disseminate the news (e.g., social networks [53]), the links between the various actors involved (e.g., by using the information available in public Knowledge Graphs like Wikipedia), the propaganda tools (e.g., language can often be examined through the lens of semantics [6]) or even the geopolitics (e.g., as shown by the Cambridge Analytica scandal, some news might be targeted at specific groups who might be more likely to respond to it). At a superficial level it is important to distinguish between satire and political weapons (or any other kind of weapons built on top of deceptive news) [8], or between the various news outlets that spread them, but when analyzing a news item it often helps to deploy a varied Natural Language Processing (NLP) arsenal that includes sentiment analysis, Named Entity Recognition, Linking and Classification (NERLC [26]), n-grams, topic detection, part-of-speech (POS) taggers, query expansion or relation extraction [65]. NLP tools are often supported by large Knowledge Graphs (KGs) like DBpedia [35], which collects data about entities and concepts extracted from Wikipedia. The extracted named entities and relations are linked to such KGs whenever possible, whereas various sentiment aspects, polarity or subjectivity might be computed according to the detected entities. Features like sentiment, named entities or relations render a set of shallow meaning representations and are typically called semantic features. In contrast, POS tags or dependency trees render syntactic features.

The underlying assumption made by most models used for detecting fake news is that the title and style of an article are sufficient to identify it as fake news. This is mostly true for news that originates from verifiably bad sources, which is rarely the case anymore. Therefore, we think that a holistic approach, one that includes a machine generated Knowledge Graph (KG) [43] of all the stakeholders involved in the events we are interested in, is absolutely needed. Such a holistic approach includes methods that can generate and learn graphs of entities associated with fake news.

Our contribution is a method for integrating semantic features into the training of fake news classifiers. The goal is to show how semantic features can be used to improve fake news detection. For this, we compute semantic features (sentiment analysis, named entities and relations), which are added to a set of syntactic features (POS, part-of-speech tags, and NPs, noun phrases) and to the features of the original dataset. On the resulting augmented dataset we apply various classifiers, including Deep Learning (DL) models like Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs) and Capsule Networks. For the Liar data set [62], using semantic features improves the fake news recognition accuracy significantly.

The paper is organized as follows. Section 2 presents the problem description. Section 3 surveys the most recent results in fake news recognition. Section 4 introduces our approach for building machine generated KGs for semantic fake news detection. Section 5 describes the experimental results. The paper is concluded in Sect. 6.

2 Problem Description

It is difficult to establish with certainty the truthfulness of any kind of declaration. Fact-based declarations might be easier to check, as they simply require some fast queries against a Knowledge Graph or a search engine, whereas political declarations might be more context-dependent and require more fine-grained semantic information. This is the main reason why this field of study became so important in the last decade, once social media analysis at scale became a reality.

2.1 Background

There are various definitions of fake news. Most of them are based on Allcott and Gentzkow’s paper [4] about the impact of fake news on the 2016 US Election.

Definition (based on [4]). A news item or a part of a news item will be considered fake if it can be verified that its content is false.

Fake news detection can be considered a part of a larger class of tasks focused around fact checking [58]. Some related tasks include fact verification under open-world assumption, common sense reasoning (e.g., understanding the arguments required to support certain premises), subjectivity and emotive language detection (e.g., predicting whether the document originates on websites well-known for spreading hoaxes or propaganda), deceptive language detection, rumor detection (e.g., identification and classification of unverified reports), speaker profiling or click bait deception.

From the point of view of Machine Learning, fake news detection is a binary classification (e.g., true or false) or multi-class classification (e.g., when multiple degrees of truth are taken into account) problem. Input data includes a statement and some information about it (e.g., speaker, location, party affiliation). The expected output is a binary label or a more fine-grained label (e.g., true, mostly true, etc).

In order to perform semantic fake news detection, some additional signals like the past truth history of a speaker or the relations between speakers and publishers should be considered if possible. The idea of using past inaccuracies for each speaker was introduced with the Liar data set [62] under the name of credit history, but it is rarely used in practice.

Definition (based on [62]). Credit History (CH) is the historical count of false (or provably untrue) statements for an actor.

A credit history score can also be replaced by a single aggregated count of all the untrue values. Such credit scores allow us to understand diverse perspectives when analyzing news and help determine which person or group might benefit from spreading certain news. An earlier iteration of this idea was also explored in the context of social media networks: credibility propagation [27].
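For illustration, the following minimal sketch shows how such an aggregated credit score might be computed from the per-class counts; the column names mirror the credit history fields described for the Liar data set and are assumptions, not the exact field names used in our code.

```python
# Minimal sketch: aggregate the per-class credit history counts into one score.
# The field names below are assumptions based on the Liar data set description.
from typing import Dict

def aggregate_credit_score(counts: Dict[str, int]) -> int:
    """Sum all counts of past statements that were not rated True."""
    untrue_classes = [
        "barely_true_counts",
        "false_counts",
        "half_true_counts",
        "mostly_true_counts",
        "pants_on_fire_counts",
    ]
    return sum(counts.get(c, 0) for c in untrue_classes)

speaker_history = {
    "barely_true_counts": 4,
    "false_counts": 7,
    "half_true_counts": 12,
    "mostly_true_counts": 9,
    "pants_on_fire_counts": 2,
}
print(aggregate_credit_score(speaker_history))  # 34
```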

2.2 Problem Statement

Expanding upon the idea of semantic fake news detection, it is important to look closely at two concepts: credit score and degree of truthfulness.

Generalizing the idea of a credit score, we can define it as a graph that models all the relations between the various entities present in a statement or document. Such a graph can be seen as a Knowledge Graph [18] if the information is fine-grained enough to lead to good inferences.

Definition A credit history graph is a graph that contains all the entities, their credit histories and links between them as they are available from a Knowledge Graph (KG) or generated from a collection of texts.

Credit history can be extended to cover all important entities related to a fake news item (e.g., speakers, publishers). By doing this, the resulting graph becomes similar to the mention-entity graphs used in Named Entity Linking, which include all the links between the various entities from a text. The main difference is that the extended credit history scores can be imposed over the mention-entity graph as weights. Such scores can be created automatically through processes similar to those applied for the creation of machine generated Knowledge Graphs [19] (e.g., counting, aggregation, rule-based generation).
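A minimal sketch of such a weighted graph is shown below, using networkx; the entities, credit scores and relations are illustrative assumptions rather than the exact construction used in our pipeline.

```python
# Sketch: a credit history graph imposed over a mention-entity graph.
# Entities, scores and relations below are illustrative assumptions.
import networkx as nx

graph = nx.DiGraph()

# Nodes: entities with their aggregated credit history attached as an attribute.
graph.add_node("dbr:Speaker_A", type="Person", credit_score=34)
graph.add_node("dbr:Organization_B", type="Organisation", credit_score=5)
graph.add_node("dbr:Topic_C", type="Topic", credit_score=0)

# Edges: links observed in statements or taken from a Knowledge Graph,
# weighted by how often the source made untrue claims involving the target.
graph.add_edge("dbr:Speaker_A", "dbr:Topic_C", relation="talks_about", weight=12)
graph.add_edge("dbr:Speaker_A", "dbr:Organization_B", relation="memberOf", weight=1)

# The weights can then be queried when scoring a new statement.
for source, target, data in graph.edges(data=True):
    print(source, data["relation"], target, data["weight"])
```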

Credit history itself, however, is not enough, as it merely provides us with proxy indicators about the truthfulness of some of the actors involved in a statement. Since information about these actors is generally incomplete regardless of the Knowledge Graph (KG) used, we cannot expect these indicators to be reliable predictors. Adding some fine-grained semantic information like sentiment, entities or relations should help us build better credit histories. This kind of semantic information can also be considered an alternative to the credit history. Such features can be extracted from both traditional KGs (e.g., DBpedia, Wikidata) and from the texts themselves.

Definition Relational features include all the features extracted directly from the texts or the named entities detected in them through the exploitation of Knowledge Graphs.

While we focus on extracting all the needed features directly from the text, the Tri-Relationship framework described in Shu’s paper [54] also deserves a mention here, even though it is focused on the objects involved in distributing the news (e.g., people, organizations). All the mentioned approaches share the idea of enriching the fake news text with a set of annotations, in order to provide some context. This naturally leads to our main research question:

What are the most useful semantic features that can be used to improve fake news detection? Ideally, such features should be integrated into the neural models, whenever possible. Today, due to the cost of developing good semantic systems, some of these features might come from various external tools. The semantic features need to be selected according to the task and dataset at hand. If the task refers to the detection of fake news as spread by people via their statements, then the main entities we will be interested in might include people, organizations, locations and events.

The recent success of language models based on Transformers [59], like BERT [15] or RoBERTa [39], might potentially change the entire field of Natural Language Processing. Such models have been shown to perform well on a variety of tasks, including sentiment analysis, Named Entity Recognition (NER), semantic role labelling or dependency parsing. This is partially due to their ability to pick up various language phenomena like direct objects, noun modifiers or coreferents [13]. Examining questions about semantic features or pre-Transformer NLP models might not lead to results as good as those obtained with the BERT-inspired language models. However, considering that a lot of these new language models are quite large, expensive to train and potentially damaging to the environment [57], we think this is a task worth pursuing. Even more importantly, Transformers are not the only models currently showing good results in NLP: bidirectional LSTMs [33], Capsule Networks [51] or variational autoencoders like Vampyr [21] have also been shown to provide good results for a fraction of the cost and time required to train Transformer models.

3 Related Work

Several articles about fake news have been published every week in recent years, around 40% of them dedicated to actual models (neural networks or others) developed for classifying, creating or examining fake news. Interest in this topic has increased considerably after the mid-2018 boom of NLP language models. The large majority of these articles are surveys or political studies in which the propagation of fake news plays a central role. It has to be noted that this type of disinformation is not limited to political news, but extends to any kind of news (e.g., sports, entertainment, culture), including scientific publications; therefore many articles simply examine various facets of it. Nevertheless, due to the rapid expansion of the literature on this subject, we have chosen to focus this section on several aspects we consider important, mainly detection methodology and models.

An exploration of the fake news phenomenon over more than a decade (2006–2017) was built around Twitter rumor cascades by a group of social scientists [61]. Multiple surveys (e.g., [53, 66]) are focused on building various fake news classifications. A good survey of the various fake news types (e.g., visual-, user-, post-, network-, knowledge-, style-, stance-based) and the typical methods used for their detection can be found in [44]. A recent survey [66] not only identifies the major types of fake news (fake news, biased/inaccurate and misleading, each with its own subcategories), but also classifies the various actors involved in spreading the news, their motives, as well as the various methods that can be used to combat the propagation of false information. Another survey takes a data mining perspective on the spread of fake news in social media [53]. Rubin [49] classifies deceptive articles into three large classes (serious fabrications, hoaxes, humor or satire) and defines a set of criteria for creating a good text corpus for fake news detection [48], namely that such data sets should only contain verifiable facts that happened in a certain interval and were reported in a similar style, though with various degrees of cultural influence. Any such corpus should focus on text-only items, as they are easier to process.

A recent survey on the role of fake news detection in decision making [23] identifies three large areas of interest: taxonomy of false information (e.g., rumor, fake news, hoax, misinformation), Machine Learning (ML) techniques (e.g., supervised, unsupervised, semi-supervised) and Deep Learning (DL) techniques (e.g., supervised, unsupervised, etc).

Most of the time, simply analyzing the text will not yield good results; therefore some models also incorporate data about the networks (e.g., social media, organizations) through which the news spreads. Ruchansky [50] proposes the CSI model, which stands for Capture, Score and Integrate, combining information on the temporal activity of the users, their behavior, and a classifier. Similar ideas, oriented towards identifying the geography of fake news, were described in [17]. An approach that is closer to verification is presented in [5]; its focus is on a set of annotated statements that are verified against top news sources (e.g., New York Times, CNN, The Guardian). An approach that checks various parts of articles is presented in [55] and uses 3HAN, a Hierarchical Attention Network (HAN) with three layers.

A model for early detection of fake news based on news propagation paths is described in [38]; it is based on a hybrid time-series classifier that contains both Recurrent Neural Networks (RNNs) and CNNs. Wu [63] assumes that intentional fake news is typically manipulated to look like real news and builds a classifier based on social media propagation pathways using LSTM-RNNs and embeddings. Vo and Lee [60] focus on the story told by fake news URLs and the co-occurrence of various entities through such links. Their GAU model combines a Guardian-Guardian SPPMI matrix, auxiliary information and a URL-URL SPPMI matrix, and is shown to outperform a series of baselines inspired by recommendation systems (e.g., Matrix Factorization or CoFactor).

A set of LSTMs is used for performing a multi-source multi-class fake news detection (or MMFD) in [28]. The advantage of this method is the multi-source fusion of the MMFD framework, since it can determine various degrees of fake news. The accuracy of the approach is not very high, but given the fact that it combines three large components (automated feature extraction, multi-source fusion and fakeness discrimination) it is promising. Aghakhani [2] showed that a Generative Adversarial Network (GAN) [20] can perform relatively well for detecting deceptive reviews.

The latest NLP multi-task learning models are usually built around Transformer architectures [59]. An early paper from the WSDM fake news challenge [64] shows that a number of models that use the BERT architecture or its embeddings perform quite well. Several of the systems submitted for SemEval 2019 Task 4 (Hyperpartisan News Detection) have also used Transformer architectures [29]. A two-stage model that uses the BERT architecture [15] and relies on both information and attention mechanisms is presented in [37]; it achieves an accuracy that is more than 10% higher than the original baseline. Due to the rise of generative language models like BERT or GPT-2 [56], neural fake news took the media and public by storm and caused a lot of discussion about ethics in AI development. A staged release strategy like the one used for GPT-2 was essential in order to extend the time available for experiments and minimize potential damage. Another strategy to defend against neural fake news is to train the model against itself, similar to how a GAN works, as described in [67]. While neural fake news generators can be seriously damaging, it is worth considering them as a means of increasing the size of training data sets.

Including all the various articles that showcase the latest advances in sentiment analysis, Named Entity Linking or relation extraction is beyond the scope of this paper. However, we think it is important to note at least several surveys that cover these topics, as these areas are very important for the analysis of fake news. A good explanation of the relations between Named Entity Linking, Information Extraction and Knowledge Graphs can be found in [3]. Good sentiment analysis engines are difficult to build, as they are generally umbrella technologies built on top of a chain of tools that might include POS taggers, Named Entity Linkers or relation and aspect extractors, as described in [10]. A survey that includes details about the most common Deep Learning algorithms used for sentiment analysis or named entity extraction/classification can be found in [65].

An earlier version of the technique presented in this article [9] focused on balanced classification. The current article describes the more general unbalanced classification problem, while updating all the models in line with the current literature and significantly expanding the experimental section (e.g., providing results on multiple data sets and more in-depth explanations of the results).

4 Methodology

In order to fully exploit the relations between the entities mentioned in a news statement, our procedure includes the following steps:

  • Metadata collection The first step is to simply collect the sentiment, entities and additional metadata available from third party tools.

  • Relation Extraction A second pass will collect both (i) the general relations found in a KG, and (ii) those computed from the current texts.

  • Embeddings The last step refers to the adaptation of various neural models (e.g., by adding an embeddings layer) for improving fake news detection.

The features included in the last step are only internal, whereas the features included in the other steps can also be external. The entire process is illustrated in Fig. 1.
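The overall flow can be summarized by the following sketch; the helper functions are hypothetical placeholders for the components detailed in Sects. 4.1–4.3, not part of our released code.

```python
# High-level sketch of the feature augmentation flow.
# collect_metadata, extract_relations and build_model are hypothetical
# placeholders for the components described in Sections 4.1-4.3.

def augment_statement(statement: str) -> dict:
    features = {}
    features.update(collect_metadata(statement))       # sentiment, NE, NEL
    features.update(extract_relations(statement))      # text + KG relations
    return features

def train(statements, labels):
    augmented = [augment_statement(s) for s in statements]
    model = build_model(embeddings="glove.840B.300d")  # internal features
    model.fit(augmented, labels)
    return model
```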

The intuition behind the data modeling that produced the additional semantic features is that, by adding extracted entities and making a clear distinction between direct and indirect speech, we create the premises for a more sophisticated analysis that can pinpoint the personal history of a speaker with both the issue at hand (or subject) and with all the parties involved in that issue. If such an analysis is extended, it should also be possible, down the road, to identify more obscure details about a speaker, for example whether a named entity labelled as a politician follows the party line or not. In other words, it opens up the possibility of using the graphs to peek behind the scenes of various declarations.

Fig. 1 External and internal semantic features for neural network models

4.1 Metadata Collection Pipeline

Our pipeline for generating metadata has the following components:

  • Sentiment Analysis (SA) Sentiment annotations can exist at multiple levels: (i) document; (ii) sentence; (iii) aspect-based [65]. Current state-of-the-art systems are typically aspect-based, therefore each aspect of the detected entity features can get an estimate of the sentiment value. Since our data set (the Liar data) contains short statements, we use aggregated sentence-level sentiment polarity and subjectivity values computed with the TextBlob library (footnote 1).

  • Named Entities (NE) Since the results for NE extraction are typically good enough [26], almost any modern NLP library can be used for this task. Both the Spacy and Stanford NLP libraries have good performance.

  • Named Entity Links (NEL) Generally, NERLC (NER + linking and classification) tasks are considered more complicated and typically require dedicated NEL engines [26]. Any good NEL engine can be used for this task. We use a Python wrapper (footnote 2) around a DBpedia Spotlight [14] instance that was run locally using the English model (footnote 3).
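A minimal sketch of this metadata collection step is shown below; the DBpedia Spotlight endpoint URL, the confidence threshold and the spaCy model name are assumptions used for illustration.

```python
# Sketch of the metadata collection step: sentiment (TextBlob),
# named entities (spaCy) and entity links (DBpedia Spotlight wrapper).
import spacy
import spotlight                      # pip install pyspotlight
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
SPOTLIGHT_URL = "http://localhost:2222/rest/annotate"  # assumed local instance

def collect_metadata(statement: str) -> dict:
    blob = TextBlob(statement)
    doc = nlp(statement)
    try:
        links = spotlight.annotate(SPOTLIGHT_URL, statement, confidence=0.5)
    except Exception:                 # no annotations found or service down
        links = []
    return {
        "polarity": blob.sentiment.polarity,          # in [-1, 1]
        "subjectivity": blob.sentiment.subjectivity,  # in [0, 1]
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "entity_links": [link["URI"] for link in links],
    }
```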

4.2 Relation Extraction

Instead of using existing solutions, we develop a simple Relation Extraction (REL) component that queries DBpedia. Where possible, the existing entities are enriched with additional data obtained via SPARQL queries to DBpedia. This is particularly important in order to discover more relations between a speaker (which we call the source entity) and his/her subject (which we call the target entity). We consider two types of relations:

  • (i) relations extracted directly from the provided news statements by defining the types of relations we are interested in via POS tags (for example, for extracting relations between two entities we will generally be interested in NP–V–NP chains, i.e., a verb between two noun phrases, whereas additional relations for an entity can be added by extracting S–V–O triplets); a minimal extraction sketch is shown after this list;

  • (ii) extracted from the DBpedia Knowledge Base (e.g., if dbr:Donald_Trump mentions dbr:Barack_Obama in a document, all the triples that belong to these entities are extracted from DBpedia and a subset of common links like dbo:orderInOffice or dbo:President is identified).
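As referenced in (i), the following minimal sketch illustrates the text-based extraction using spaCy's dependency parse to pull out subject–verb–object triples; it is an illustration of the idea rather than the exact rule set used in our component, and the example sentence is hypothetical.

```python
# Sketch: extract S-V-O triples from a statement with spaCy's dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(statement: str):
    triples = []
    doc = nlp(statement)
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.lefts if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.rights if c.dep_ in ("dobj", "attr")]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.text, token.lemma_, obj.text))
    return triples

print(extract_svo("Donald Trump criticized Barack Obama over the health care law."))
# e.g. [('Trump', 'criticize', 'Obama')]
```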

The machine generated KG includes all the DBpedia triples that belong to the entities collected from the data set. The relations extracted from text are schemaless, whereas the relations extracted from the KG are grounded in a schema (e.g., the DBpedia ontology). This component is implemented with the Python libraries RDFLib, SPARQLWrapper and Spacy (footnote 4).
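For (ii), a minimal sketch of the KG lookup using SPARQLWrapper is shown below; it queries the public DBpedia endpoint rather than the local instance used in our experiments, and the query and result limit are illustrative.

```python
# Sketch: fetch the DBpedia triples in which a linked entity appears as subject.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_triples(entity_uri: str, limit: int = 100):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT ?p ?o WHERE {{
            <{entity_uri}> ?p ?o .
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [(entity_uri, b["p"]["value"], b["o"]["value"])
            for b in results["results"]["bindings"]]

triples = dbpedia_triples("http://dbpedia.org/resource/Barack_Obama")
```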

A more complex graph can also be built if we consider a split of the entity links by topics. In such a scenario a speaker can be a friend of his subject (though only in the context in which the subject is a person or organization) when discussing one topic, but an enemy when discussing another topic. This closely models real-life allegiances in politics or business, but not necessarily in other fields like sports or culture; therefore it is left outside the scope of this publication, as it requires a more complex analysis.

If the data sets include full texts and not just short sentences, the semantic modelling and the associated graphs quickly become complex. In such scenarios it is best to use a standard like the RDF Data Cube Vocabulary (QB) (footnote 5), as it is also important to keep the context in which each statement was made; multiple counts can then be modelled as observations with a source (the speaker) and a target (the subject of the speech). It has to be noted that, in order to reduce the size of the graph, it is better to prune entities according to their type (e.g., keep only people, organizations, locations and events).

4.3 Embeddings

Shallow neural architectures that learn word embeddings from distributional semantics (e.g., continuous bag of words architectures like Word2Vec, GloVe or fastText [42]) have been successfully applied to classic NLP problems [65], and should be an integral part of any NLP architecture. Such architectures generally provide fast computation times and lead to good results due to the fact that they capture relational similarities. These embeddings can be applied to a wide-range of tasks (word analogy, human similarity judgements, sentiment analysis, word prediction, hybrid recommendation, etc.).

If the corpus used is clean and large enough (several tens of thousands of examples [42]), embeddings can be an ideal solution for building baselines. Only the most used pre-computed embeddings (word2vec, GloVe, fastText) were included, for the top 60k English words. The component that loads them uses negative sampling and GloVe with a fixed size of 300 (glove.840B.300d). The Keras API offers the possibility to add an embeddings layer to a neural network. This layer can be used for: (i) learning and saving the embeddings together with the word vectors; (ii) loading pre-trained embeddings. In all our DL models, we place such a layer after the inputs and use it for loading embeddings. Such a layer is especially effective when the number of training examples is relatively small [45].
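A minimal sketch of how such a pre-trained embeddings layer can be set up in Keras is shown below; the vocabulary size, file path and sequence length are placeholder assumptions.

```python
# Sketch: load glove.840B.300d vectors into a frozen Keras Embedding layer.
import numpy as np
from tensorflow.keras.layers import Embedding

EMBEDDING_DIM = 300
MAX_WORDS = 60000        # top 60k English words, as in our setup
MAX_LEN = 40             # assumed maximum statement length in tokens

def build_embedding_layer(word_index, glove_path="glove.840B.300d.txt"):
    vectors = {}
    with open(glove_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-EMBEDDING_DIM])
            vectors[word] = np.asarray(parts[-EMBEDDING_DIM:], dtype="float32")

    matrix = np.zeros((MAX_WORDS, EMBEDDING_DIM))
    for word, idx in word_index.items():
        if idx < MAX_WORDS and word in vectors:
            matrix[idx] = vectors[word]

    return Embedding(MAX_WORDS, EMBEDDING_DIM, weights=[matrix],
                     input_length=MAX_LEN, trainable=False)
```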

5 Experiments

The success of our approach depends on a series of components for extracting sentiment scores, named entities, or relations. Therefore, if those components do not perform well, the whole approach will be flawed. First, we would like to find out if such an approach is valid. Therefore, missing a named entity from a statement might not be extremely important at this stage. If the approach proves to be valid, then further work needs to include additional evaluations for all the components in the pipeline, or at least some of their performance scores (when available).

5.1 Data

We use several similar data sets that contain annotated data with multiple degrees of truth for our experiments.

The Liar data set [62] contains politics-related short texts extracted from the Politifact API and classified based on the degree of truth, while also offering credit histories that track the accuracy of the speakers’ statements. The data set is split into three partitions (train, test and validation) and includes six classes that need to be predicted: True, Mostly True, Half True, Barely True, False and Pants-on-fire. The initial paper about the Liar data set [62] identified SVMs as the best classical models and CNNs as the best Deep Learning classifiers. A follow-up paper [40] indicates that LSTMs would be even better. Since our focus is not on credit history (five counts for all the classes that are not True, including the score for the current statement) but on the impact of the relational features, we do not reproduce those results and do not compare against them (Table 1).

Table 1 Data set statistics

The Politifact data set [46] also contains short texts extracted from the Politifact API, but, as opposed to the Liar data set, it does not provide any additional features except for the identity of the speaker. The data set is split into two partitions (train and test) and includes the following classes that need to be predicted: True, Mostly True, Half True, Mostly False, False and Pants-on-fire. The paper that introduced the data set [46] shows that LSTMs with text and no additional features perform best in both the 2-class scenario (when the data is split only into True and False) and the 6-class scenario (when keeping the original annotations). On the associated website, there are an additional five thousand texts from Politifact sister websites that were not annotated and which can be used if there is a need to extend this data set.

As can be seen, the two data sets are quite similar. In fact, except for one class that has a different name, the labels are exactly the same (True, Mostly True, Half True, False and Pants-on-fire). The classes with a different label (Barely True vs. Mostly False) clearly contain similar examples. Due to this, we consider the two data sets as siblings, and ideal for our experiments. Both data sets display a low number of Pants-on-fire examples in all their partitions. Both data sets contain multiple examples for a set of speakers, but Liar also contains some examples taken directly from political campaign statements where the speakers might not be clear or easy to identify. The length of the examined statements is similar, around 18 words and 105–110 characters (e.g., Liar has an average of 17.9 words, whereas Politifact has an average of 18.32). Since the classic tweet length was around 140 characters, we can consider these texts roughly equivalent to old tweets, the main difference being the lack of platform-specific language and shortcuts (e.g., no emojis or RT handles are included in either data set).

Table 2 Accuracy for the test set runs on the Liar dataset
Table 3 Accuracy for the test set runs on the Politifact dataset

Taking into account the various combinations of text, original data set features and relations, we can describe four cases for the experiments. The texts themselves (named text (T)) are simply statements that are taken out of their original context. The features included in the original data set (text+attributes (T+A)) contain information about the subject, the speaker (including job title, state and party affiliation), the credit history, and the context (the location of the speech). The set text+relations (T+R) has semantic features (sentiment polarity, sentiment subjectivity, entities, links, and relations), syntactic features (NPs), and the aggregated score of the credit history counts. The features included in the T+R set are all extracted directly from the statements; there is no need to use the full text of the articles to compute them. This is an important detail, since this operation can always be performed if we have a good set of tools for metadata generation, even when the full articles are not available. The last set of features (identified as all (ALL)) includes all available features. Since Politifact does not really have additional features (the only feature besides the text is the speaker name), our experiments discard the text+attributes (T+A) case for it. This makes it easier to compare results on the two data sets.

The classes are unbalanced. In Tables 2 and 3 we report the test set accuracy scores for all considered models and additional features for both data sets (Liar and Politifact).

5.2 Models

We start by testing several “classic” models [24] built with scikit-learn (Table 2). For these models, using the relational features (T+R) shows some improvements, typically 2–3% above the original features (T+A) of the data set. However, the best scores are far from optimal. Logistic regression and decision tree scores prove to be quite similar across all three runs, while also being the worst scores. We notice a single case (the random forest classifier) in which the added relational features do not yield improvements over a run with only the original text. The best “classic” ML classifier proves to be the SVM, confirming the results from [40].
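A minimal sketch of such a run is shown below; it assumes the relational features have already been computed and stored as numeric columns next to the statement text, and the column names are placeholder assumptions.

```python
# Sketch: a "classic" run combining TF-IDF text features with the
# numeric relational features (T+R). Column names are placeholder assumptions.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def run_svm(train_df, test_df, relational_cols):
    vectorizer = TfidfVectorizer(max_features=20000)
    X_train_text = vectorizer.fit_transform(train_df["statement"])
    X_test_text = vectorizer.transform(test_df["statement"])

    # Append the numeric relational features to the sparse text features.
    X_train = hstack([X_train_text, train_df[relational_cols].values])
    X_test = hstack([X_test_text, test_df[relational_cols].values])

    clf = LinearSVC()
    clf.fit(X_train, train_df["label"])
    return accuracy_score(test_df["label"], clf.predict(X_test))

# Example usage with assumed column names:
# acc = run_svm(train_df, test_df,
#               ["polarity", "subjectivity", "entity_count", "credit_score"])
```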

In the second phase, we test several DL models. The DL models are built with Keras [12] and TensorFlow [1], and use one-hot encoding of the class labels. For the DL models, the reported evaluation metric is accuracy, and the optimizer is Adam [32].

The following DL classifiers are used, all of them with a categorical cross-entropy loss function and the Adam optimizer:

  • CNN—is a simplification of the models described in [31, 40]. It contains an embedding layer with dropout set to 0.2, a Convolution1D which learns how to filter groups of words, a GlobalMaxPool layer, as well as a basic hidden layer (dense, dropout set to 0.2 and relu activation function). The result is projected on a single unit output layer squashed with a softmax.

  • BasicLSTM—is a simple LSTM with dimension 300, a GlobalMaxPool layer, spatial dropout set to 0.2 and a dense layer with softmax activation. Some of the hyperparameters include a batch size of 256, 20 epochs and a learning rate of 0.001. A minimal sketch of this model is shown after this list.

  • BiLSTM [11]—a bidirectional LSTM (CuDNNLSTM) with attention, dropout and recurrent dropout set to 0.25 and a dense layer with softmax activation. We used the same hyperparameters as for the previous model.

  • GRU [25]—a GRU with attention, otherwise similar to the previous BiLSTM model. We used the same hyperparameters as for the previous model.

  • CapsNet—represents a simplified version of the models described in [16, 30]. It uses a Capsule layer instead of the GlobalMaxPool layer used in the other models described here. However, instead of using the Convolution+ReLU technique described in the article, we use a bidirectional GRU with dimension 128, ReLU activation, dropout and recurrent dropout set to 0.25. The result is projected on a single unit layer squashed with a sigmoid. Some of the hyperparameters include a batch size of 256, a learning rate of 0.001, 10 capsules with dimension 16 and 5 routings. Training took around 5 epochs.
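As referenced above, the following is a minimal sketch of the BasicLSTM classifier, assuming six output classes and the vocabulary size and sequence length used in the earlier embedding sketch; layer ordering follows our description, but exact details may differ from the trained model.

```python
# Sketch of the BasicLSTM classifier (6 classes, categorical cross-entropy, Adam).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, SpatialDropout1D, LSTM,
                                     GlobalMaxPooling1D, Dense)
from tensorflow.keras.optimizers import Adam

NUM_CLASSES = 6
MAX_WORDS, EMBEDDING_DIM, MAX_LEN = 60000, 300, 40   # assumed values

model = Sequential([
    Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=MAX_LEN),
    SpatialDropout1D(0.2),
    LSTM(300, return_sequences=True),   # dimension 300, as described above
    GlobalMaxPooling1D(),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer=Adam(learning_rate=0.001),
              metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=256, epochs=20,
#           validation_data=(X_val, y_val))
```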

Regardless of the model, some basic preprocessing steps were performed. These steps included text cleanup, stopword removal and tokenization. Labels were encoded using a LabelEncoder. The DL models also included additional steps like turning the texts into sequences and padding.
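A minimal sketch of these preprocessing steps is shown below, using Keras tokenization and scikit-learn's LabelEncoder; the variables train_texts, test_texts, train_labels and test_labels are assumed to hold the raw statements and labels, and the size constants match the earlier sketches.

```python
# Sketch of the preprocessing steps: tokenization, padding and label encoding.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

MAX_WORDS, MAX_LEN = 60000, 40                     # assumed values

# train_texts / test_texts: lists of (cleaned) statements; labels: class names.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_texts)
X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=MAX_LEN)

encoder = LabelEncoder()
y_train = to_categorical(encoder.fit_transform(train_labels))
y_test = to_categorical(encoder.transform(test_labels))
```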

All the DL models, besides the TextCNN and BasicLSTM, use GloVe embeddings (more specifically, glove.840B.300d). We did not perform additional tuning of the DL models. We noticed that the embeddings for the 60k most used English words have almost no effect on the results. The input vectors were loaded using Keras’s embeddings layer, which is defined as the first hidden layer of a network. For the DL experiments, we used pre-trained models whenever possible. Of course, fine-tuning the architectures may improve these results. We used the same batch size and learning rate for all models. Whenever possible, we also tried to train for a similar number of epochs.

Since developers quite often stack different techniques, it is sometimes difficult to understand what a certain technique adds to a model. The ablation study in Table 4 offers some insight into which techniques might help when solving multi-class classification problems focused on short texts. The simple models (TextCNN, BasicLSTM) do not include pre-trained embeddings, therefore their results can easily be explained by the current study. For the more complex models (BiLSTM, GRU with attention), the better performance is also due to the embeddings and attention mechanisms. As can easily be observed, extracting relational attributes can yield better results than using the original features, but when combining both sets of features the results improve significantly.

Table 4 Ablation study

Sometimes, for NLP models, it is enough to simply use the text and embeddings. This does not mean that the quality of the embeddings or the quality of the preprocessing are the only factors affecting the output; in fact, the words themselves play a role in the result. Such contributions can generally be understood through the explanations provided by libraries like Lime [47] or Shap [41]. After collecting random training samples covering roughly a quarter of the full training sets of the two data sets, we used Shap to produce a series of graphics with the 20 word features that are most likely to have an impact on an LSTM model. Interestingly, around half of these words were found in multiple samples in both data sets. Words that carry some weight include percent, president, government, health, taxes or president names. Figure 2 shows the word features that were rendered as important for the Liar data set. Numbers were removed from this list, as we considered that they will not necessarily repeat themselves, being more context-dependent. Even more interesting, the word percent has been the top contender in both data sets. This suggests that fake news is likely to contain not just some information about a president or government, but also various indicators. It also suggests that it is necessary to use good preprocessing techniques. It is important to note that the various methods for measuring feature contributions can lead to different results; therefore we consider the results provided by Shap or Lime more as guidelines to help our future developments.
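A rough sketch of how such a Shap summary could be produced for a trained Keras model is shown below; DeepExplainer compatibility depends on the shap and TensorFlow versions, the sample sizes are illustrative, and mapping token positions back to word features (which requires the tokenizer's index) is omitted.

```python
# Sketch: SHAP values for a trained Keras model on a random training sample.
import numpy as np
import shap

rng = np.random.default_rng(42)
background = X_train[rng.choice(len(X_train), 100, replace=False)]
sample = X_train[rng.choice(len(X_train), len(X_train) // 4, replace=False)]

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(sample)

# Summary plot of the 20 most impactful features for one output class.
shap.summary_plot(shap_values[0], sample, max_display=20)
```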

Fig. 2 SHAP explanations for a random sample from the Liar data set showing the top 20 word features with an average impact on model output magnitude

5.3 Discussion

We note that all the DL models obtain better scores than the classic models with the same features. While the current literature is mostly focused on CNNs and basic LSTMs, we observe that the attention-based models and CapsNet models performed best. For all DL models, adding our features results in an accuracy increase of up to 4.2% without any additional techniques like embeddings or attention on unbalanced classes. After adding embeddings and attention (e.g., the BiLSTM model), scores again improve significantly (by more than 10%, as can be seen by comparing the BasicLSTM score with later models that also contain these techniques).

As noted in [9], in all cases, relational features (T+R) perform better than the original features of the data set (T+A), which suggests that in some cases it might be enough to simply collect texts and build the rest of the features from metadata. While this paper represents an expansion of the previous one, this observation still stands.

We have not repeated all feature combinations from the original data sets as presented in Wang [62], Long [40] or [46], but rather took the best feature combinations found in those papers and added new combinations based on the relational features proposed by us.

The scores we obtained for SVMs, basic CNNs and LSTMs confirm their results. Using relational features (sentiment, recognized named entities, named entity links, relations) together with syntactic features (NPs), it is already possible to beat the current baselines by a comfortable margin, even without using advanced architectures. It is even possible to use only these semantic and syntactic features, instead of the original ones, and the scores will still be better than the baselines. Taking this into account, we think it would actually be preferable to use the method described in this paper to build reliable baselines.

As expected, the results for T+R and ALL were quite similar for the Politifact data set, as this data set does not contain any extra features besides the speaker. The fact that the results for Politifact were overall somewhat lower than those obtained for the Liar data set is also in line with the current literature.

We tried to minimize the number of input features. Depending on the length of the text and the number of entities involved, the number of additional features can be increased, which may lead to some increase in the overall performance. The most important thing when using our technique is to select the appropriate additional features that can lead to performance improvements. As can be seen, even selecting just a few features like relations, sentiment and entities can lead to significant improvements.

Even without any additional features, it is important to remember that the words included in the news statements themselves will carry some weight, as can easily be seen in Fig. 2.

6 Conclusions

While the literature on fake news detection is increasing at a fast pace, the accuracy of the various models varies greatly depending on the data sets and the number of classes involved. In our view, good models should be adaptive and should not require a lot of fine-tuning on specific data sets. According to our results, by also considering relational features like sentiment, named entities or facts extracted from both structured (e.g., Knowledge Graphs) and unstructured data (e.g., text), we generally obtain better scores with most classifiers. Good text preprocessing, as well as the associated embeddings, proves to be very important for any model: even in the case of a simple model with no additional features (e.g., BasicLSTM), the word features themselves carry certain weights, which can be visualized with modern libraries for explaining predictions like Shap.

Currently, most models are based on word embeddings, even though phrases and multi-word expressions perform better for longer texts. This is due to the fact that the language used in a fake news article may differ from the language used in a normal article, as it often needs to reinforce certain claims.

Since the classes of the two data sets are quite similar, another future work direction is the creation of a large super data set for fake news. Around five thousand statements that were not yet fully annotated are still available on Rashkin’s Politifact web page [46]. Several other data sets could be re-annotated using this scheme. Having a larger data set will help improve the results in the long term. This would also clarify whether the techniques presented here can work on different domains, as the two data sets we have used are rather similar and can even be merged into a single data set.

Ultimately, we think that the value of this method does not lie in the fact that it helps us build the best fake news classifier, but rather in helping us quickly build reliable baselines upon which we can improve. Taking this into account, we will continue building models, data sets and experimental pipelines for fake news in the near future.

Some future investigation areas include exploiting these relational features together with graph neural networks, like the recently developed R-GCN [52], or using a single multi-head attention architecture [59] to generate all the semantic features. In the context of Transformers, it would be important to understand the best strategies for transfer learning and knowledge distillation, as these can also be helpful for other Deep Learning models. Another interesting direction is to use semantic features for detecting fake reviews. While this is somewhat similar to fake news detection, the goal there is to detect fake accounts on websites like TripAdvisor or fake authorship.