Keywords

1 Introduction

Since the earliest times, long before the advent of computers and the web, fake news (also known as deceptive news) were transmitted through the oral tradition, in the form of rumors (face to face) or in the yellow/sensational press, either to “innocently” talk about other people lives, or to intentionally harm the reputation of other people or rival companies. Nowadays, social networks and instant messenger apps have allowed such news to reach an audience that was never imagined before the web era. Due to their appealing nature, they spread rapidly [20], influencing people behavior on several subjects, from healthy issues (e.g., by revealing miraculous medicines) to politics and economy (as in the recent Cambridge Analytica/Facebook scandalFootnote 1 and in the Brexit situationFootnote 2).

As the spread of fake news has reached a critical point, initiatives to fight back fake news have emerged. On the one hand, journalistic agencies have supported fact checking sites (e.g., Agência LupaFootnote 3 and Boatos.orgFootnote 4) and big digital companies (as FacebookFootnote 5) have attempted to block fake news and to educate users. On the other hand, academic efforts have been made by studying how such news spread, the behavior of the users that produce and read them, and language usage characteristics of fake news, in order to identify such news. This last research line - on language characteristics - has been mainly explored in the Natural Language Processing (NLP) area.

In NLP, the attempts to deal with fake news are relatively recent, both on the theoretical (e.g., [7, 12, 24]) and practical points of view [1, 11, 16, 18]. Some previous work has showed that humans perform poorly on separating true from fake news [3, 10] and that the domain may affect this [14], but others have produced promising automatic results. Despite the advances already made, the lack of available corpora may compromise the evaluation of different approaches.

To fill this important gap, in this paper we investigate the issue of fake news detection for the Portuguese language. Inspired by previous initiatives for other languages, to the best of our knowledge, we introduce the first reference corpus in this area for Portuguese. This corpus is composed of aligned true and fake news, which we analyze to uncover some of their linguistic characteristics. Then, using traditional machine learning techniques and following some of the ideas of [16, 22], we perform tests on the automatic detection of fake news, achieving good results. One of our main goals is that our corpus and methods may support future researches in the area.

The remainder of this paper is organized as follows. In Sect. 2, we briefly review the essential related work. Section 3 offers details about the newly-created corpus. In Sect. 4, we report our machine learning approaches for fake news detection. Finally, Sect. 5 concludes this paper.

2 Related Work

According to [17], there are three main types of deception in texts: (i) the ones with humor, clearly for fun, using sarcasm to produce satires and parodies; (ii) fake content, which intends to deceive people and to cause confusion; and (iii) rumors, which are non-confirmed and usually publicly accepted information. Fake content, in particular, may appear in different contexts. Fake news are a type of it, as well as fake reviews, for instance, that are tailored to harm or to promote something.

Although the recent interest growing in the area, there are several available corpora of different types of deception. In [15], the authors present three datasets related to social topics, such as opinions on abortion, death penalty, and feelings about a best friend, containing 100 deceptive and 100 truthful sentences. In [18], the authors build two datasets containing satirical and true news in four different domains (civics, science, business, and “soft” news), totalizing 240 samples. In [14], two datasets are collected on the celebrity news domain. The first one consists in emulating journalistic writing style, using Amazon Mechanical Turk, resulting in 240 fake news. The second one is collected from the web, following similar guidelines to the previous dataset (aiming to identify fake content that naturally occurs on the web), resulting in 100 fake and 100 legitimate news. Other corpora are available in English, such as the Emergent [8] and LIAR [21] corpora. For Portuguese, it is possible to find some websites that compile true and fake news for fact checking (as the ones cited in the previous section), but they often present comments about the news (and not the original texts themselves) and are not ready-to-use corpora for NLP purposes.

Some methods for detecting deceptive content have been investigated, using varied textual features, as commented by [5] and systematized in [24]. [1] studies false declarations in social networks, looking for clues of falsification (lies, contradictions and distortions), exaggeration (modifiers and superlatives), omission (lack of information, half truths) and deception (subject change, irrelevant information and misconception). [22] proposes to look at the amount of verbs and modifiers (adjectives and adverbs), complexity, pausality, uncertainty, non-immediacy, expressivity, diversity and informality features. In [14,15,16], the authors compare the performance of classifiers using n-grams/bag of words, part of speech tags and syntactic information, readability metrics and word semantic classes.

Despite the efforts already made, as far as we know, there is no public and labeled dataset of fake news written in Portuguese. The absence of representative data may seriously impact the processes of development, evaluation and comparison of automatic detection methods. In what follows, we report our efforts to build the first reliable corpus in this area for Portuguese.

3 The Fake.Br Corpus

In order to create a reliable corpus, we have collected and labeled real samples written in Portuguese. The corpus – simply called “Fake.Br Corpus” – is composed of true and fake news that were manually aligned, focusing only on Brazilian Portuguese. To the best of our knowledge, there is no other similar available corpus for this language.

Collecting texts to the corpus was not a simple task. It took some months to manually find and check available fake news in the web and, then, to semi-automatically look for corresponding true news for each fake one. The manual step was necessary to check the details of the fake news and if they were in fact fake, as we wanted to guarantee the quality and reliability of the corpus.

The alignment of true and fake news is relevant for both linguistic studies and machine learning purposes, as positive and negative instances are important for validating linguistic patterns and automatic learning, depending on the adopted approach. Besides this, the alignment is a desired characteristic of the corpus, as pointed by [17], which also suggests the following for assembling the corpus: news should be in plain text format, as this is usually more appropriate for NLP; the news must have similar sizes (usually in number of words) in order to avoid bias in learning, but, if this is not the case, size normalization (e.g., text truncation) may be carried out when necessary; specification of a time period for collecting the texts, as writing style may change in time and this may harm the corpus purposes; maintenance of pragmatic factors, e.g., the original link to the news, as such information may be useful in the future for fact checking tasks [13].

Overall, we collected 7,200 news, with exact 3,600 true and 3,600 fake news. All of them are in plain text format, with each one in a different file. We kept size homogeneity as much as we could, but some true news are longer than the fake ones. We established a 2 years time interval for the news, from January of 2016 to January of 2018, but there were cases of fake news in this time period that referred to true news of a time before this. We did not consider this as a problem and kept these news in the corpus. Finally, we saved all the links and other metadata information (such as the author, date of publication, and quantity of comments and visualizations, when possible) that was available.

We manually analyzed and collected all the available fake news (including their titles) in the corresponding time period from 4 websites: Diário do Brasil (3,338 newsFootnote 6), A Folha do Brasil (190 news), The Jornal Brasil (65 news) e Top Five TV (7 news). Finally, we filtered out those news that presented half truthsFootnote 7, keeping only the ones that were entirely fake.

The true news in the corpus were collected in a semiautomatic way. In a first step, using a web crawler, we collected news from major news agencies in Brazil, namely, G1, Folha de São Paulo and Estadão. The crawler searched in the corresponding webpages of these agencies for keywords of the fake news, which were nouns and verbs that occurred in the fake news titles and the most frequent words in the texts (ignoring stopwords). About 40,000 true news were collected this way. Then, for each fake news, we applied a lexical similarity measure (the cosine measure presented in [19]), choosing the most similar ones to the fake news, and performed a final manual verification to guarantee that the fake and true news were in fact subject-related. It is interesting to add that there were cases in that the true news explicitly denied the corresponding fake one (see, e.g., the first example in Table 1), but others were merely on the same topic (second example in Table 1).

Table 1. Examples of aligned true and fake news.

Overall, the collected news may be divided into 6 big categories regarding their main subjects: politics, TV & celebrities, society & daily news, science & technology, economy, and religion. In order to guarantee consistency and annotation quality, the texts were manually labeled with the categories. Table 2 shows the distribution of texts by category. As expected, politics is the most frequent one.

Table 2. Amount of documents per category in the Fake.Br corpus.
Table 3. Basic statistics about the Fake.Br corpus.
Table 4. Linguistic features of [22] in the Fake.Br corpus.

Table 3 shows a overall comparison of the news, including the average number of tokens and sentences, as well as several other features. It is interesting to notice some differences, e.g., spelling errors were more frequent in the fake news.

Finally, we adopted the proposal of [22] to compute other linguistic features that may serve as indications of fake content, to know: (i) pausality, which checks the frequency of pauses in a text, computed as the number of punctuation signals over the number of sentences; (ii) emotiveness, which is an indication of language expressiveness in a message [23], computed as the sum of the number of adjectives and adverbs over the sum of nouns and verbs; (iii) uncertainty, measured by the number of modal verbs and occurrences of passive voices; and (iv) non-immediacy, measured by the number of 1st and 2nd pronouns. Table 4 shows the values of these features in the corpus. The higher differences in uncertainty and non-immediacy values are due to the size difference of the texts, as these two metrics are not normalized.

In what follows, we present our experiments on fake news detection using the above corpus.

4 Experiments and Results

Motivated to create an automatic classifier of fake news, we run some tests using machine learning over the Fake.Br corpus. To guarantee a fair classification, we have normalized the size of the texts (in number of words) by truncating the longer texts to the size of their aligned counterparts.

Following [16], we run the widely used SVM technique [6] (the LinearSVC implementation in Scikit-learn, with default parameters). We tried different features of [16, 22]:

  • bag of words/unigrams (simply indicating whether each word occurred or not in the text, using boolean values), after case folding, stopwordFootnote 8 and punctuation removal, and stemming;

  • the (normalized) number of occurrences of each part of speech tag, as indicated by the NLPNet tagger [9];

  • the (normalized) number of occurrences of semantic classes, as indicated by LIWC for Brazilian Portuguese [2], which is an enriched lexicon that associates to each word one or more possible semantic classes (from a set of 64 available classes);

  • and the pausality, emotiveness, (normalized) uncertainty and (normalized) non-immediacy features.

Still following the work of [16], we used an evaluation strategy of 5-fold cross-validation. We computed the traditional precision, recall and F-measure metrics for each class, as well as general accuracy. Table 5 shows the average results that we achieved for different feature sets. The first three rows refer to features of [16], while the fourth is a combination of them; the next four rows are the features of [22], also followed by their combination; we then combine the best features of both initiatives; and, finally, we combine all the features (in the last row).

Table 5. Classification results.

Bag of words alone could (surprisingly) achieve good results (88% of F-measure, for both true and fake news), and other features (including the ones of [22]) did not help to significantly improve this. It is interesting that most of the methods performed similarly for the two classes.

We show in Table 6 the confusion matrix for the bag of words classification. One may see that there is still room for improvements. In our opinion, misclassifying (and, consequently, filtering out) true news is more harmful than not detecting some fake news (the same logic of spam detection), and this must have more attention in the future.

Table 6. Confusion matrix for bag of words classification.

We checked that the classification errors are correlated with the news categories in the following way: 11.6% of the political texts were misclassified; for TV & celebrities, 10.4%; for society & daily news, 12.3%; for science & technology, 16.1%; for economy, 18.1%; and, for religion, 20.4%. Economy and religion categories appear to be the most difficult ones, but this may have happened due to fewer learning instances that we have for such categories.

We have also run some other machine learning techniques, from different paradigms, as Naïve-Bayes, Random Forest, and Multilayer Perceptron (with the default parameters of Scikit-learn). Additionally, we tried bag of words with different minimum numbers of occurrence in the corpus, as well as other values for the occurrence of words, as their (normalized) frequency (instead of boolean 0 or 1 values). Multilayer Perceptron could achieve 90% of accuracy. Considering words with at least 3 occurrences produced the same results; from 5 to more occurrences, the results start to slightly fall. Using word frequency (instead of boolean values) did not improve the results.

One final test was to run the experiments without truncating the size of the texts. The use of full texts achieved impressive 96% of accuracy with bag of words, but this classification is probably biased, as true texts are significantly longer than the fake ones.

It is interesting that, in our case, differently from [16], part of speech tags did not produce the best results. Such difference is probably explained by the dataset. While our dataset is “spontaneous” (to the extent that such nomenclature makes sense for fake news), collected from the web, [16] used a dataset of a different nature (in fact, the authors used sentences), produced by hired people to the task.

Overall, the achieved results were above our expectations. One factor that may help explaining such good results is that we have filtered out news with half truth, which might make things more complex (and equally interesting). This remains for future work, as we comment below.

5 Conclusions

To the best of our knowledge, we have presented the first reference corpus for fake news detection for Portuguese - the Fake.Br corpus. More than this, we have run some experiments, following some well known attempts in the area, and produced good results, considering the apparent difficulty of the task. We hope that our corpus may foster research in the area and that the methods we tested instigate new ones in the future.

For future work, we hope to identify other features that may help distinguishing the remaining misclassified examples, as well as to test other classification techniques, using, e.g., distributional representations and methods. We also aim at dealing with other deception types, such as satiric texts and fake opinion reviews, and with more complex cases, as the news including half truth.

More information about this work and the related tools and resources may be found at the OPINANDO project websiteFootnote 9.