
1 Introduction

The term “fake news”, defined as “false, often sensational, information disseminated under the guise of news reporting” [6], gained so much attention that it was named the Collins Word of the Year 2017 after its usage in the Collins Corpus rose by an unprecedented 365% [6]. Even though news articles written to mislead readers are by no means new [24], there seems to be a relationship between the expression “fake news” itself and the 2016 presidential election in the United States of America: Davies [8], using data from the NOW Corpus, shows that “there is almost no mention of ‘fake news’ until the first week of November [2016] (...) and then it explodes in Nov 11–20, and has stayed very high since then”. The author adds that the reason “why people all of the sudden started talking about something that had really not been mentioned much at all until that time” was “the US elections, which were held on November 9, 2016”.

The sudden popularization of an already existing term (that is, not a neologism) in a language poses interesting questions regarding how concepts around this term are perceived by the speakers of that language. We might ask, for instance: what changed (if anything) in the conceptualization of this expression after its boom? Was there any kind of shift in the meaning of this expression when it became widely employed? If so, was this shift uniform across different varieties of the language? These are some of the issues of interest in lexicology, the area of linguistics focused on the study of the lexicon, which has been fostered by advances in corpus linguistics, a field concerned with the use of large real-world corpora for the study of language.

The goal of this article is to provide a closer look at how newspapers and magazines across the world shape the term “fake news” – which is a relevant social phenomenon linked to misinformation and manipulation, and that has been facilitated by the rise of the Internet and online social media in recent years. We investigate the perception and the conceptualization of this expression through the quantitative analysis of a large corpus of news published in 20 countries from 2010 to 2018, thus making it possible to examine not only the diachronic development of this term, but also its synchronic usage in different parts of the English-speaking world. We complement our investigation with data collected from online search queries that help to measure how the public interest in the expression “fake news” and in the concepts around it changed over time in different places.

1.1 Related Work

In 2010, Michel et al. [17] coined the term culturomics to denote a method for studying human behavior, cultural trends and language change through the quantitative diachronic analysis of texts, including digitized books provided by the Google Books project. Several studies have used this method to investigate topics such as the dynamics of birth and death of words [20], semantic change [12], emotions in literary texts [1] and general characteristics of modern societies [22]. However, many criticisms have been raised regarding the limitations of inferences derived from the analysis of Google Books, due to factors that range from optical character recognition errors and an overabundance of scientific literature [19] to the lack of metadata in the corpus [13].

Leetaru [15] proposes a somewhat complementary approach that he calls culturomics 2.0, which uses historical news data instead of books and can, according to the author, “yield intriguing new understandings of human society”. In the same vein, Flaounas et al. analyze the European mediasphere [11] and the writing style, gender bias and popularity of particular topics [10] in large corpora of news articles. Lansdall-Welfare et al. [14], also using a large dataset of media reports, observe a change of framing and sentiment associated with nuclear power after the Fukushima nuclear disaster. They detect effects on attention, sentiment and conceptual associations, as well as in the network of actors and actions linked to nuclear power following the accident.

In this investigation, we combine many of the methods employed in the related work mentioned above. However, to the best of our knowledge, this is the first paper that uses these methods to examine in detail how the term “fake news” is reported by news media in different parts of the world and in two distinct periods in the history of this expression.

1.2 Research Question

Our main research question is: was the rise of public interest in the term “fake news” accompanied by changes in its conceptualization and in how it is perceived? Based on sociolexicological theories, which posit a considerable relationship between linguistic and extralinguistic factors with regard to the vocabulary of a language [5, 16], our hypothesis is that the change of interest in the phenomenon of fake news might have altered the general usage of the expression referring to it. Indeed, the results of our investigations indicate, in general, a positive answer to our research question. Among other findings, we show modifications in the related vocabulary and in the entities mentioned alongside the term “fake news”, in addition to changes in the topics associated with this concept and in the overall contextual polarity of the pieces of text around this expression in media articles after 2016.

This article is structured as follows: in the next section, we present the process of acquisition and preparation of the main data source used in our investigations; in Sect. 3, we describe our analyses, present the results and discuss their implications; finally, in Sect. 4, we summarize the outcomes of our study and conclude by discussing possible directions for future work.

2 Data Source

The main dataset used in this study comes from the Corpus of News on the Web (NOW Corpus), which contains articles from online newspapers and magazines written in English and based in 20 different countries from 2010 to the present time [7]. This corpus is available for download and online exploration at https://corpus.byu.edu/now/ and, according to its authors, “is [as of April 2018] by far the largest corpus (of any language) that is available in full-text format”. Our analyses are based on the version of the corpus available in April 2018, which contains around six billion words.

In this dataset, we searched for all the occurrences of the term “fake news”. For each occurrence, the online version of the corpus provides a concordance line, or context – that is, a piece of text of approximately 20–30 words around (before and after) the searched term. For example, for a certain news article published on July 25, 2017 in the Kenyan newspaper Daily Nation, the context around the term “fake news” is: (...) of social media and a study that said 90 per cent of Kenyans had encountered fake news. WhatsApp and Facebook are the two leading sources of misinformation, often (...). All of our analyses were performed on these contexts, since words immediately surrounding a key term are more relevant to the conceptualization of this term than words that appear further away from it in the same text. Wynne [27] adds that the main reason for using keywords in context (KWICs) in corpus linguistics is that “interesting insights into the structure and usage of a language can be obtained by looking at words in real texts and seeing what patterns of lexis, grammar and meaning surround them”.

The total number of occurrences of “fake news” extracted from the NOW Corpus on April 30, 2018 is 41,124. These occurrences come from news articles published in all 20 countries represented in the corpus, which we grouped into six regions based on their geographical locations (Africa, British Isles, Indian subcontinent, Oceania, Southeast Asia and The Americas), since it has been observed that offline and online news outlets tend to give preference to local and national news, to domesticate news about other countries and to reflect imbalanced information flows between the developed and the developing worlds [2].

These occurrences also cover every year in the corpus (from 2010 to 2018). Due to the previously observed increase in the usage of the term “fake news” during and after the 2016 presidential election in the United States of America (mentioned in Sect. 1), we divided the occurrences into two periods: before and after the 2016 US election. The election was held in November, but we set the boundary between these periods at the end of the first half of 2016 (June 30) in order to include the political campaign in the period after the US election. Table 1 shows the number of contexts containing the term “fake news” in our dataset according to the geographical origin of the corresponding news media and the year and period of publication of the news article.
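The split into periods can be illustrated with a minimal sketch. The records below are made up for the example, and treating June 30, 2016 itself as part of the "before" period is an assumption, since the text does not say on which side the boundary day falls.

```python
from collections import Counter
from datetime import date

# Boundary taken from the text; counting June 30 itself as "before"
# is an assumption made for this sketch.
CUTOFF = date(2016, 6, 30)


def period_of(published: date) -> str:
    """Return 'before' or 'after' relative to the June 30, 2016 boundary."""
    return "before" if published <= CUTOFF else "after"


# Hypothetical (region, publication date) records, for illustration only.
records = [
    ("Africa", date(2017, 7, 25)),
    ("The Americas", date(2015, 3, 2)),
    ("The Americas", date(2016, 11, 12)),
]

# Number of contexts per (region, period), the kind of tally a table
# such as Table 1 would report.
counts = Counter((region, period_of(day)) for region, day in records)
```

Because the boundary is a single constant, moving it (for example, to the election day) would only require changing `CUTOFF`.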

Table 1. Number of contexts containing the term “fake news” in our dataset according to (a) the geographical origin of the corresponding news media and (b) the year and period (before or after the 2016 presidential election in the United States of America) of publication of the news article.

3 Analyses and Results

In this section, we present and examine the outcomes of our investigations. Each analysis is introduced by a description of how it contributes to answering our research question, followed by the methodology employed and, finally, by a presentation and discussion of the results.

3.1 Web Search Behavior

Before analyzing the data obtained from the NOW Corpus, we investigate whether it is possible to observe a change in Web search behavior regarding the expression “fake news” that corresponds to the sharp increase in its use during and after the 2016 presidential election in the United States of America mentioned in Sect. 1.

Data obtained from Google Trends, an online tool that indicates the frequency of particular terms in the total volume of searches in the Google Search engine, show that public interest in the term “fake news” was approximately constant from 2010 until 2016, when it increased suddenly and sharply, as indicated by Fig. 1. These data also show that, in the period before the 2016 US presidential election, most of the countries with the highest proportions of searches for the term “fake news” were from the Eastern world. However, after the US election, the proportion of searches for this expression in Western countries increased considerably, especially in Europe. The 10 countries with the highest proportion of searches for the term “fake news” in each period are listed in Table 2.

Fig. 1. Normalized volume of searches for the expression “fake news” on Google Search from 2010 to 2018.

Table 2. Countries with the highest proportion of searches for “fake news” on Google Search before and after the US election.
Table 3. Most frequent search terms related to “fake news” on Google Search before and after the US election.

A closer look at the data from Google Trends also reveals that the great increase in public interest in the expression “fake news” coincided with a change in the focus of Web searches. Table 3 shows the five most frequent search terms employed by users who also searched for “fake news” in the periods before and after the US election. We observe that, before the US election, searches for “fake news” were generic and concerned topics related to the media industry itself, like “article”, “stories” and “report”; after the US election, however, these searches became more focused on political affairs and on the spread of fake news, mentioning entities like the president of the United States of America elected in 2016 (Donald Trump), the television news channel CNN (which devotes a large share of its coverage to US politics) and the social media platform Facebook (considered a major source of fake news on the Internet).

In this section, we used data obtained from the Google Trends tool. From the next section on, however, all of our analyses use the data described in Sect. 2, obtained from the NOW Corpus.

3.2 Co-occurring Named Entities

The analysis of named entities – that is, real-world entities such as persons, organizations and locations that can be denoted with proper names [26] – co-occurring with certain terms is an interesting way to contextualize these concepts. In our case, by identifying which entities are linked to the expression “fake news” in different periods of time and in different parts of the world, we are able to observe relationships of “who and where” in the recent history of our key-term.

In our dataset of news articles, we employed a simple method to identify named entities: we made use of the fact that newspapers and magazines consistently capitalize nouns representing named entities and counted all the words that appear capitalized in the contexts; then, we manually analyzed the most frequent capitalized words in each subdivision of the corpus (i.e. for each region and period) to remove words that do not denote named entities (such as “I”, “SMS”, “March” and words capitalized for other reasons) and to merge entities represented by more than one word (e.g. “Donald” and “Trump”).
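The capitalization heuristic described above can be sketched as follows. The exclusion list and the merge table are small hypothetical stand-ins for the manual review step, and counting each entity at most once per context is an assumption of this sketch (it keeps “Donald Trump” from being counted twice in one context).

```python
import re
from collections import Counter

# Words the manual pass would discard; this short list is illustrative only.
NOT_ENTITIES = {"I", "SMS", "March"}
# Variants merged into a single entity (hypothetical mapping for the example).
MERGE = {"Donald": "Donald Trump", "Trump": "Donald Trump"}


def capitalized_counts(contexts):
    """Count the contexts in which each capitalized candidate appears."""
    counts = Counter()
    for text in contexts:
        # Candidate entities: words starting with an uppercase letter.
        words = re.findall(r"\b[A-Z][A-Za-z]*\b", text)
        # Merge variants, drop known non-entities, count once per context.
        entities = {MERGE.get(w, w) for w in words if w not in NOT_ENTITIES}
        counts.update(entities)
    return counts


contexts = [
    "WhatsApp and Facebook are the two leading sources of misinformation",
    "when Donald Trump accused CNN of spreading fake news",
    "during March I saw Trump mention fake news on Facebook",
]
counts = capitalized_counts(contexts)
```

Sentence-initial words would also be picked up by this pattern, which is one reason the most frequent candidates still need the manual inspection the text describes.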

Table 4 shows the five most mentioned named entities in the periods before and after the 2016 US presidential election, regardless of the geographical origin of the corresponding news media. Before the US election, it is possible to observe a strong connection between humor and fake news: with the exception of Facebook, all the most mentioned named entities are related to satirical TV shows and hosts based in the United States of America. In the period after the US election, on the other hand, there is a movement towards politically related entities (Donald Trump), traditional media sources (CNN) and social networking services (Facebook and Twitter). It is interesting to notice that this shift matches the shift of interest towards political affairs and the spread of fake news already observed in Web searches (Table 3).

Table 4. Most mentioned entities in the periods before and after the US election.

When we make this same diachronic comparison, but now considering the geographical origin of the corresponding news media, we observe a noteworthy phenomenon: the global standardization of the named entities related to fake news. Table 5 indicates that local entities are more relevant in the period before the US election, when names of geographical regions (Ekiti), countries (Nigeria, China), local political parties (PDP – People’s Democratic Party of Nigeria, BJP – Bharatiya Janata Party of India) and local personalities (Shahid Afridi, King Salman, Korina Sanchez) appear frequently among the most mentioned entities. In the contexts after the US election, however, Donald Trump, Facebook and US are the three most mentioned entities for nearly all the regions – with the sole exception of The Americas, where CNN replaces US.

Table 5. Most mentioned entities in the periods before and after the US election, considering the geographical origin of the corresponding news media.

3.3 Semantic Fields of the Surrounding Vocabulary

Besides the investigation of the named entities that accompany a given key-term, the analysis of the general vocabulary co-occurring with it is also valuable. In our case, one way of performing such an analysis is to observe the semantic fields (i.e. groups to which semantically related items belong) of the words co-occurring with the expression “fake news” in our contexts.

To perform this task, we first lemmatized all the words in the contexts by employing the WordNet Lemmatizer provided by the Natural Language Toolkit [3], using verb as the part-of-speech argument for the lemmatization method. By applying this lemmatization, we grouped together the inflected forms of the words so that they could be analyzed as single items based on their dictionary forms (lemmas).

Then, we used Empath [9], “a tool for analyzing text across lexical categories”, to classify the lemmatized words according to categories that represent different semantic fields, such as diverse topics and emotions. For every context, we calculated the percentage of words belonging to each semantic field represented by an Empath category. Due to the high number of categories predefined by Empath (194 in total), we selected eight that showed interesting results and are relevant for our discussion: government, internet, journalism, leader, negative emotion, politics, social media and technology. By way of example, the category internet includes 79 words such as homepage, download and hacker, while the category journalism contains 69 words, including report, article and newspaper.
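As an illustration of the per-context percentage, the sketch below uses a tiny hand-written lexicon in place of Empath. The two categories contain only the example words quoted above, so this is a stand-in for the real category lists, not a reproduction of Empath's behavior.

```python
# Tiny stand-in lexicon built from the example words quoted in the text;
# Empath's real categories are far larger (79 and 69 words, respectively).
LEXICON = {
    "internet": {"homepage", "download", "hacker"},
    "journalism": {"report", "article", "newspaper"},
}


def category_percentages(lemmas, lexicon=LEXICON):
    """Percentage of the words of one context that fall in each category."""
    total = len(lemmas)
    if total == 0:
        return {name: 0.0 for name in lexicon}
    return {
        name: 100.0 * sum(1 for word in lemmas if word in words) / total
        for name, words in lexicon.items()
    }


# A hypothetical, already lemmatized context of eight words.
context = ["the", "newspaper", "report", "cite", "a", "hacker", "download", "claim"]
shares = category_percentages(context)
```

Averaging these per-context percentages over all contexts of a region and period would give values comparable to the averages discussed next.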

Figure 2 displays the average percentage of words in these categories for all six regions considered here, both before and after the 2016 US election. The graphs reveal interesting differences and trends in how often words from the semantic fields considered are used. We highlight the marked increase in the use of words from the related categories government, leader and politics (and also from the supposedly unrelated category negative emotion) and the marked decrease in the use of words from the categories internet, journalism and technology (but not social media) in almost all regions after the US election.

Fig. 2. Percentage of words in each semantic field represented by an Empath category. Error bars indicate standard errors.

We hypothesize that these results indicate a change in the focus of the news considered here: before the 2016 US election, the term “fake news” was probably mentioned more often in contexts that focused on the environment where fake news occurs (Internet, newspapers etc.), sometimes even in meta-discussions on the very topic of fake news and its dissemination; during and after the US election, however, the discussion seems to have migrated to themes closer to the content of the fake news itself (politics, elections etc.).

3.4 Co-occurrence Networks

Another possible method of investigating the vocabulary accompanying a key-term in a corpus is through the observation of co-occurrence networks. In our case, this method enables us to visually analyze the words that co-occur with the expression “fake news” in the contexts that we are considering. Here, we compare co-occurrence networks between the periods before and after the 2016 US election, regardless of the geographical origin of the media outlets. These networks are represented here by graphs, in which each node corresponds to a word and each edge corresponds to an association between two given words.

Fig. 3. Co-occurrence networks of words before and after the 2016 US election.

To build our graphs of co-occurring words, we followed the steps below. First, we counted the number of contexts in which two given words co-occur, so that we have the volume of co-occurrences for each pair of words. Then, we normalized these values in order to work with percentages instead of working with the absolute number of co-occurrences. To improve the visualization of the graphs, we filtered out minor relationships and highlighted the strongest ones by removing nodes and edges when the co-occurrence percentage was lower than 0.8% (for the period before the US election) and 0.5% (for the period after the US election), and by plotting the width of the edges proportionally to the strength of association between two given words. Finally, we obtained the maximum spanning trees of both graphs, which are presented in Fig. 3.
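The steps above can be sketched in plain Python. Two points are assumptions of this sketch rather than details given in the text: the percentages are computed relative to the number of contexts, and the spanning tree is obtained with Kruskal's algorithm. Since thresholding may disconnect the graph, the function returns a maximum spanning forest (one tree per connected component).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_percentages(contexts):
    """Map each word pair to the percentage of contexts containing both words."""
    pair_counts = Counter()
    for tokens in contexts:
        # Each unordered pair is counted at most once per context.
        pair_counts.update(combinations(sorted(set(tokens)), 2))
    return {pair: 100.0 * n / len(contexts) for pair, n in pair_counts.items()}


def maximum_spanning_forest(weights, threshold):
    """Kruskal's algorithm over edges at or above the threshold, heaviest first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    kept = []
    edges = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    for (a, b), weight in edges:
        if weight < threshold:
            continue  # filter out minor relationships
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge creates no cycle
            parent[root_a] = root_b
            kept.append((a, b, weight))
    return kept


# Hypothetical tokenized contexts and threshold, for illustration only.
contexts = [
    ["trump", "cnn", "election"],
    ["trump", "cnn"],
    ["trump", "election"],
    ["facebook", "twitter"],
]
weights = cooccurrence_percentages(contexts)
forest = maximum_spanning_forest(weights, threshold=20.0)
```

In this toy input, the edge between “cnn” and “election” is dropped because both words are already connected through “trump” by heavier edges, which is how the spanning tree keeps only the strongest associations.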

This method of investigation enables us to make several qualitative observations. Comparing the two graphs, we notice clear changes in the relationships between the words co-occurring with “fake news” in our contexts. For instance, before the US election, the main cluster contains words related to the news industry itself (“article”, “stories”, “hoax”) and to the Internet (“website”, “twitter”, “facebook”, “account”). Corroborating previous findings (Sect. 3.2), there is also another cluster containing words referring to satirical TV shows and hosts (“daily”, “show”, “colbert”, “oliver”, “stewart”). In the graph representing the period after the 2016 US election, we start to observe terms linked to specific events, mainly the US election itself. Interestingly, some terms that surround meta-discussions about fake news also appear, highlighting relevant related concepts: fact check, hate speech, post truth and alternative facts.

3.5 Topics Addressed in the Contexts

In addition to studying the vocabulary around a key-term, it is also possible to find the main topics addressed in the pieces of text surrounding the occurrences of the expression “fake news” in our corpus.

For this task, we used latent Dirichlet allocation (LDA) [4], a method for automatically discovering the topics discussed in texts. First, we lowercased and tokenized all the words in the dataset. Then, we removed stop words using the list provided by the Natural Language Toolkit, after having added the words “fake” and “news” to this list, since they appear in all contexts. Finally, we ran the LDA algorithm using gensim [21], a Python library for topic modeling. We used the topic coherence score [18] to choose the optimal number of topics \(k\) to be returned by the algorithm. Thus, for each region, we ran the LDA algorithm starting with \(k\) = 2 and ending with \(k\) = 20, and chose the best LDA model, that is, the one with the highest topic coherence score. All regions had, respectively, \(k\) = 2 and \(k\) = 14 for the periods before and after the US election, except The Americas, which had \(k\) = 8 and \(k\) = 14. For each region, the LDA returned these \(k\) topics containing words ordered by importance in the corresponding context, filtered both by region and topic. We then selected the most important topic as the representative of each region and period. Table 6 shows the main topic for each region in both periods (before and after the US election) and the ten top-ranked words produced by our LDA model.
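The following sketch outlines how such a pipeline might look with gensim. Several details are assumptions, not the settings used in this study: the short stop list stands in for the NLTK list, the tokenizer is a simple regular expression, the coherence measure is `u_mass` (the text only states that a topic coherence score was used), and the random seed is arbitrary.

```python
import re

# Short illustrative stop list; the study used the NLTK list instead.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "on"}
STOP_WORDS |= {"fake", "news"}  # present in every context, so uninformative


def preprocess(context):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", context.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def best_lda_model(texts, k_min=2, k_max=20, seed=0):
    """Fit one LDA model per k and return (score, k, model) for the best one."""
    # Imported lazily so that preprocess() can be used without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    best = None
    for k in range(k_min, k_max + 1):
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=seed)
        # u_mass is an assumed choice of coherence measure (higher is better).
        score = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
        if best is None or score > best[0]:
            best = (score, k, lda)
    return best


tokens = preprocess("Fake news on Facebook and the Russian election")
```

In use, `best_lda_model` would be called once per region and period on the preprocessed contexts, for example `best_lda_model([preprocess(c) for c in contexts])`, where `contexts` is a list of context strings.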

Unlike in the period before the US election, the words related to each topic inferred by LDA are cohesive with one another in the period after the US election. We observe, for all regions, a relevant frequency of words related to politics and social networks in the period after the US election. More specifically, the words “russian”, “russia”, “election” and “facebook” rank high in this period. In the period before the US election, most of the top words are related to the Internet and to the spread of misinformation, in all regions.

Table 6. Main topic for each region. Inside each topic, ten words are presented in order of importance according to the LDA output.

3.6 Polarity

Our final analysis explores a different feature of the contexts in which the expression “fake news” appears in our dataset: their polarity, that is, whether the opinion expressed in the texts is mostly positive, negative or neutral. Here, we performed sentiment analysis [23] on each of the contexts in our dataset using SentiStrength [25], a tool able to estimate the strength of positive and negative sentiment in short texts. Given a piece of text, this tool returns a score that varies from −4 (negative sentiment) to +4 (positive sentiment).

We are interested in analyzing how polarity changes over time and in different regions when it comes to “fake news” and how this can be perceived in our dataset. Figure 4 depicts the average polarity of the contexts in each region before and after the 2016 US presidential election. We first observe a clear dominance of negative polarities in all periods and regions, indicating that the term “fake news” is often related to negative words [28] and sentiments – which is not surprising, since the concept of fake news seems to be strongly associated with negative concepts, like misinformation, manipulation and the spread of untrue facts.
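The aggregation behind a plot of this kind can be sketched as below: for each region and period, the mean polarity and the standard error of the mean are computed from the per-context scores. The scores are hypothetical values on the −4 to +4 scale, not data from this study.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(scores):
    """Return the mean and the standard error of the mean of the scores."""
    # stdev() is the sample standard deviation and needs at least two scores.
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Hypothetical per-context polarity scores, keyed by (region, period).
scores = {
    ("Africa", "before"): [-1, 0, -2, 1],
    ("Africa", "after"): [-2, -1, -3, -2],
}
summary = {key: mean_and_sem(values) for key, values in scores.items()}
```

Comparing the two means relative to their standard errors gives a first indication of whether a difference between periods is larger than the sampling noise.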

Fig. 4. Average polarity of the contexts in each region before and after the US election (bars indicate the standard error of the mean).

In this figure, we also observe that, in general, the polarity expressed in the contexts in the period after the US election is more negative than before. The only exception is the British Isles, where the difference in polarity between the periods is not statistically significant. This result seems to corroborate findings presented in previous sections of this article, which showed that, before the 2016 US election, the term “fake news” was often linked to satirical TV shows and more general topics, while in the period during and after the election the topics became more related to the spread of false information in the context of political activity.

3.7 Summary of Results

The most relevant outcomes of the analyses presented in this section can be summarized and integrated as follows:

  • the interest in the term “fake news” suddenly increased after the 2016 US election, as indicated by the rise of news about it and of Google Search queries for this expression (Sect. 3.1);

  • this growth was accompanied by a change of framing around the term “fake news” – from, for instance, topics regarding the media industry itself to those related to political affairs (Sects. 3.1, 3.3, 3.4 and 3.5);

  • the named entities linked to the expression “fake news” not only changed towards political topics, but also underwent global standardization after the US election (Sect. 3.2);

  • the negativity of the news containing the term “fake news” increased after the US election (Sect. 3.6).

All these results suggest that, as hypothesized in Sect. 1, the rise of public interest in the term “fake news” came with changes in its conceptualization and in the perception about it.

4 Concluding Remarks

Due to the increased role of the Internet in modern societies, topics regarding misinformation and manipulation in online environments seem to be subject to progressively more public debate and interest, including from the traditional media. Understanding how these topics are viewed through the eyes of opinion leaders is crucial to comprehending how public opinion about them is currently being shaped.

In this article, we present a quantitative analysis of the perception and conceptualization of the term “fake news” in a corpus of news articles published from 2010 to 2018 in 20 countries. We investigate how media sources have been reporting topics related to fake news and whether the rise of public interest in this very expression during and after the 2016 presidential election in the United States of America was accompanied by changes of perception and shifts in sentiment about it. We observed changes in the vocabulary and in the entities mentioned around the term “fake news” in our corpus, in the topics related to this concept and in the polarity of the texts around it after 2016, as well as in the Web search behavior of Google Search users interested in this concept.

We are also interested in understanding whether the term “fake news” is framed differently across the globe – and, if so, what these differences are. The existence of such variations may result in different shifts in the meanings and in the sentiments around these concepts in various regions of the world, which justifies this study as a way to understand more clearly how public opinion is being steered in the current context in different countries of the English-speaking world.

In this paper, we analyzed the usage of the term “fake news” from a diachronic perspective, but only considered two historical moments: before and after a key event in the history of this expression (the 2016 US presidential election). In the future, we plan to consider a wider range of periods, particularly to understand whether (and, if so, when) the conceptualization of “fake news” changed once again. We also intend to add analyses using data from other relevant sources, including Twitter posts and Wikipedia edits.