1 Introduction

The last decade has witnessed the rapid growth and success of online social networks, which have disrupted traditional media by fundamentally changing how, by whom, when, and where the latest news stories are distributed. Unlike traditional newspapers or magazines, many open and always-on social media platforms allow anyone to spread any information at any time without real-world authentication or accountability, which has resulted in the unprecedented circulation of fake news, social spam, and misinformation [1,2,3,4,5].

Driven by political or financial incentives, the creators of fake news generate and submit well-crafted news stories to online social media, and subsequently recruit social bots or paid spammers to push the news to a certain level of popularity [6,7,8]. The recommendation and ranking algorithms on social media, if they fail to detect such fake news immediately, are likely to surface it to many other innocent users who are interested in similar topics and content, thus leading to a viral spreading process on social media. This rising volume of social spam [9], clickbait [10] and fake news [1], mixed with real news and credible content, makes it difficult for regular Internet users to distinguish credible content from fake content.

Towards effectively detecting, characterizing, and modeling Internet fake news on online social media [11], this paper proposes a new framework which systematically characterizes the Web sites and reputations of the publishers of fake and real news articles, analyzes the similarity and dissimilarity of fake and real news on the most important terms of the news articles via tf-idf and LDA topic modeling, and explores document similarity analysis via Jaccard similarity measures between fake, real and hybrid news articles.

The contributions of this paper are three-fold:

  • We systematically characterize the Web sites and reputations of the publishers of the fake and real news articles on their registration patterns, Web site ages, and the probabilities of news disappearance from the Internet.

  • We analyze the similarity and dissimilarity of the fake and real news on the most important terms of the news articles via term frequency - inverse document frequency (tf-idf) and Latent Dirichlet allocation (LDA) topic modeling.

  • We explore document similarity between fake, real or hybrid news articles via Jaccard similarity to distinguish, classify and predict fake and real news.

The remainder of this paper is organized as follows. Section 2 describes the background of the fake news problem on online social media and the data-sets used in this study. Section 3 characterizes the Web sites and reputations of the publishers of the fake and real news articles, while Sect. 4 focuses on analyzing the similarity and dissimilarity of the fake and real news on the most important terms of the news articles. In Sect. 5, we show the promising direction of leveraging document similarity to distinguish fake and real news. Section 6 summarizes related work in detecting and analyzing fake news and highlights the differences between this effort and existing studies. Finally, Sect. 7 concludes this paper and outlines our future work.

2 Background and Data-Sets

As online social media such as Facebook and Twitter continue to play a central role in disseminating news articles to billions of Internet users, fake and real news share the same distribution channels and diffusion networks. The creators of fake news, motivated by a variety of reasons including financial benefits and political campaigns, are very innovative in writing news stories and attractive titles that convince thousands of regular people to read, like, comment on, and forward them. Such high engagement in a short time period can make the news go viral with few challenges or doubts regarding its authenticity, and with little verification or fact checking.

In this paper we explore the research data shared by a recent study [12]. The data consists of three data-sets, each of which includes hundreds of fake and real news stories over a 3-month time-span from dozens of fake news sites as well as well-respected major news outlets including New York Times, Washington Post, NBC News, USA Today, and Wall Street Journal. These three data-sets are referred to as dataset 1, dataset 2, and dataset 3 throughout the rest of this paper. For each fake or real news article, the data includes the title of the story, the Web URL of the news story, the publisher of the news, and the total engagement, measured by the total number of shares, likes, comments, and other reactions that the news received on Facebook.

3 Characterizing Fake and Real News

In this section, we study a variety of subjective features of the publishers of real and fake news, such as the registration behaviors of the publishers' Web sites, the site ages of the publishers, and the probability of news disappearance from the Internet.

3.1 Web Site Registration Behavior of the Publishers

Publishers of real or fake news typically have to go through the domain registration process, which allows an anonymous domain registrar to serve as a proxy for publishers who prefer to hide their identities. If a publisher chooses to remain anonymous, the Internet whois database will show the proxy, e.g., Domains By Proxy, LLC, as the registration organization. Most popular and well-known newspapers typically choose to use the real organization name during the registration process. For example, the registration organization for wsj.com is Dow Jones & Company, Inc, which owns the Wall Street Journal newspaper.

Our findings show that the majority of the fake news publishers register their Web sites via proxy services to remain anonymous, while nearly all the real news publishers use their real identities during the domain registration process. As shown in Table 1, over 78% of the domains publishing fake news are registered via proxy services to hide the true identities of the domain owners, while less than 2% of the domains publishing real news are registered in such a fashion. Thus we believe such patterns can become a powerful feature for machine learning models to distinguish fake and real news.
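As a minimal sketch of how such a registration feature could be derived, the snippet below flags a whois registrant organization as a proxy when it contains one of a few marker substrings. The marker list and the function names are illustrative assumptions rather than part of the study, and the registrant strings are assumed to have been collected from whois beforehand.

```python
# Hypothetical marker substrings; a real list would need manual curation.
PROXY_MARKERS = ("domains by proxy", "privacy", "proxy")


def is_proxy_registration(registrant_org: str) -> bool:
    """Return True if the whois registrant organization looks like a proxy."""
    org = registrant_org.strip().lower()
    return any(marker in org for marker in PROXY_MARKERS)


def proxy_share(registrant_orgs) -> float:
    """Fraction of the given registrant organizations flagged as proxies."""
    orgs = list(registrant_orgs)
    if not orgs:
        return 0.0
    flagged = sum(is_proxy_registration(org) for org in orgs)
    return flagged / len(orgs)
```

Applied separately to the fake and real news domains, `proxy_share` would give the kind of per-corpus percentage reported in Table 1.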

Table 1. Domain registration with proxy service for hiding domain owners' identity
Fig. 1. The Web site age distribution of the fake news publishers vs. real news publishers

3.2 Internet Site Ages of the Publishers

Besides the domain registration behavior, we also study the ages of the domains for the fake and real news in the three data-sets. For each data-set, we characterize the domain age distribution for the fake and real news, respectively. As illustrated in Fig. 1, all data-sets exhibit consistent observations, which reveal very short domain ages for fake news and long domain ages for real news. This result is not surprising in that the credible newspapers registered their domains in the early 1990s when the Internet and the Web started to attract attention, while the publishers driven by fake news often register their sites temporarily for the purpose of spreading fake news over a very short period of time.
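The domain age itself is simple to compute once a creation date is available from the whois record. The following sketch assumes that the creation date has already been parsed into a `date` object; the function name and the use of 365.25 days per year are our own choices for illustration.

```python
from datetime import date


def domain_age_years(created: date, observed: date) -> float:
    """Age of a domain in years between its creation and observation dates."""
    if observed < created:
        raise ValueError("observation date precedes the domain creation date")
    # 365.25 approximates the average year length including leap years.
    return (observed - created).days / 365.25
```

Computing this value for every publisher domain and grouping by the fake or real label would yield distributions of the kind compared in Fig. 1.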

3.3 Probability of News Disappearance

Credible news agencies tend to maintain high quality sites that keep the published news available for a long time. However, fake news sites often take the news offline after achieving the short-term goal of misleading the readers. Our analysis of the fake and real news corpus confirms this common practice.

Table 2. Page not found due to news disappearing.

As shown in Table 2, the three data-sets of the fake news corpus exhibit consistent news disappearance patterns, while none of the news in the real news corpus has been taken offline. Thus we believe news disappearance could become a valuable feature for differentiating or modeling fake and real news.
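One possible way to quantify news disappearance is sketched below. It assumes that each article URL has already been requested once and its HTTP status code recorded (that step is not shown), and it treats the codes 404 and 410 as "page not found". Both the function name and that choice of codes are assumptions made for illustration.

```python
# Status codes treated as "the news page has disappeared" (an assumption).
GONE_STATUS_CODES = {404, 410}


def disappearance_rate(status_by_url) -> float:
    """Fraction of URLs whose recorded HTTP status marks a missing page."""
    if not status_by_url:
        return 0.0
    gone = sum(1 for status in status_by_url.values()
               if status in GONE_STATUS_CODES)
    return gone / len(status_by_url)
```

The input is a mapping from URL to status code, so the same function can be applied to each data-set and each label independently.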

In summary, our preliminary results on these popular fake and real news reveal substantial differences between fake and real news in the quality of the news pages, as well as in the reputations of the publishing domains, as reflected by the domain ages and the usage of registration proxies.

4 Topics and Content of Fake and Real News

In this section, we first identify the most important topics of each fake or real news article via tf-idf analysis [13]. Subsequently, we explore the probabilistic LDA topic model to understand the difference or similarity of topics between labeled fake and real news.

4.1 Important Topics Identifications via tf-idf Analysis

tf-idf (term frequency - inverse document frequency) is a widely used statistical technique for extracting the most important term, t, or word, w, of a document in a document corpus, D. The tf-idf value of a term t in the document d, tf-idf(t, d), is the product of the term frequency tf(t, d) and the inverse document frequency idf(t, D), i.e.,

$$\begin{aligned} \text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, D). \end{aligned}$$
(1)
Table 3. Top terms ranked by tf-idf values in fake, real and hybrid news corpus

Table 3 shows that the most important terms extracted from the fake, real and hybrid news corpus via tf-idf analysis are quite similar. Relying on these terms alone is therefore ineffective for detecting fake news.
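To make the tf-idf computation concrete, the sketch below scores every term of a small tokenized corpus. The paper does not state which tf and idf variants were used, so this example assumes relative term frequency for tf and log(N / df) for idf, where N is the number of documents and df is the number of documents containing the term. The function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

Under this variant, a term that appears in every document receives a score of zero, which is why terms shared by the whole corpus do not rank among the most important ones.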

4.2 Latent Dirichlet Allocation Topic Modeling

Topic models are widely used for understanding the content of documents based on word usage. In this paper, we explore Latent Dirichlet Allocation (LDA), a probabilistic topic model, to capture the topics of fake, real and hybrid news corpus respectively. The goal of LDA topic modeling on fake and real news is to understand the difference or similarity of topics between labeled fake and real news.
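For reference, the standard generative process assumed by LDA can be summarized as follows. The notation is introduced here for illustration and does not appear elsewhere in this paper: K is the number of topics, \(\alpha\) and \(\beta\) are Dirichlet hyperparameters, \(\theta_d\) is the topic mixture of document d, \(\phi_k\) is the word distribution of topic k, and \(z_{d,i}\) and \(w_{d,i}\) are the topic assignment and the observed word at position i of document d.

$$\begin{aligned} \phi_k&\sim \mathrm{Dirichlet}(\beta ), \quad k = 1, \dots , K, \\ \theta_d&\sim \mathrm{Dirichlet}(\alpha ), \\ z_{d,i}&\sim \mathrm{Multinomial}(\theta_d), \\ w_{d,i}&\sim \mathrm{Multinomial}(\phi_{z_{d,i}}). \end{aligned}$$

Fitting the model amounts to inferring \(\theta_d\) and \(\phi_k\) from the observed words; the most probable terms under each \(\phi_k\) are what topic tables typically report.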

Table 4. LDA topics for fake news corpus
Table 5. LDA topics for real news corpus
Table 6. LDA topics for hybrid fake and real news corpus

Tables 4, 5 and 6 illustrate three topics with the 5 most frequent terms for each corpus. As shown in these tables, the fake and real news share strong similarity in the overall topics. Thus a topic model alone is not an effective approach to detect or differentiate fake and real news in the real world.

5 Document Similarity Analysis for News Predictions

As the LDA topics are ineffective for distinguishing fake and real news, our follow-up analysis explores document similarity between fake, real and hybrid news articles. First, we randomly divide the labeled fake and real news into training sets and test sets with a split ratio of 67% for training and 33% for testing.
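A random split of this kind can be sketched as follows. The function name, the fixed seed, and the rounding of the cut point are illustrative assumptions; the paper specifies only the 67%/33% ratio.

```python
import random


def split_train_test(items, train_fraction=0.67, seed=0):
    """Shuffle items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling the function once on the fake news and once on the real news would produce the two training sets and the corresponding test sets.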

For each fake or real news article n in the test corpus, we measure the document similarity between n and every news article in the fake news training set \(\mathcal {F}\) and the real news training set \(\mathcal {R}\). In particular, we calculate the Jaccard similarity \(J(doc_1, doc_2)\), a widely used similarity measure between two documents \(doc_1\) and \(doc_2\), with the following equation

$$\begin{aligned} J(doc_1, doc_2) = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|}, \end{aligned}$$
(2)

where \(doc_1\) and \(doc_2\) are represented as vectors, typically sparse, of the terms in the documents, and \(|\cdot |\) denotes the number of distinct terms.
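The Jaccard similarity in Eq. (2) and the average similarity of a test article against a training set can be sketched as follows. Documents are assumed to be given as lists of tokens and are treated as sets of distinct terms; the function names, and the convention that two empty documents have similarity 0, are our own assumptions.

```python
def jaccard(doc1_terms, doc2_terms) -> float:
    """Jaccard similarity between the term sets of two documents."""
    a, b = set(doc1_terms), set(doc2_terms)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(union)


def average_similarity(doc_terms, training_docs) -> float:
    """Mean Jaccard similarity between one document and a training set."""
    docs = list(training_docs)
    if not docs:
        return 0.0
    return sum(jaccard(doc_terms, other) for other in docs) / len(docs)
```

For a test article n, comparing `average_similarity(n, F)` with `average_similarity(n, R)` gives the two quantities contrasted in the analysis that follows.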

Fig. 2. The prediction on fake and real news based on labeled fake and real news corpus

Figure 2(a) shows that the fake news in the test set have a much higher average document similarity with the news in the fake news training set \(\mathcal {F}\) than with those in \(\mathcal {R}\). However, Fig. 2(b) shows that the real news in the test set have surprisingly similar document similarity with the news in the real news training set \(\mathcal {R}\) and with those in \(\mathcal {F}\). Thus, as shown in Fig. 2(a), document similarity can potentially detect fake news. One direction of our future work is to systematically quantify the precision and recall of detecting both fake and real news in a large-scale news corpus.

In summary, our preliminary analysis of the topics and content of fake and real news reveals that it is very challenging to detect fake news effectively by simply exploring tf-idf and LDA topic modeling. However, our study also shows the promising aspect of leveraging document similarity to distinguish fake and real news by measuring the document similarity of the news under test with the known fake and real news corpus.

6 Related Work

In recent years, several algorithms [1, 2, 8, 14,15,16,17,18,19,20] have been proposed to detect the dissemination of information, misinformation or fake news. For example, [2] exploits the diffusion patterns of information to automatically classify and detect misinformation, hoaxes or fake news, while [8] proposes linguistic approaches, network approaches, and a hybrid approach combining linguistic cues and network-based behavior insights for identifying fake news. In addition, [1] reviews the data mining literature on characterizing and detecting fake news on social media.

Similarly, [14] proposes an SVM-based algorithm for predicting misleading news with predictive features such as absurdity, grammar, punctuation, humor, and negative affect, and [15] uses logistic regression to distinguish credible news from fake news based on n-gram linguistic, embedding, capitalization, punctuation, pronoun use, and sentiment polarity features. A recent effort in [16] formulates fake news mitigation as the problem of optimal point process intervention in a network, and combines reinforcement learning with a point process network activity model for mitigating fake news in social networks.

In addition, [21] classifies the task of fake news detection into three different types: serious fabrications, large-scale hoaxes, and humorous fakes, and discusses the challenges of detecting each type of fake news. To address the lack of labeled data-sets for fake news detection, [22] introduces a real-world data-set consisting of 12,836 statements with real or fake labels. In [17], the authors locate the hidden paid posters who get paid for posting fake news by modeling the behavioral patterns of paid posters.

7 Conclusions and Future Work

As fake news and disinformation continue to grow in online social media, it becomes imperative to gain an in-depth understanding of the characteristics of fake and real news articles for better detecting and filtering fake news. Towards effectively combating fake news, this paper characterizes hundreds of very popular fake and real news from a variety of perspectives, including the domains and reputations of the news publishers, as well as the important terms of each news article and their word embeddings. Our analysis shows that the fake and real news exhibit substantial differences in the reputations and domain characteristics of the news publishers. On the other hand, the topics and word embeddings show little or only subtle differences between fake and real news. Our future work is centered on exploring the word2vec algorithm [23], a computationally efficient predictive model based on neural networks for learning the representations of words in a high-dimensional vector space, to learn word embeddings of the important words or terms discovered via the aforementioned tf-idf analysis. Rather than comparing the few important words of each news article, word2vec will allow us to compare the entire vectors and embeddings of each word for broadly capturing the similarity and dissimilarity of the content in the fake or real news.