A Technique to Detect Fake News Using Machine Learning

Yadav, Pritee; Hasan, Muzammil

doi:10.1007/978-981-16-2354-7_29


Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 768)


Abstract

Information spreads on social networks at such a rapid pace, and is so amplified, that distorted, inaccurate, or false information has enormous potential to cause real-world effects. We examine the fake news problem and the current technical concerns related to the need for robust automated fake news detection frameworks, and we also examine the opportunities this presents. In this paper, we propose a systematic framework for fake news identification and report the confusion matrix and classification report for each model. The proposed approach achieves a classification accuracy of 94% and a recall of 97%. We collect a dataset of real and fake news that is converted from a document-based corpus into an event- and title-based representation.



1 Introduction

Data veracity is an issue affecting society in the age of digital media. Sensationalism, in the form of inaccurate, eye-catching headlines designed to hold the attention of audiences in order to sell information, has persisted since the beginning of all kinds of mass communication.

Young people are reported to be technically savvy compared with their parents [1]. However, when it comes to telling whether a news item is fake or not, they appear as confused as the rest of the population, and in a study conducted by Common Sense Media, 44% confirmed this assessment. Related research similarly shows that 31% of children aged 10–18 have shared online at least one story that they later found was fake or inaccurate. This situation adds an entirely different dimension to the concerns associated with digital literacy, one that goes beyond the mere ability to access and manage technology.

Along with these societal challenges, a subtle but dramatic shift is taking place in the media landscape, the journalism industry, and the public sphere that calls for examination and discussion, raising two main points. The first is that news publishers have lost control over news distribution: news is presented to users on social media by algorithms that are unpredictable. In addition, news market newcomers (for example, Vox, Fusion, and BuzzFeed) have built their presence by embracing these techniques, increasingly undermining the long-standing positions held by traditional news publishers. The second is the growing power that online platforms, e.g., Amazon, Facebook, Google, and Apple, have gained in controlling how publications are monetized and who publishes what to whom.

In this context, establishing the reliability of online information is a daunting but critical current challenge [2]. It demands regulation, attention, and active monitoring of the digital content disseminated by the major parties involved, including search engines and social networking platforms, with respect to how information is presented and shared among people over the Internet. The subject of “fake news” has become so prevalent that the Commons Culture, Media and Sport Committee is currently investigating concerns about the public being swayed by propaganda and untruths [3]. Fake news detection is defined as predicting the chances that a specific news article (editorial, exposé, news report, etc.) is deliberately deceptive [2, 4].

2 Literature Review

Fake news detection: Fake news is produced by fabricating non-existent news or altering genuine news. Its credibility is reinforced by (1) imitating the writing styles of well-known authors and (2) expressing opinions in a tone commonly used in genuine news. Recently, a growing number of detection methods have been developed. All current detection schemes can be grouped into two distinct classes, namely network-based methods and linguistic-based methods [5]. Network-based approaches to fake news detection use network properties as a supporting component for linguistic-based analysis. Commonly used network properties include, but are not limited to, site data, author/subscriber information, likes, and time stamps. One example is reducing misinformation by performing user analysis in an online social media community related to Parkinson's disease; that study finds that the misinformation embedded within a discussion thread depends on its content and on the characteristics of its author [6]. Another study proposes a model that focuses on examining crowd-sourced health information, the clarity of the thread question, and the potential of users to make useful contributions to the quality of the answers in an online source. However, the syntax analysis and assumptions in these studies are tailored to specific data types and are therefore insufficient for a general fake news detection system.

The CNT rumor detection model [7] adopts a variety of features, such as content-based features (part of speech, segments, and word appearance), network-based features (i.e., tweet propagation or re-tweets), and Twitter-specific memes (for example, shared URLs or hashtags). CNT combines a variety of feature selection approaches to identify misinformation in microblogs [8]. Another study [9] uses deep network models (CNNs) to investigate news veracity. Fake articles sometimes exhibit extreme behavior in favor of a political party.

Knowledge-Based Detection: This is the most direct method of detection. It checks the truthfulness of the major claims in an article to decide the veracity of the news. The approach aims to use external sources to verify the claims made in news content [10, 11], and it is a key component of fake news detection. A knowledge graph is used to check whether the claims can be inferred from the facts that exist in the graph [12,13,14].

Visual-Based Detection: Digitally altered images spread like wildfire on social media. Applications such as Photoshop are now freely used to modify images convincingly enough to trick people into thinking they are seeing a genuine image. The field of media forensics has produced a considerable number of methods for detecting tampering in videos [15].

Wang (2017), Pros: In this paper, fake news content is classified using a CNN, which is also used to analyze a variety of meta-data. Standard text classifiers, such as SVM models, obtained significant improvements. Cons: Bi-LSTMs did not perform well because of overfitting [16].

Shu et al. [17], Pros: This framework could be deployed to detect fake news in real time. Cons: The framework, which uses only social engagement information, is outperformed by approaches that use only textual data [18] (Fig. 1).

3 Research Methodology

As shown in the figure [19], we first present the methodology framework used for detecting fake news. The framework integrates the training and testing procedures and includes the following modules: collection of training data, pre-processing of data, training of the algorithms on TF-IDF and count features, and model training and results. The data pre-processing module combines a variety of text processing techniques to extract topics from news datasets collected from websites.

Text preparation: Social media data are highly unstructured. Most of them are informal communication with grammatical errors, slang, typos, etc. For predictive modeling, it is important to clean the data before using it. For this reason, basic pre-processing was performed on the news training data.
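The paper does not list the individual cleaning steps, so the following is only a minimal sketch of what basic pre-processing might look like. The function name `clean_text` and the chosen steps (lowercasing, dropping URLs, removing non-letter characters, collapsing whitespace) are assumptions made for illustration, not the authors' pipeline.

```python
import re

def clean_text(text):
    """Hypothetical cleaning steps; the paper does not specify its own."""
    text = text.lower()                         # normalize case
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

cleaned = clean_text("BREAKING!!! U won't BELIEVE this: http://example.com/x")
print(cleaned)
```

Real pipelines often add further steps, such as stop-word removal or stemming, depending on the features used downstream.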

Feature generation: Text data can be used to generate many features, such as word count, frequency of unique words, frequency of large words, n-grams, etc. Creating a representation of the words captures their meanings, their semantic relationships, and the various kinds of context in which they are used.

TF-IDF vectors (feature): The TF-IDF weight represents the relative importance of a term in a document and in the corpus as a whole.

(a) TF (term frequency): A term may appear more often in a long document than in a short one. The term frequency is therefore usually divided by the length of the document.
$$TF(t,d) = \frac{\text{number of times term } t \text{ occurs in document } d}{\text{total word count of } d}$$
(b) IDF (inverse document frequency): Certain terms such as “on”, “an”, “a”, “the”, and “of” appear frequently in documents but carry little significance. IDF weighs down the significance of these terms and increases the significance of rare ones. The higher the IDF value, the more distinctive the word.
$$IDF(t) = \log_{e}\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
(c) TF-IDF: It works by penalizing the most commonly occurring words, assigning them less weight, while giving more weight to terms that occur frequently in a specific document.
$$TFIDF(t,d) = TF(t,d) \times IDF(t)$$

TF-IDF is a widely used feature for text classification. TF-IDF vectors can also be computed at different levels, for example, the N-gram level and the word level.
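The TF, IDF, and TF-IDF definitions in this section can be sketched in a few lines of plain Python. The function names and the three-document toy corpus are assumptions for illustration. The guard that returns 0 for a term absent from the corpus is also an assumption, since the formula is undefined in that case. Library implementations such as Scikit-learn's typically add smoothing and normalization, so their values will differ from this literal version.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Occurrences of `term` in the document divided by its word count."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Natural log of (number of documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(corpus) / containing)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (illustrative only): each document is a list of word tokens.
corpus = [
    "the senate passed the bill".split(),
    "the aliens endorsed the bill".split(),
    "the weather is mild".split(),
]

print(tfidf("the", corpus[0], corpus))     # common word: IDF is log(3/3) = 0
print(tfidf("aliens", corpus[1], corpus))  # rare word: positive weight
```

The word “the” occurs in every document, so its weight is zero, while “aliens” occurs in a single document and receives a positive weight.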

4 Algorithm

This section deals with training the classifier. We used three machine learning algorithms: the Naive Bayes algorithm, the Logistic Regression algorithm, and the Passive-Aggressive algorithm. For detailed information on the implementation of the algorithms, the reader may refer to the examples available in [20,21,22,23,24,25].

Naive Bayes: This is a classification technique based on Bayes' theorem. It is a family of supervised learning algorithms that assumes the presence of a particular feature in a class is independent of the presence of any other feature, and it provides a way to calculate the posterior probability.

$$P(y \mid x_{1}, \ldots, x_{n}) = \frac{P(y)\, P(x_{1}, \ldots, x_{n} \mid y)}{P(x_{1}, \ldots, x_{n})}$$

$P(y \mid x_{1}, \ldots, x_{n})$ = posterior probability of the class given the predictors.

$P(x_{1}, \ldots, x_{n} \mid y)$ = likelihood, i.e., the probability of the predictors given the class.

$P(y)$ = prior probability of the class.

$P(x_{1}, \ldots, x_{n})$ = prior probability of the predictors.

Using the naive conditional independence assumption

$$P(x_{i} \mid y, x_{1}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{n}) = P(x_{i} \mid y)$$

for all $i$, the relationship simplifies to

$$P(y \mid x_{1}, \ldots, x_{n}) = \frac{P(y) \prod_{i=1}^{n} P(x_{i} \mid y)}{P(x_{1}, \ldots, x_{n})}$$

Since $P(x_{1}, \ldots, x_{n})$ is constant given the input, we can use the following classification rule:

$$P(y \mid x_{1}, \ldots, x_{n}) \propto P(y) \prod_{i=1}^{n} P(x_{i} \mid y)$$

$$\Downarrow$$

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_{i} \mid y)$$

We can use maximum a posteriori (MAP) estimation to estimate $P(y)$ and $P(x_{i} \mid y)$; the former is then the relative frequency of class $y$ in the training set.
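As a rough illustration of the classification rule above, the sketch below trains a word-count Naive Bayes classifier on a toy dataset and picks the class with the highest score. The function names, the toy documents, and the use of add-one (Laplace) smoothing with `alpha=1.0` are assumptions that do not come from the paper. The product is computed as a sum of logarithms, which is a standard way to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(x_i | y) from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    # P(y): relative frequency of each class in the training set
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # P(x_i | y): smoothed relative frequency of each word within the class
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(tokens, log_prior, log_likelihood):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y); unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (illustrative only)
docs = [
    "shocking miracle cure revealed".split(),
    "celebrity secret shocking scandal".split(),
    "parliament passes budget bill".split(),
    "court rules on budget dispute".split(),
]
labels = ["fake", "fake", "real", "real"]

log_prior, log_likelihood = train_nb(docs, labels)
prediction = predict_nb("shocking secret cure".split(), log_prior, log_likelihood)
print(prediction)
```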

Multinomial NB: Multinomial NB implements the Naive Bayes algorithm for multinomially distributed data and is one of the classic Naive Bayes variants used in text classification, where the data are typically represented as word count vectors, although tf-idf vectors are also known to work well in practice.
Passive-Aggressive Classifier: This algorithm is well suited to classifying massive streams of data (e.g., from Facebook or Twitter). The Passive-Aggressive algorithm is easy to implement and very fast.
Logistic Regression: Logistic Regression is a classification algorithm used to predict the occurrence of an event (true/false, 1/0, yes/no). It uses the sigmoid function to estimate probability.
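To make the last two descriptions more concrete, here is a small sketch of the sigmoid function used by logistic regression and of a single update step of the basic passive-aggressive algorithm for binary labels in {-1, +1}. The function names and the example vectors are assumptions for illustration. The update shown is the unregularized textbook variant and is not necessarily the configuration used in the paper.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pa_update(w, x, y):
    """One passive-aggressive step for a label y in {-1, +1}.

    Passive: if the example is classified with margin >= 1, w is unchanged.
    Aggressive: otherwise w moves just far enough to reach margin 1.
    Assumes x is not the all-zero vector.
    """
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss
    if loss == 0.0:
        return w
    tau = loss / dot(x, x)                 # step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = pa_update([0.0, 0.0], [1.0, 2.0], +1)
print(w, dot(w, [1.0, 2.0]))
print(sigmoid(0.0))
```

Because each update touches only one example at a time, this style of learner can process data as it arrives, which is consistent with the paper's remark about large volumes of social media data.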

5 Evaluation Metrics for Assessing the Performance

In this section, we discuss some of the important metrics by which the performance of an ML model is measured. These metrics quantify how well the model can make predictions or classify. The following metrics were used in this project.

Classification Accuracy: It is defined as the number of correct predictions divided by the total number of predictions. However, this metric alone cannot give enough information to decide whether the model is a good one or not.

$$\text{Classification Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$$

Confusion Matrix: The confusion matrix, also called an error matrix, is a type of contingency table that shows the model's performance. The table has two dimensions: the “predicted” label on the y-axis and the “actual” label on the x-axis. Each cell of the table contains the number of predictions made by the algorithm for that combination of labels.

Classification Report: For classification problems, Scikit-learn provides a report, called the classification report, which gives the precision, recall, and F-score for each class.

Precision: Precision is the ratio of correctly predicted positive instances to all instances predicted as positive. High precision corresponds to a low number of false positives.

$$Precision=\frac{TP}{TP+FP}$$

Recall: Recall is the ratio of correctly predicted positive instances to all instances in the actual positive class.

$$Recall=\frac{TP}{TP+FN}$$

F-score: The F-score is the harmonic mean of precision and recall, so it takes both FP and FN into account. It is generally more useful than accuracy, particularly for an uneven class distribution. Accuracy works best when FP and FN have similar costs.

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$

If the costs of FP and FN differ widely, it is better to look at both recall and precision.
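The four formulas in this section can be computed directly from the cells of a confusion matrix. The sketch below uses hypothetical counts chosen only for illustration. They are not the paper's results, and the function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the paper
m = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The function assumes that at least one positive prediction and one actual positive exist. Otherwise precision or recall is undefined and the division fails.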

Tf-Idf vectors and count vectors are used for classification at two levels: the N-gram level and the word level. At the N-gram level, sequences of one to three words are taken as tokens (unigrams, bigrams, and trigrams). At the word level, single words are taken as tokens. Classification accuracy at the word level is better than at the N-gram level. With Tf-Idf vectors, the passive-aggressive classifier and stochastic gradient descent perform well at both levels, with accuracy above 90%. Since classification accuracy alone is not sufficient to judge the viability of a model, other metrics are explored for these algorithms using Tf-Idf vectors at the word level.
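The difference between the two tokenization levels can be illustrated with a short helper. The function `ngrams` and the example sentence are assumptions for illustration. The range of one to three words follows the description above.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous word sequences of length n_min..n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "fake news spreads fast".split()
word_level = ngrams(tokens, 1, 1)    # single words only
ngram_level = ngrams(tokens, 1, 3)   # unigrams, bigrams, and trigrams

print(word_level)
print(ngram_level)
```

The N-gram vocabulary grows much faster than the word-level vocabulary, which makes the feature vectors larger and sparser.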

6 Experimental Result

We use Tf-Idf vectors and count vectors as features, with the algorithms applied at the word level, and record the accuracy of each model. We applied text classification to the body of the articles in various freely available datasets from the UCI Machine Learning Repository.

Confusion Matrix:

Linear-TFIDF: Fig. 2. True Positive = 56, False Positive = 80, True Negative = 1003, False Negative = 952.
Multinomial NB-TFIDF: Fig. 3. True Positive = 48, False Positive = 76, True Negative = 1007, False Negative = 960.
PAC-TFIDF: Fig. 4. True Positive = 52, False Positive = 82, True Negative = 1001, False Negative = 956 (Fig. 5).
Fig. 2 Confusion matrix
Fig. 3 Linear-TFIDF confusion matrix
Fig. 4 Multinomial NB-TFIDF confusion matrix
Fig. 5 PAC-TFIDF confusion matrix

Classification Report: Linear-TFIDF (Fig. 6).

PAC-TFIDF (Fig. 7).

Multinomial NB-TFIDF (Fig. 8).

The precision of the Linear-TFIDF model is 96%, which is higher than that of the other models, whose precision is 92%. The Multinomial NB-TFIDF model has an avg/total of 94%, which is higher than that of the other models. For fake news detection, sensitivity indicates how sensitive the classifier is in detecting fake news, while specificity indicates how specific or selective the model is in predicting genuine news. Accordingly, sensitivity should be higher, because FP is more acceptable than FN in classification problems for such applications.

7 Conclusion and Future Scope

The misinformation spread by fake news poses a real risk to its target consumers, which may be enterprises as well as individuals. An individual who consumes fake news develops a misinterpreted or distorted perception of the real world, which affects their decision-making and beliefs, while enterprises suffer from fake news through loss of competitive advantage or damage to their image. In this paper, we proposed a framework for fake news identification and reported the confusion matrix and classification report for each model. The sensitivity is high for both models and has comparable values. By further optimizing sensitivity, we can obtain improved results. We can increase the classifier's sensitivity by lowering the threshold for predicting fake news, which would help increase the true positive (TP) rate in this work. This analysis also calls for an improved and complete repository of fake and real news, which could be used for future analysis in this significant and developing area of research.
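The effect of lowering the decision threshold on sensitivity can be illustrated with a toy example. The probabilities, labels, and function names below are hypothetical and are not taken from the experiments. Label 1 stands for fake news.

```python
def predict_with_threshold(probs, threshold):
    """Label an article as fake (1) when its predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Hypothetical predicted probabilities of "fake" and true labels
probs = [0.9, 0.6, 0.45, 0.3, 0.2]
y_true = [1, 1, 1, 0, 0]

default = sensitivity(y_true, predict_with_threshold(probs, 0.5))
lowered = sensitivity(y_true, predict_with_threshold(probs, 0.4))
print(default, lowered)
```

Lowering the threshold generally trades additional false positives for fewer false negatives, so the gain in sensitivity should be weighed against the loss in precision.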

References

1. Anderson J (2017) Even social media-savvy teens can’t spot a fake news story
2. Conroy N, Rubin V, Chen Y (2015) Automatic deception detection: methods for finding fake news. Proc Assoc Inf Sci Technol 52(1):1–4
3. BBC N (2016) The rise and rise of fake news. BBC Trending. http://www.bbc.com/news/blogs-trending-37846860
4. Rubin V, Conroy N, Chen Y (2015) Towards news verification: deception detection methods for news discourse
5. Yadav P, Hasan M (2020) A review: issues and challenges in various fake news detection mechanism. In: Proceedings of international conference on advances in computational technologies in science and engineering ACTSE
6. Conroy NJ, Victoria LR, Yimin C (2015) Automatic deception detection: methods for finding fake news. Proc Assoc Inf Sci Technol 52(1)
7. Qazvinian V et al (2011) Rumor has it: identifying misinformation in microblogs. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics
8. Aggarwal A, Kumar M (2020) Image surface texture analysis and classification using deep learning. Multimed Tools Appl
9. Aggarwal A et al (2020) Landslide data analysis using various time-series forecasting models. Comput Electr Eng 88
10. Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O Open information extraction from the web
11. Magdy A, Wanas N (2010) Web-based statistical fact checking of textual documents. In: Proceedings of the 2nd international workshop on search and mining user-generated contents
12. Ciampaglia GL, Shiralkar P, Rocha LM, Bollen J, Menczer F, Flammini A (2015) Computational fact checking from knowledge networks. PLoS ONE
13. Wu Y, Agarwal PK, Li C, Yang J, Yu C (2014) Toward computational fact-checking. In: Proceedings of the VLDB endowment
14. Shi B, Weninger T Fact checking in heterogeneous information networks. In: WWW'16
15. Kumar M, Srivastava S (2019) Image authentication by assessing manipulations using illumination. Multimed Tools Appl 78(9):12451–12463
16. Shu K, Wang S, Liu H (2017) Exploiting tri-relationship for fake news detection. arXiv:1712.07709
17. Shu K, Wang S, Liu H (2017) Exploiting tri-relationship for fake news detection
18. Shu K, Wang S, Tang J, Zafarani R, Liu H (2017) User identity linkage across online social networks: a review. ACM SIGKDD Explorations Newsletter 18(2)
19. Ahmad I, Yousaf M, Yousaf S, Ahmad MO (2020) Fake news detection using machine learning ensemble methods. Complexity 2020:1–11. Article ID 8885861
20. Aggarwal S et al (2020) Meta heuristic and evolutionary computation: algorithms and applications. Springer Nature, Berlin, 949 p. https://doi.org/10.1007/978-981-15-7571-6. ISBN 978-981-15-7571-6
21. Yadav AK et al (2020) Soft computing in condition monitoring and diagnostics of electrical and mechanical systems. Springer Nature, Berlin, 496 p. https://doi.org/10.1007/978-981-15-1532-3. ISBN 978-981-15-1532-3
22. Gopal et al (2021) Digital transformation through advances in artificial intelligence and machine learning. J Intell Fuzzy Syst, Pre-press, 1–8. https://doi.org/10.3233/JIFS-189787
23. Smriti S et al (2018) Special issue on intelligent tools and techniques for signals, machines and automation. J Intell Fuzzy Syst 35(5):4895–4899. https://doi.org/10.3233/JIFS-169773
24. Jafar A et al (2021) AI and machine learning paradigms for health monitoring system: intelligent data analytics. Springer Nature, Berlin, 496 p. https://doi.org/10.1007/978-981-33-4412-9. ISBN 978-981-33-4412-9
25. Sood YR et al (2019) Applications of artificial intelligence techniques in engineering, vol 1. Springer Nature, 643 p. https://doi.org/10.1007/978-981-13-1819-1. ISBN 978-981-13-1819-1


Author information

Authors and Affiliations

Department of Computer Science and Engineering, Madan Mohan Malaviya University of Technology, Gorakhpur, India
Pritee Yadav & Muzammil Hasan


Editor information

Editors and Affiliations

Electrical Engineering Group, Eindhoven University of Technology, Eindhoven, The Netherlands
Anuradha Tomar
BEARS, University Town, NUS Campus, Singapore, Singapore
Hasmat Malik
Department of Computer Science and Engineering, Krishna Engineering College, Ghaziabad, India
Pramod Kumar
Department of Electrical Engineering, Qatar University, Doha, Qatar
Atif Iqbal


Cite this paper

Yadav, P., Hasan, M. (2022). A Technique to Detect Fake News Using Machine Learning. In: Tomar, A., Malik, H., Kumar, P., Iqbal, A. (eds) Machine Learning, Advances in Computing, Renewable Energy and Communication. Lecture Notes in Electrical Engineering, vol 768. Springer, Singapore. https://doi.org/10.1007/978-981-16-2354-7_29


DOI: https://doi.org/10.1007/978-981-16-2354-7_29
Published: 20 August 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2353-0
Online ISBN: 978-981-16-2354-7
eBook Packages: Energy, Energy (R0)
