
1 Introduction

Social media has long occupied a meaningful place in people's lives, and fake news spreads primarily through social media and articles available online. Fake news puts politics, democracy, and education, as well as finance and business, at risk. Although false news is not a new problem, people today place greater reliance on social media, which leads to the acceptance of deceitful claims and the further propagation of the same misinformation. It is becoming harder to tell the difference between accurate and misleading news, which leads to confusion and complications. Manually recognizing fake news is difficult; it is only achievable when the person identifying the news has extensive expertise in the subject. Fake news can destroy someone's career; if it is political, it can harm a nation and its citizens; and it can also damage businesses, products, and reputations. Recent advances in computer science have made it easier to manufacture and circulate fake news, but much more difficult to determine whether a piece of information is accurate.

As a result, we implemented and compared five different methods in this study. We use supervised ML classification techniques, i.e., logistic regression, decision tree, Naïve Bayes, Random Forest, and support vector machine, to determine whether a transmitted news item is authentic. We used a dataset available on the Kaggle website, which consists of two files, one containing authentic news items and the other fraudulent ones.

The paper is structured into five sections. In Sect. 1, we provide a formal introduction to the research and its motivation. In Sect. 2, we discuss the relevant research in this field and the algorithms used in this project. In Sect. 3, we go through the methodology, covering the flowchart, the dataset, and the machine learning techniques employed in this research. In Sect. 4, we discuss the implementation and the results obtained. Finally, in Sect. 5, we conclude the study, summarize the results with their accuracy, and propose future work.

2 Literature Survey

The primary goal of this study is to discover the most effective classification technique for detecting and quantifying false news. To that end, we examined several classification techniques and applied five of them in our model. Below, we provide a brief review of the papers we studied.

In 2017, Granik and Mesyura presented a methodology for detecting bogus news that applied Naive Bayes to news posts on Facebook, achieved 74% accuracy, and concluded that AI techniques can successfully handle this kind of problem [1].

In 2017, Ahmed, Traore, and Saad suggested a methodology for identifying false news based on an AI model built on n-gram analysis. The classifier used was a support vector machine, which achieved a precision of 92% [2].

In 2017, Campan et al. studied how fake news spreads on social media and how the Internet affects the creation and diffusion of false information. They also discussed solutions to reduce the dissemination of false information and outlined future research directions in this area [3].

In 2017, Pérez-Rosas, Kleinberg, and colleagues suggested a model that automatically detects fake news in online resources. They developed computational algorithms and tools to detect bogus news and worked with two different datasets: the first was collected from the Internet, and the second resulted from a mix of manual data collection and Internet assistance [4].

In 2018, Aphiwongsophon and Chongstitvatana studied the use of Naïve Bayes, SVM, and neural networks to detect fake news and calculated the corresponding performance measures: Naïve Bayes achieved 96.08% accuracy, while the neural network and SVM reached 99.90%. Through this experiment, they found that neural networks and support vector machines offer high accuracy and high confidence [5].

In 2018, Gahirwal and colleagues suggested a support vector machine-based detection model for distinguishing false from real news, with an accuracy of 87%. The model relied on five predictive features: comedy, negative words, ridiculousness, syntax, and punctuation. Its goal was to verify that the substance of a news piece was accurate [6].

In 2019, Ozbay and Alatas used AI techniques for detecting fake news. In the first phase, they preprocessed the dataset to transform unstructured data into structured data; they then used text mining and applied about twenty-three supervised AI algorithms to three real-world datasets, recording accuracy and other performance measures. The best average values were obtained with the decision tree, ZeroR, CVPS, and WIHW algorithms [7].

In 2019, Agarwal and colleagues took the Liar dataset and gave a comprehensive study of various approaches. They offer a stacking model that fine-tunes the knowledge obtained from user input at each level before attempting a prediction [8].

In 2019, Reis et al. looked for a range of elements in news articles, postings, and stories that might help identify false news with increased precision, and demonstrated the significance of these new qualities in evaluating bogus news. Discrimination, integrity, involvement, domain location, and temporal patterns are some of these characteristics. They used 2282 BuzzFeed news articles in their analysis. Applying KNN, Naive Bayes, Random Forest, XGBoost, and support vector machine (SVM) classifiers, they analyzed the strengths and limits of this approach and found that XGBoost performed best, with an accuracy of 0.86 [9].

In 2020, Zhang and Ghorbani's study elaborated that false information is a serious concern for industry as well as academia, as it is widely used to confuse and persuade online users with skewed facts. Furthermore, the Internet generates and disseminates a vast volume of fabricated and incorrect information, which has emerged as a potential danger to social networking groups and has had a major adverse influence on online activities such as online commerce and networking sites [10].

In 2020, Shaikh and Patil used three AI techniques to build a model for detecting false news: SVM, Naïve Bayes, and the passive aggressive classifier, with SVM giving the highest accuracy of 95.05% [11].

In 2020, Smitha and Bharath illustrated a model and different methodologies to identify and quantify fake news using ML and NLP techniques. Seven different classification algorithms are proposed, and their accuracy, F1-score, recall, and precision are compared [12].

In 2020, Kesarwani et al. demonstrated a basic strategy for detecting false news on social media using a K-nearest neighbor classifier, which obtained an accuracy of roughly 79% when evaluated against a sample of Facebook news articles [13].

In 2021, Khanam and colleagues carried out their research in two phases: first, they used multiple supervised learning algorithms to define the essential principles and criteria of false news found in web-based media. They proposed using the scikit-learn library for processing text data and applied feature-selection techniques to select the best fit [14].

In 2021, Nagaraja and colleagues showed in their study that false information mostly circulates through social media and is propagated further without investigation of the underlying facts. They applied various NLP techniques and two ML algorithms, i.e., Naïve Bayes and SVM, which gave 63% and 75% accuracy, respectively [15].

A significant amount of prior and ongoing research addresses fake news detection. Misleading information has always been a serious concern worldwide due to its bad influence on social, religious, educational, and civic life, among many other fields. To carry out this study, we examined research papers, filtered them for an extensive literature review, and summarized them. The model proposed in this study draws on this body of research and on the machine learning algorithms applied in those papers, in order to propose a better way of detecting fake news. We selected the supervised algorithms that performed best in the studies reported in [2, 5,6,7, 9, 11, 15] and implemented them to obtain our results.

3 Methodology

The various stages involved in this experiment are shown in the flowchart in Fig. 1.

Fig. 1 Flowchart for fake news detection

3.1 Dataset

We obtained the dataset for this study from the Kaggle website [16]; it contains two files, one with real news articles and the other with fake news articles. There are around 21,417 real articles and around 23,481 fake articles, for a total of roughly 44,898. To proceed, we combined the two files into a single dataset containing both fake and real news. Figure 2 shows the head of our data.

Fig. 2 Head of the data

The dataset we are using has five features: title, text, subject, date, and target. We dropped the unnecessary columns, i.e., date and title, as we work with only the text here.
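A minimal sketch of this preparation step is given below. It assumes the two Kaggle files are named True.csv and Fake.csv (hypothetical names; adjust to the actual download) and that the target column is added while merging.

```python
import pandas as pd

real = pd.read_csv("True.csv")  # ~21,417 real articles
fake = pd.read_csv("Fake.csv")  # ~23,481 fake articles

# Label each subset, then combine into a single shuffled dataframe
real["target"] = "true"
fake["target"] = "fake"
data = pd.concat([real, fake], ignore_index=True).sample(frac=1, random_state=42)

# Drop the columns not used for modeling; only the text (and label) remain relevant
data = data.drop(columns=["date", "title"])
print(data.head())
```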

The data used here is text, and text data requires preprocessing to be converted into a structure suitable for modeling. There are several methods for transforming text data, including the natural language text processing approaches we employed. After removing the date and title columns, which were no longer needed for modeling, we converted the text to lowercase. We then performed stop word removal: many words, such as 'a,' 'is,' 'the,' and 'am,' occur very frequently in a document yet carry little information, so to improve the accuracy of the analysis these words are ignored, using the natural language toolkit (NLTK) library. After that, punctuation removal was performed, since punctuation marks such as commas and full stops do not add much meaning to the text and should be filtered out. Before feeding the data into the machine learning models, we split it into train and test sets, with 70% of the data for training and 30% for testing. The training subset has already revealed its outcomes to the model, while the test subset is used to evaluate the detection model by predicting outcomes. For feature extraction, we concentrated on two feature-weighting techniques: term frequency and term frequency-inverse document frequency. The latter retrieval methodology considers the frequency of a phrase as well as its inverse document frequency.
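A minimal sketch of this preprocessing pipeline, continuing from the dataframe above (the exact cleaning code is not given in the paper, so this is one plausible realization):

```python
import string

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# nltk.download("stopwords")  # one-time download of the NLTK stop-word list
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, and remove stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in stop_words)

data["text"] = data["text"].apply(preprocess)

# 70/30 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["target"], test_size=0.3, random_state=42
)

# TF-IDF feature extraction: term frequency weighted by inverse document frequency
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
```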

Term frequency (TF) measures how often a term occurs in a text: the quantity n in the formula represents the number of times the term appears in each document. As a result, each term in each document receives a TF value.
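The formula itself is not reproduced in the paper; under the usual conventions, the two weighting schemes can be stated as

$$ \mathrm{TF}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad \mathrm{TF\text{-}IDF}(t,d) = \mathrm{TF}(t,d) \times \log\frac{N}{\mathrm{DF}(t)} $$

where n_{t,d} is the number of times term t appears in document d, N is the total number of documents, and DF(t) is the number of documents containing t.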

3.2 Machine Learning Algorithms

In this paper, we are presenting five different supervised machine learning classification algorithms. The following is a quick rundown of all the algorithms used:

3.2.1 Logistic Regression

Logistic regression is a tool for binary classification. Whereas linear regression fits a best-fit line, logistic regression is used when two classes are linearly separable. It falls within the supervised machine learning category and is a standard machine learning approach to classification problems. As a type of predictive analysis, logistic regression applies probability assumptions: a linear equation is used as input, and the logistic function and log odds complete the binary classification task. It therefore employs a more complex function than linear regression.

$$ y = d_{0} + d_{1}x $$
(1)
$$ Q = \frac{1}{1 + \mathrm{e}^{-y}} $$
(2)
$$ \ln\left(\frac{Q}{1 - Q}\right) = d_{0} + d_{1}x $$
(3)

where, in Eqs. (1) and (3), d0 is the intercept, d1 is the slope, and x is a data point. Equation (2) is the sigmoid function, in which Q maps y into the range (0, 1) and thereby limits the effect of outliers.
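As a rough illustration (the paper does not specify the authors' exact configuration), scikit-learn's LogisticRegression can be fitted on the TF-IDF features from Sect. 3.1:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Raise the iteration cap, a common adjustment for high-dimensional sparse input
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf, y_train)
pred_lr = lr.predict(X_test_tfidf)
print("LR accuracy:", accuracy_score(y_test, pred_lr))
```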

3.2.2 Decision Tree

It is a supervised ML classification algorithm, meaning the training data must specify both the inputs and the corresponding outputs. It takes a tree-like structure in which the data is repeatedly split according to a given parameter: the internal nodes represent the features of the dataset, the branches represent the decision rules, and each leaf node represents an outcome or decision.

3.2.3 Naïve Bayes

The Naive Bayes approach, a supervised machine learning methodology based on the well-known Bayes theorem, is used to tackle classification problems. It is most commonly used for text classification with large training datasets. The Naive Bayes classifier is one of the simplest and most effective classification methods: it allows rapid construction of machine learning models and efficient training and testing for speedy predictions. It is a probabilistic classifier, which means the algorithm is built entirely on computed probabilities and predicts based on an item's likelihood.

Naïve Bayes Equation:

$$ P(R|S) = \frac{P(S|R) \times P(R)}{P(S)} $$

where P(R|S) is the posterior probability, P(S|R) is the likelihood, P(R) is the class prior probability, and P(S) is the prior probability of the predictor.

3.2.4 Random Forest

It is a supervised ML technique based on ensemble learning, in which multiple classifiers are combined to solve a problem and to improve the performance of the model. Random Forest is a classifier that improves the predictive accuracy on a dataset by averaging the results of many decision trees applied to different subsets of that dataset.

3.2.5 Support Vector Machine

This approach aims to find a hyperplane in N-dimensional space (where N is the number of features) that clearly separates the data points. There is an assortment of hyperplanes that could separate the two classes of data points; our aim is to find the plane with the largest margin, i.e., the greatest distance between the relevant points of the two classes. Maximizing the margin provides some reinforcement, making it more straightforward to classify subsequent data points.
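The paper does not give exact implementations or hyperparameters for these classifiers, but a minimal scikit-learn sketch covering the four algorithms of Sects. 3.2.2-3.2.5 might look as follows; MultinomialNB and LinearSVC are our assumptions, chosen as reasonable defaults for sparse non-negative TF-IDF features:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": MultinomialNB(),      # suited to non-negative TF-IDF features
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": LinearSVC(),                  # linear kernel scales well to sparse text
}

# Fit each classifier on the TF-IDF training features and report test accuracy
for name, clf in classifiers.items():
    clf.fit(X_train_tfidf, y_train)
    print(f"{name}: {clf.score(X_test_tfidf, y_test):.4f}")
```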

3.3 Evaluating Measures

Evaluation metrics are frequently used to assess classification performance, and performance measures are the most common choice. We have therefore used the following metrics to evaluate our classifiers.

3.3.1 Accuracy

It gives the comparison of actual and predicted labels, i.e., it measures how often a classifier predicts accurately. It can be formulated as

$$ \text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total data points}} $$

3.3.2 Precision

Precision indicates how exact a classifier's positive predictions are. Precision P can be formulated as the ratio of true positives to total predicted positive instances:

$$ \text{Precision}\,(P) = \frac{\text{TP}}{\text{TP} + \text{FP}} $$

3.3.3 Recall

It tells what percentage of positive instances were successfully identified; its formula is

$$ \text{Recall}\,(R) = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

3.3.4 F-Measure

It is represented as a harmonic mean of precision and recall and can be formulated as

$$ F\text{-Measure}\,(F) = \frac{2PR}{P + R} $$

where

TP (True Positive): an instance is assigned to the class it actually belongs to,

FP (False Positive): an instance is assigned to a class it does not actually belong to,

FN (False Negative): an instance is not assigned to the class it actually belongs to, and

TN (True Negative): an instance is not assigned to a class it actually does not belong to.
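A short sketch of how these measures can be computed with scikit-learn, reusing the fitted classifiers from the sketch in Sect. 3.2 (the dictionary name classifiers is our earlier assumption):

```python
from sklearn.metrics import classification_report, confusion_matrix

pred = classifiers["Decision Tree"].predict(X_test_tfidf)
# Rows are true labels, columns are predicted labels (in sorted label order)
print(confusion_matrix(y_test, pred))
# Per-class precision, recall, and F1-score, plus overall accuracy
print(classification_report(y_test, pred))
```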

4 Implementation and Results

Figures 3, 4, 5, 6 and 7 show the confusion matrices obtained after applying the five supervised machine learning algorithms. As discussed in the methodology, each confusion matrix contains the four values TP (True Positive), FP (False Positive), TN (True Negative), and FN (False Negative). In our experiment, TP means that news which was actually fake was also predicted as fake; FP means that news which was actually real was predicted as fake; FN means that news which was actually fake was predicted as real; and TN means that news which was actually real was also predicted as real. Each confusion matrix shows the real and fake counts by true label and predicted label. We implemented five supervised machine learning algorithms: logistic regression (LR), decision tree (DT), Naive Bayes (NB), Random Forest (RF), and support vector machine (SVM).
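A minimal sketch of how one such matrix (here for DT) can be plotted, again reusing the hypothetical classifiers dictionary from Sect. 3.2:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Predicts on the test set and renders the 2x2 matrix of true vs. predicted labels
ConfusionMatrixDisplay.from_estimator(
    classifiers["Decision Tree"], X_test_tfidf, y_test
)
plt.title("Confusion matrix using DT")
plt.show()
```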

Fig. 3 Confusion matrix using LR

Fig. 4 Confusion matrix using DT

Fig. 5 Confusion matrix using NB

Fig. 6 Confusion matrix using RF

Fig. 7 Confusion matrix using support vector machine

Figures 8 and 9 show the word clouds obtained from the real and fake news sets, respectively, highlighting the significant words in the dataset.

Fig. 8 Word cloud for real news

Fig. 9 Word cloud for fake news

Table 1 indicates the accuracy (%) and the performance metrics of all five classifiers, i.e., precision, recall, and F1-score.

Table 1 Accuracy and performance measures

As described in the methodology, we implemented all five ML classification algorithms and calculated the accuracy and performance measures. From Table 1, we can see that the decision tree performs best, with an accuracy of 99.59% and all performance metrics (precision, recall, and F1-score) at 1.00. After the decision tree, SVM performs with only a negligible difference in accuracy. Naïve Bayes has the lowest accuracy, 94.99%, with all performance metrics (precision, recall, and F1-score) at 0.95.

The graph in Fig. 10 compares all the models used in this experiment, showing that the decision tree has the highest accuracy (99.59%) and the Naïve Bayes classifier the lowest (94.99%).

Fig. 10 Plot for comparison of all the classifiers used

In our experiment, the decision tree performs best. Since the data is categorical, i.e., each news item is either fake or real, this algorithm performs better in such cases than the other supervised ML algorithms. By comparison, NB is a generative model while DT is a discriminative model; and SVM solves nonlinear problems using kernel methods, whereas decision trees handle the problem by deriving hyper-rectangles in the input space.

Hence, in our study, the decision tree performs best, with an accuracy of 99.59%.

5 Conclusion and Future Scope

Fake news has a huge influence on our social life, as well as on other domains such as politics and education. It can create significant social and societal harm and have potentially dangerous consequences. With the growing scale of social media, it is becoming more difficult for citizens and consumers to obtain information that is precise, error free, and reliable. It is critical to discover such false information early in order to avoid the global harm it can do. In this paper, we therefore designed a methodology for detecting false news that combines NLP techniques with supervised classification algorithms, presenting a machine learning approach that uses various classifiers to detect fake news. After comparing the performance of each model, we conclude that the decision tree outperforms the other algorithms, with an accuracy of 99.59%, followed closely by SVM. This approach could help identify fake news effectively and with higher accuracy in the future.

Future work could include comparing multiple deep learning approaches and newer ensemble learning methods against the classification techniques used in this study to determine the best strategy for detecting fake news. We may also integrate a larger dataset from different sources, such as various URLs and news publication sites, since its richer journalistic language could yield better results in a more generalized manner.