Abstract
Fake news has become part of everyday life because it spreads quickly across social media. Fake news identification has therefore emerged as an important research subject. Current identification techniques rely primarily on natural language analysis and machine learning models to assess whether a news item is real or fake. Many traditional approaches, including machine learning applications, have been applied to fake news detection, but evolutionary algorithms have gained popularity because of their ability to converge to near-optimal solutions at low computational complexity. This motivated us to adopt a new approach based on a genetic algorithm (GA) for the fake news detection problem. In this paper, we present a comparative analysis of SVM, Naïve Bayes, Random Forest and Logistic Regression classifiers for detecting fake news on different datasets. The SVM classifier achieved the highest accuracy, with 61%, 97% and 96% on the LIAR, Fake Job Posting and Fake News datasets respectively. The same four classifiers are then used as fitness functions in our novel GA-based fake news detection algorithm. In the proposed algorithm, the SVM and LR classifiers both achieved 61% accuracy on the LIAR dataset, and SVM and RF attained the highest accuracy of 97% on the Fake Job Posting dataset.
1 Introduction
The term ‘fake news’ refers to news stories that are intentionally and verifiably false and are intended to manipulate people’s perception of real facts, events, and statements [5]. It is information presented as news that is known to be false because it is based on facts that are demonstrably incorrect or events that never happened. Fake news overlaps with misinformation and with disinformation, which is false information deliberately spread to mislead people [28]. One example occurred before the 2016 US presidential election in a series of events now infamously known as “Pizzagate” [9]. In a strange development in December 2016, a man who had read the fake news reports drove from North Carolina to Washington, DC and shot open a locked door at the real Comet Ping Pong pizzeria with his assault rifle as part of a misguided vigilante investigation [25].
There are many examples of fake news throughout history. J. Soll [38] describes a fake news story that struck Italy in 1475. The story concerned a \(2\frac {1}{2}\)-year-old child who had gone missing; a Franciscan preacher, Bernardino da Feltre, gave a series of sermons claiming that the Jewish community had murdered the child and drained his blood. The rumors spread fast. In the 1800s in the US, racist sentiment led to the publication of false stories about African Americans’ supposed deficiencies and crimes [34]. By the mid-nineteenth century, modern newspapers had come on the scene, promoting scoops and exposés but also fake stories to increase circulation. The New York Sun’s “Great Moon Hoax” of 1835 [17] claimed that there was an alien civilization on the moon, and established the Sun as a leading, profitable newspaper. M. Wendling [43] mentions that in mid-2016 Buzzfeed’s media editor, Craig Silverman, noticed a strange stream of completely made-up stories that seemed to originate from one small Eastern European town. He and a colleague started to investigate, and shortly before the US election they identified at least 140 fake news websites that were attracting huge numbers on Facebook. V. Goel et al. [16] also describe how false rumors on WhatsApp, such as stories about child kidnappers, led mobs to commit murder in India.
Existing fake news detection algorithms are limited by their computational complexity, and state-of-the-art algorithms also have difficulty coping with real-world networks. Machine learning, a set of approaches that learn from data or experience and that emerged from the study of artificial intelligence, has grown significantly over the previous decade and has changed a great deal in the last few years. This maturation has centred on reappropriating methodologies and promoting a statistical and probabilistic basis for approaches to fake news detection. In this paper, we first consider machine learning classifiers for detecting fake news in real-world datasets. Bio-inspired algorithms, on the other hand, are novel methods for developing new and resilient procedures based on the ideas of biological evolution, and bio-inspired optimization algorithms have gained popularity in machine learning for solving real-world problems in recent years. Recent developments in fake news detection call for such algorithms to handle the difficulties of complicated real-world problems. Evolutionary algorithms are heuristic search methods based on Darwinian evolution that seek global solutions to complicated optimization problems; when they are used, the chances of discovering a near-optimal solution early in the optimization process are quite high. The genetic algorithm (GA) is an evolutionary algorithm that can solve many complex problems. This motivated us to adopt a novel approach to fake news detection that is based on a bio-inspired algorithm and applies machine learning classifiers. This paper presents a novel GA-based approach to fake news detection in which four different machine learning classifiers are used as the fitness function of the proposed algorithm. The details of the approach are discussed in Section 4.
We have arranged this paper as follows. Related work is discussed in Section 2. Section 3 covers fake news detection using ML classifiers and has six subsections: dataset description; data vectorization and feature selection; the ML classifiers used in this paper; the predictive model; the confusion matrix, evaluation metrics and ROC curve; and result analysis. The workflow of the model is also depicted in Section 3. Section 4 elaborates our proposed GA-based approach along with the results obtained from the experiments. Section 5 gives the conclusion and future work.
2 Related work
In 2010, K. Lee et al. [29] presented three research challenges on social spammers. Their classification experiments used 10-fold cross-validation to improve the reliability of the classifier evaluations. Abu-Nimeh et al. [1] reported that a large-scale study of more than half a million Facebook posts suggests that members of online social networks can expect a significant chance of encountering spam posts. The problem of rumor detection in microblogs has also been addressed by exploring the effectiveness of three categories of features (content-based, network-based, and microblog-specific memes) for correctly identifying rumors [35]. In 2012, F. Yang et al. [46] focused on the problem of information credibility on Sina Weibo, China’s leading micro-blogging service provider. As mentioned in that paper, Sina Weibo is more of a Facebook-Twitter hybrid than a straight Twitter clone, with eight times more users than Twitter. The characteristics of rumors have also been identified by examining three aspects of diffusion: temporal, structural, and linguistic [27]. In 2013, A. Gupta et al. [20] highlighted the role of Twitter in spreading fake photos of Hurricane Sandy (2012), using classification models to distinguish fake images of the disaster from real ones.
M. Balmas [6] studied the relationship between viewing fake news and attitudes of inefficacy, alienation, and cynicism toward political candidates, using data collected in Israel during the 2006 election campaign. X. Hu et al. [23] analyzed the sentiment differences between spammers and normal users. Three Twitter datasets are used in that paper: the first two, TAMU Social Honeypots and Twitter Suspended Spammers, contain labels for social spammer detection, and the third, Stanford Twitter Sentiment, has sentiment labels. In 2015, F. M. Zahedi et al. [47] developed the Detection Tool Impact (DTI) theory and conceptualized a model to investigate how salient performance and cost-related elements of detection tools could affect users’ perceptions of the tools and of threats. N. K. Conroy et al. [11] surveyed the emerging state-of-the-art technologies that are instrumental in the adoption and development of fake news detection. Their paper presents a typology of veracity assessment methods arising from two main categories: linguistic cue approaches and network analysis approaches.
A. Chakraborty et al. [10] proposed a method to identify clickbait automatically and built a browser extension that warns readers of various media outlets about the likelihood of a headline being clickbait. M. Del Vicario et al. [13] examined the determinants governing the spread of misinformation through a thorough quantitative analysis, focusing in particular on how Facebook users consume information related to two distinct narratives: scientific and conspiracy news. F. Morstatter et al. [30] proposed a model that increases recall in bot detection and thus allows more bots to be removed. Two datasets, ‘Arab spring Libya’ and ‘Arabic honeypot data set’, were created to test the bot detection approaches. In 2017, K. Shu et al. [37] presented a comprehensive review of fake news detection on social media, including characterizations of fake news based on psychology and social theories. The review also describes how fake news spreads on traditional news media and covers feature extraction from news content and social context. L. Wu et al. [45] investigated whether knowledge learned from historical data could help identify newly emerging rumors. Three variants of the proposed method (pooling, elastic net and KM-SVM) are introduced to validate different aspects of Cross-topic Emerging Rumor Detection (CERT). It has also been shown that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes based on the users who “liked” them [40]. Two classification techniques are presented there, one based on logistic regression and the other on a novel adaptation of boolean crowd-sourcing algorithms. To provide an effective solution to the fake news identification problem, a novel approach that combines neural, statistical and external features has also been proposed [8].
A stance detection system described as “a simple but tough-to-beat baseline” has also been presented [36]; it claimed third place in Stage 1 of the Fake News Challenge. The stance label assigned is one of ‘agree’, ‘disagree’, ‘discuss’ and ‘unrelated’. M. Aldwairi et al. [3] described a solution that users can employ to detect and filter out sites containing false and misleading information. The proposed solution includes a tool that can identify and remove fake websites from the search engine or social media news feed results presented to a user. There are also two forms of rumors circulating on social media: long-standing rumors that circulate for long periods of time, and new rumors that arise during fast-paced events such as breaking news, where stories are published piecemeal and sometimes with an unverified status in their early stages [50].
Deep learning architectures have also been applied to fake news detection [41], where a neural network architecture is shown to accurately predict the stance between a given headline and article body. In 2019, G. Gravanis et al. [18] proposed a model for fake news detection using content-based features and machine learning (ML) algorithms. A comprehensive collection of earlier data sources was used to evaluate both the feature sets and the ML classifiers. In the same year, a survey paper [2] covered fake news, its importance, its overall impact on different areas, different ways to detect it on social media, and existing detection algorithms that can help to overcome the issue. S. B. Parikh et al. [33] described the shift in news consumption from stand-alone websites to popular social media sites. Their paper presents three hypotheses derived from analyses of (a) media outlets that publish fake news (origin), (b) social media users who post or share fake news (proliferation), and (c) the linguistic tone in which fake news is written. P. Bharadwaj et al. [7] also aimed to detect fake news in online posts using semantic features and different machine learning techniques. In 2020, M. Aldwairi et al. [4] created graphical user interface software that allows the end user to examine a URL before visiting the website. J. Zhang et al. [48] analyzed the concepts, methodologies and algorithms for identifying fake news posts, authors and subjects in online social networks and for assessing the resulting performance. Finally, Twitter users have been classified by their role identities: a coarse-grained public figure dataset was first collected automatically, and a more fine-grained identity dataset was then labeled manually [24].
3 Fake news detection using ML classifiers
Figure 1 depicts the workflow of the model in which we apply the ML classifiers to detect fake news. Our proposed algorithm follows the same procedure, except that a genetic algorithm is used as the predictive model and the same ML classifiers serve as its fitness function. Each module is explained in the subsections below.
3.1 Dataset description
- Liar: This dataset [42] was collected from the fact-checking website PolitiFact through its API. It contains 12,836 short statements labeled by humans, sampled from different contexts such as press releases, TV or radio interviews, and campaign speeches. The truthfulness labels are fine-grained classes: pants-fire, false, barely true, half true, mostly true, and true.
- Fake Job Posting: This dataset [14] contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to build classification models that learn which job descriptions are fraudulent.
- Fake News: This dataset is taken from Kaggle and contains the title and text of around 20K news articles. The attributes in the dataset are id, title, author, text and label. Label ‘1’ indicates unreliable and ‘0’ indicates reliable.
3.2 Data vectorization and feature selection
- Data Vectorization: After the dataset is loaded, unnecessary columns are dropped and the data is vectorized. Vectorization involves four main steps: splitting the dataset into training and testing sets, handling missing values, handling categorical features, and normalizing the dataset. Pre-processing of the input data is done with the scikit-learn package in Python.
- Feature Selection: TF-IDF is used in this paper to select the features. Term Frequency (TF) measures how frequently a term appears in a document. Since documents differ in length, a word is likely to occur many more times in a long document than in a shorter one, so the count is normalized by the document length. Given a document \(d\) with a set of terms \(T = \{t_{1}, t_{2}, \dots, t_{M}\}\) and document length \(N\) (the total number of occurrences of all terms), suppose term \(t_{i}\) appears \(x_{i}\) times; then the TF of \(t_{i}\) is

$$ TF(t_{i},d)=\frac{x_{i}}{N} \tag{1} $$

The vector \([TF(t_{1},d), TF(t_{2},d), \dots, TF(t_{M},d)]\), \(i \in [1, M]\), is a semantic representation of the document.

Inverse Document Frequency (IDF), on the other hand, measures how relevant a word is: it denotes a term’s prominence across documents. Given a set of documents \(D = \{d_{1}, d_{2}, \dots, d_{K}\}\) as the subjects of interest, the TF of term \(t_{i}\) is calculated for each document; suppose \(C_{i}\) denotes the number of documents in which \(x_{i} \neq 0\); then

$$ IDF(t_{i},D)=\frac{K}{C_{i}} \tag{2} $$

- Calculation of TF-IDF: TF and IDF are logarithmically scaled:

$$ TF(t_{i},d_{j})=\log\frac{x_{i}}{N} \tag{3} $$

$$ IDF(t_{i},D)=\log\frac{K}{C_{i}} \tag{4} $$

where \(i \in [1, M]\), \(j \in [1, K]\), and \(x_{i}\) and \(N\) are taken from document \(d_{j}\). TF-IDF is then the product of TF and IDF:

$$ TF\text{-}IDF(t_{i},d_{j})=TF(t_{i},d_{j})\times IDF(t_{i},D) \tag{5} $$
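As an illustration of Eqs. (1), (4) and (5), the following minimal Python sketch computes TF-IDF weights for a toy corpus. It is not the paper's implementation: it pairs the raw term frequency of Eq. (1) with the log-scaled IDF of Eq. (4), a common variant chosen here as an assumption, and the function names and the three-document corpus are hypothetical.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Eq. (1): x_i / N for every term of one tokenized, non-empty document."""
    counts = Counter(tokens)
    total = len(tokens)  # N: total occurrences of all terms
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequency(corpus):
    """Eq. (4): log(K / C_i), where C_i is the number of documents containing t_i."""
    k = len(corpus)  # K: number of documents
    doc_counts = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(k / c) for term, c in doc_counts.items()}

def tf_idf(corpus):
    """Eq. (5): product of TF and IDF, one dictionary of weights per document."""
    idf = inverse_document_frequency(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
        for doc in corpus
    ]

# Hypothetical toy corpus, already tokenized.
corpus = [
    "breaking news shocking claim".split(),
    "official news report".split(),
    "shocking shocking offer".split(),
]
weights = tf_idf(corpus)
```

A term that occurs in every document gets an IDF of log(K/K) = 0 and so contributes nothing to the representation. Library implementations such as scikit-learn's `TfidfVectorizer` apply additional smoothing and normalization by default, so their values differ from this sketch.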
3.3 Machine learning classifiers
- Naïve Bayes: Naïve Bayes classifiers [31] are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features. They are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the costly iterative approximation used for many other types of classifiers. The formula for the Naïve Bayes classifier is:

$$ P(A|B)=\frac{P(A)\,P(B|A)}{P(B)} \tag{6} $$

where A and B are two events. The Naïve Bayes classifier treats each semantic feature as a condition and assigns each sample to the class with the highest probability. The NB classifier in our model counts the number of times a word occurs in the ‘Statement’ field of the LIAR dataset, the ‘description’ field of the Fake Job Posting dataset, and the ‘text’ field of the Fake News dataset, given that the news is fake. It then converts the counts to probabilities and calculates the odds of the item being fake against its being true.
- SVM: The support vector machine (SVM) is a supervised machine learning algorithm [19] that can be used for both classification and regression, though it is most often used for classification. SVMs are founded on the idea of finding a hyperplane that best divides a dataset into two classes. Support vectors are the data points closest to the hyperplane; removing them would change the position of the dividing hyperplane. The distance between the hyperplane and the nearest data point from either class is known as the margin. In our model, the goal is to choose the hyperplane with the greatest possible margin to any point in the training set, which gives a higher probability of correctly classifying news, for example a ‘Statement’ in the LIAR dataset, as ‘true’ or ‘fake’. We use the Radial Basis Function (RBF) kernel in our model.
- Logistic Regression: Logistic regression [26] is a classification algorithm used to assign observations to a discrete set of classes. The classifier transforms its output using the logistic sigmoid function to return a probability, which is then mapped to two or more discrete classes (‘true’ or ‘fake’ in the case of fake news detection). It is built on a linear function \(f(x) = b_{0} + b_{1}x_{1} + \dots + b_{r}x_{r}\), also termed the logit. The variables \(b_{0}, b_{1}, \dots, b_{r}\) are the estimators of the regression coefficients, also known as the predicted weights. The probability is defined as \( p(x) = \frac {1} {1 + \exp(-f(x))}\). A point on or above the hyperplane \(f(x) = 0\) is classified as class +1, and a point below the hyperplane is classified as class -1. In our model, the dependent variable is a binary variable coded as 1 (‘true’) or 0 (‘fake’).
- Random Forest: The random forest classifier [39] is an ensemble method that builds a multitude of decision trees and combines them to increase accuracy. To detect fake news in the testing data, we tune parameters such as max depth, min samples split, n estimators, and random state, where max depth is the maximum depth of a decision tree, min samples split is the minimum number of samples required to split an internal node, and n estimators is the number of decision trees in the forest. When the algorithm is applied as a regression problem, the mean squared error (MSE) is calculated to distinguish true from false news. MSE is defined as \(MSE=\frac {1}{N}{\sum }_{i=1}^{N}(f_{i}-y_{i})^{2}\), where \(N\) is the number of data points, \(f_{i}\) is the value returned by the model and \(y_{i}\) is the actual value for data point \(i\).
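The four classifiers described above can be instantiated with scikit-learn roughly as follows. This is a sketch under assumptions: the text states only that the SVM uses an RBF kernel and that max depth, min samples split, n estimators and random state are set for the random forest. The multinomial Naïve Bayes variant, the concrete parameter values, and the helper name `build_classifiers` are illustrative choices, not the authors' settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def build_classifiers(random_state=42):
    """Return the four classifier types named in the text, keyed by abbreviation."""
    return {
        # Multinomial NB suits non-negative word-count or TF-IDF features (assumed variant).
        "NB": MultinomialNB(),
        # RBF kernel as stated in the text; other SVC settings are library defaults.
        "SVM": SVC(kernel="rbf", random_state=random_state),
        # max_iter raised so the solver has room to converge on sparse text features.
        "LR": LogisticRegression(max_iter=1000),
        # The parameters named in the text; the values are placeholders.
        "RF": RandomForestClassifier(
            n_estimators=100,
            max_depth=None,
            min_samples_split=2,
            random_state=random_state,
        ),
    }

classifiers = build_classifiers()
```

Each entry exposes the same `fit` and `predict` interface, so the classifiers can be swapped without changing the surrounding training code.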
3.4 Predictive model
In this stage, the ML classifiers mentioned above are fitted to the training dataset and then run on the testing dataset to check the accuracy of the model. Here, 33% of the data is held out as the testing dataset. Once the model has produced its final predictions, the confusion matrix is created and the accuracy is obtained.
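The fit-then-test procedure with a 33% hold-out can be sketched with scikit-learn as below. The twelve sentences and their labels are invented toy data, logistic regression stands in for any of the four classifiers, and the stratified split and fixed random seed are assumptions made so that the small example is reproducible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = fake, 0 = real.
texts = [
    "shocking miracle cure doctors hate",
    "you won a free prize click now",
    "secret plot exposed by anonymous insider",
    "celebrity clone spotted at secret base",
    "earn thousands daily with no effort",
    "aliens endorse candidate in leaked memo",
    "city council approves new budget",
    "central bank holds interest rates steady",
    "local school opens new library",
    "researchers publish study on crop yields",
    "transport agency announces road repairs",
    "parliament debates health funding bill",
]
labels = [1] * 6 + [0] * 6

# Hold out 33% of the data for testing, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
predictions = model.predict(X_test_vec)

accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions, labels=[0, 1])
```

Fitting the vectorizer on the training split alone keeps vocabulary and document-frequency statistics from the test split out of training, which would otherwise inflate the measured accuracy.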
3.5 Confusion matrix, evaluation metrics & ROC curve
- Confusion Matrix: Most existing approaches treat fake news detection as a classification problem that predicts whether a news article is fake or not, with the following outcomes:
  - True Positive (TP): news predicted as fake that is actually fake
  - True Negative (TN): news predicted as real that is actually real
  - False Negative (FN): news predicted as real that is actually fake
  - False Positive (FP): news predicted as fake that is actually real
- Evaluation Metrics: Based on TP, TN, FN and FP we measure:

$$ Precision=\frac{|TP|}{|TP|+|FP|} \tag{7} $$

$$ Recall=\frac{|TP|}{|TP|+|FN|} \tag{8} $$

$$ F1=2\cdot\frac{Precision\cdot Recall}{Precision+Recall} \tag{9} $$

$$ Accuracy=\frac{|TP|+|TN|}{|TP|+|TN|+|FP|+|FN|} \tag{10} $$

- ROC Curve: Finally, the Receiver Operating Characteristic (ROC) curve is drawn. This curve provides a way of comparing the performance of classifiers by looking at the False Positive Rate (FPR) and the True Positive Rate (TPR). To draw the ROC curve, we plot FPR on the x-axis and TPR on the y-axis. TPR (the same as Recall) and FPR are defined as follows:

$$ TPR=\frac{|TP|}{|TP|+|FN|} \tag{11} $$

$$ FPR=\frac{|FP|}{|FP|+|TN|} \tag{12} $$
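The measures in Eqs. (7) to (12) follow directly from the four confusion-matrix counts. The short Python sketch below evaluates them for one set of made-up counts; the function name and the numbers are hypothetical, and the code assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Eqs. (7)-(12) from confusion-matrix counts (fake news = positive class)."""
    precision = tp / (tp + fp)                          # Eq. (7)
    recall = tp / (tp + fn)                             # Eq. (8), identical to TPR in Eq. (11)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (9)
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (10)
    fpr = fp / (fp + tn)                                # Eq. (12)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "tpr": recall,
        "fpr": fpr,
    }

# Hypothetical counts for a test set of 100 items.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts, recall is 40/50 = 0.8 while accuracy is 85/100 = 0.85, which shows how a classifier can score differently on the two measures for the same predictions.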
3.6 Result analysis
We implemented the experiments in the Python programming language in a Jupyter notebook on a system running Windows 10 with 8 GB RAM. We used the LIAR dataset, the Fake Job Posting dataset and the Kaggle Fake News dataset to detect fake news with the ML classifiers.
Table 1 shows that SVM achieved the highest accuracy, 61%, on the LIAR dataset, and that the other classifiers performed close to SVM. SVM also obtained a recall of 79% in our experiment.
As Table 2 shows, both the SVM and Random Forest classifiers achieved 97% accuracy on the Fake Job Posting dataset. Naïve Bayes and Logistic Regression also performed well in detecting the fake jobs. Although the SVM, LR and RF classifiers all reach 100% recall, their accuracy varies because of the other measures.
On the Fake News dataset from Kaggle, SVM achieved the highest accuracy, 96%, followed by the LR and NB classifiers with 95%. SVM attained the highest precision, 96%, while the NB classifier had the lowest, 88%. SVM and RF shared the highest recall, 97%. On the roughly 20K titles and texts in the dataset, SVM was the best-performing ML classifier, as shown in Table 3.
We draw the ROC curve of each classification model from its TPR and FPR values. The yellow line in Fig. 2 plots TPR against FPR at various classification thresholds. As the classification threshold is lowered, more items are classified as positive, which increases both false positives and true positives. The AUC (‘Area Under the ROC Curve’) score is also calculated using the ‘roc_auc_score’ function in Python. AUC measures the entire two-dimensional area underneath the ROC curve.
On the LIAR dataset (Fig. 2), the AUC score is 0.574 for the NB and LR classifiers, 0.583 for the SVM classifier and 0.575 for the RF classifier.
On the FJP dataset (Fig. 3), the AUC score is 0.516 for the NB classifier, 0.651 for the SVM classifier, 0.528 for the LR classifier and 0.696 for the RF classifier.
On the Kaggle Fake News dataset (Fig. 4), the AUC score is 0.892 for the NB classifier, 0.948 for the LR classifier, 0.959 for the SVM classifier and 0.943 for the RF classifier.
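For reference, a minimal sketch of how `roc_auc_score` and the companion `roc_curve` function from scikit-learn are called is given below. The labels and scores are invented for illustration and do not come from any of the three datasets.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground truth (1 = fake) and classifier scores for eight items.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.6]

# One (FPR, TPR) point per threshold; these points trace the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve: the probability that a randomly chosen fake item
# receives a higher score than a randomly chosen real item.
auc = roc_auc_score(y_true, y_score)
```

An AUC of 0.5 corresponds to the diagonal of the ROC plot, that is, to scores that do not separate the classes at all, which is a useful baseline when reading the values reported above.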
4 A novel approach to fake news detection using genetic algorithm
4.1 Related work
The Genetic Algorithm (GA) is an artificial intelligence search method that uses the theory of evolution and natural selection and falls under the umbrella of evolutionary computing [21]. It is an effective tool for solving optimization problems. Holland [22] created the first GA in 1975, based on biological genetic and evolutionary theories, to solve optimization problems. GAs have since been a leading tool for providing solutions to many complex optimization problems [44]. A GA works by generating a population of individuals, each a candidate solution. The algorithm incorporates an evaluation (fitness) function, which is supplied by the programmer and depends on the type of problem. After the evaluation process, two individuals are selected according to their fitness values [32]. These two individuals reproduce, using the GA operators, to create one or more offspring. Each round of these processes is called a generation. The process continues until an optimal or near-optimal solution is found or certain termination conditions are met, which depends primarily on choices made by the programmer [49]. The efficacy of GAs depends on the choice of control parameters (population size, crossover and mutation), which interact in a complex manner [15]. Several researchers have studied the effect of the crossover and mutation operators on GA performance, and whether that effect is due to both operators together or to each one used alone [12].
4.2 Motivation
The genetic algorithm is an adaptive optimization approach based on biological principles. The more varied the initial population, the broader the search in a GA. If a local minimum is found to be the best solution, it has had to compete across the whole space examined. The fitness function plays a vital role in a GA because it defines how good a solution is. In a GA the fitness value is calculated many times, so the fitness function needs to be sufficiently fast to evaluate. The genetic operator called crossover changes the makeup of a chromosome or chromosomes from one generation to the next: to generate offspring, two strings are randomly selected from the mating pool and crossed over. Mutation is the component of the GA that explores the search space. Mutation is an essential requirement for GA convergence, while crossover is not. Together, these measures and components make the algorithm well suited to our problem. This motivated us to adopt a novel GA-based algorithm to find an efficient solution for fake news detection in social networks.
4.3 Proposed Algorithm
In the proposed algorithm, each gene is represented as a string of 0’s and 1’s. First, the population is generated randomly based on the input data. In the LIAR dataset, ‘Statement’ is taken as the feature and the labels are ‘True’ or ‘False’. In the FJP dataset, ‘description’ is taken as the feature and ‘fraudulent’ as the label: a value of 0 in ‘fraudulent’ means the posting is true, and a value of 1 means it is false. A corpus is then formed with unnecessary stopwords removed. In our approach the population size is 200, and each chromosome covers 5000 unique features (parameters). Each individual in the population has a fitness value determined by the fitness function. We use ML classifiers, namely SVM, Naïve Bayes, Logistic Regression and Random Forest, as the fitness function; these fitness functions are what distinguish our approach from state-of-the-art algorithms. A higher fitness value indicates a higher-quality solution. Each pair of parents chosen from the mating pool produces two offspring. By repeatedly selecting and mating high-quality individuals, there is a higher chance of retaining the good properties of individuals and discarding the poor ones. Crossover and mutation are applied as variation operators to get the best solution out of the two parents; single-point crossover is used. The number of new parents is 100, the mutation rate is 3% and the total number of generations is 50. Each generation produces a new solution for detecting fake news, and after 50 generations the algorithm concludes with the best solution found. Here, 33% of the original data is used as the test dataset. Based on the accuracy achieved after 50 generations (the stopping criterion), the confusion matrix is generated and evaluation metrics such as precision, recall and F1 score are calculated. Finally, the AUC score is calculated and the ROC curve is generated for each dataset from the TPR and FPR values.
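The loop described above can be sketched in plain Python as follows. This is an interpretation rather than the authors' code. It assumes that a chromosome is a binary mask with one bit per feature, that the mating pool is formed by keeping the fittest individuals, that those parents are carried into the next generation, and that the 3% mutation rate applies per bit. The defaults mirror the stated settings (population 200, 100 parents, 50 generations). `toy_fitness` is a hypothetical stand-in: in the proposed method the fitness callable would train a classifier (SVM, NB, LR or RF) and return its accuracy.

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at one random cut point, giving two children."""
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

def run_ga(fitness, n_genes, pop_size=200, n_parents=100,
           mutation_rate=0.03, generations=50, seed=0):
    """Evolve binary chromosomes and return the fittest one after the last generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # Mating pool: the n_parents fittest individuals (assumed selection scheme).
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        offspring = []
        while len(offspring) < pop_size - n_parents:
            parent_a, parent_b = rng.sample(parents, 2)
            for child in single_point_crossover(parent_a, parent_b, rng):
                offspring.append(mutate(child, mutation_rate, rng))
        # Parents survive unchanged, so the best fitness never decreases.
        population = parents + offspring[:pop_size - n_parents]
    return max(population, key=fitness)

# Hypothetical stand-in fitness: fraction of bits that agree with a hidden target mask.
target = [1, 0] * 10

def toy_fitness(mask):
    return sum(gene == wanted for gene, wanted in zip(mask, target)) / len(target)

best = run_ga(toy_fitness, n_genes=20, pop_size=40, n_parents=20, generations=30)
```

Because the fitness is evaluated for every individual in every generation, a classifier-based fitness dominates the running time, which is why the demonstration uses small sizes and a cheap stand-in function.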
4.4 Result analysis
Our experiments were implemented in the Python programming language in a Jupyter notebook on a system running Windows 10 with 8 GB RAM. We used the LIAR and Fake Job Posting datasets for the experiments with our proposed method. The ML classifiers listed in the result table (Table 4) serve as the fitness function in our proposed GA-based algorithm.
Using our proposed approach, the SVM and LR classifiers achieved 61% accuracy on the LIAR dataset, followed by the NB and RF classifiers with 60%. The highest recall, 86%, was obtained with NB as the fitness function, while RF achieved the highest precision, 63%, followed by LR with 62%, SVM with 61% and NB with 60%. Although these results are quite similar to those obtained with the ML classifiers in the earlier section, the approach establishes a new way to detect fake news using a genetic algorithm.
SVM and RF together attained the highest accuracy, 97%, on the Fake Job Posting dataset with our approach, followed by the NB and LR classifiers with 95%. For each classifier on this dataset, the precision equals the accuracy. Recall was 100% for all the ML classifiers used as fitness function, and the F1 score was 98% for all of them (Table 5).
The ROC curves are drawn from the TPR and FPR values obtained from the confusion matrix. Each TPR vs. FPR point at a different classification threshold contributes to the yellow line in the figures, and the blue line is the diagonal from (0,0) to (1,1). An AUC score is also obtained for each model in our proposed approach.
On the LIAR dataset (Fig. 5), the AUC score in our proposed GA-based algorithm is 0.580 with LR as the fitness function, 0.576 with SVM, 0.566 with NB and 0.586 with RF. RF thus obtained the highest AUC score in our approach.
On the FJP dataset (Fig. 6), RF as the fitness function gives the highest AUC score, 0.688, followed by SVM as the fitness function with 0.655. NB and LR as fitness functions obtained AUC scores of 0.519 and 0.520 respectively.
5 Conclusion & future work
In this paper, we carried out a comparative study of machine learning classifiers for fake news detection, tested on different datasets. Among the ML classifiers, SVM achieved the highest accuracy, 61%, on the LIAR dataset, but the other classifiers performed close to it: Naïve Bayes and Random Forest both achieved 60% and Logistic Regression 59%. On the FJP dataset, SVM and RF both gave 97% accuracy, followed by LR and NB with 95% and 93% respectively. SVM, LR and RF, but not NB, also achieved 100% recall on the same dataset. We also tested the ML classifiers on the Kaggle Fake News dataset, where SVM again attained the highest accuracy, 96%, followed by LR and RF with 95%. We then presented a novel GA-based approach to fake news detection that uses ML classifiers as the fitness function. We found that the GA-based approach performed slightly better than the traditional machine learning applications. In the GA-based method, SVM and LR as fitness functions both obtained 61% accuracy on the LIAR dataset, whereas LR alone reached 59%. Likewise, on the FJP dataset, NB as the fitness function reached 95% accuracy in the proposed method, compared with 93% for NB alone. As future work, the parameters of the GA will be tuned for better performance. Here we considered 5000 unique features as the parameters, and this will be increased up to 10000 in future work. The population size, 200 here, will be increased to a larger set of 400-500. Similarly, the experiment will be conducted on a larger number of datasets to find an optimal solution for fake news detection in social networks.
References
Abu-Nimeh S, Chen T, Alzubi O (2011) Malicious and spam posts in online social networks. Computer 44(9):23–28
Ahmed S, Hinkelmann K, Corradini F (2019) Combining machine learning with knowledge engineering to detect fake news in social networks-a survey. In: Proceedings of the AAAI 2019 Spring Symposium, vol 12
Aldwairi M, Alwahedi A (2018) Detecting fake news in social media networks. Procedia Computer Science 141:215–222
Aldwairi M, Hasan M, Balbahaith Z (2020) Detection of drive-by download attacks using machine learning approach. In: Cognitive analytics: concepts, methodologies, tools, and applications. IGI Global, pp 1598–1611
Allcott H, Gentzkow M (2017) Social media and fake news in the 2016 election. Journal of Economic Perspectives 31(2):211–36
Balmas M (2014) When fake news becomes real: Combined exposure to multiple news sources and political attitudes of inefficacy, alienation, and cynicism. Communication Research 41(3):430–454
Bharadwaj P, Shao Z (2019) Fake news detection with semantic features and text mining. International Journal on Natural Language Computing (IJNLC) vol 8
Bhatt G, Sharma A, Sharma S, Nagpal A, Raman B, Mittal A (2017) On the benefit of combining neural, statistical and external features for fake news identification. arXiv:1712.03935
Burgess L (2018) What pizzagate teaches us about literacy. PhD thesis
Chakraborty A, Paranjape B, Kakarla S, Ganguly N (2016) Stop clickbait: Detecting and preventing clickbaits in online news media. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM). IEEE, pp 9–16
Conroy NK, Rubin VL, Chen Y (2015) Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology 52(1):1–4
Deb K, Agrawal S (1998) Understanding interactions among genetic algorithm parameters. In: FOGA, pp 265–286
Del Vicario M, Bessi A, Zollo F, Petroni F, Scala A, Caldarelli G, Stanley HE, Quattrociocchi W (2016) The spreading of misinformation online. Proceedings of the National Academy of Sciences 113(3):554–559
Dutta S, Bandyopadhyay SK (2020) Fake job recruitment detection using machine learning approach. International Journal of Engineering Trends and Technology, 68
Eiben AE, Michalewicz Z, Schoenauer M, Smith JE (2007) Parameter control in evolutionary algorithms. In: Parameter setting in evolutionary algorithms. Springer, pp 19–46
Goel V, Raj S, Ravichandran P (2018) How whatsapp leads mobs to murder in india. The New York Times, 18
Gorbach J (2018) Not your grandpa’s hoax: a comparative history of fake news. Am J 35(2):236–249
Gravanis G, Vakali A, Diamantaras K, Karadais P (2019) Behind the cues: a benchmarking study for fake news detection. Expert Syst Appl 128:201–213
Gunn SR, et al. (1998) Support vector machines for classification and regression. ISIS Technical Report 14(1):5–16
Gupta A, Lamba H, Kumaraguru P, Joshi A (2013) Faking sandy: characterizing and identifying fake images on twitter during hurricane sandy. In: Proceedings of the 22nd international conference on World Wide Web, pp 729–736
Hassanat A, Almohammadi K, Alkafaween E, Abunawas E, Hammouri A, Prasath V (2019) Choosing mutation and crossover ratios for genetic algorithms—a review with a new dynamic approach. Information 10(12):390
Holland J (1975) Adaptation in natural and artificial systems: an introductory analysis with application to biology. Control and Artificial Intelligence
Hu X, Tang J, Gao H, Liu H (2014) Social spammer detection with sentiment information. In: 2014 IEEE International conference on data mining. IEEE, pp 180–189
Huang B, Carley KM (2020) Discover your social identity from what you tweet: a content based approach. arXiv:2003.01797
Klein D, Wueller J (2017) Fake news: a legal perspective. Journal of Internet Law (Apr. 2017)
Kleinbaum DG, Dietz K, Gail M, Klein M, Klein M (2002) Logistic regression. Springer, Berlin
Kwon S, Cha M, Jung K, Chen W, Wang Y (2013) Prominent features of rumor propagation in online social media. In: 2013 IEEE 13th international conference on data mining. IEEE, pp 1103–1108
Lazer DM, Baum MA, Benkler Y, Berinsky AJ, Greenhill KM, Menczer F, Metzger MJ, Nyhan B, Pennycook G, Rothschild D et al (2018) The science of fake news. Science 359(6380):1094–1096
Lee K, Caverlee J, Webb S (2010) Uncovering social spammers: social honeypots+ machine learning. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp 435–442
Morstatter F, Wu L, Nazer TH, Carley KM, Liu H (2016) A new approach to bot detection: striking the balance between precision and recall. In: 2016 IEEE/ACM International conference on advances in social networks analysis and mining (ASONAM). IEEE, pp 533–540
Murphy KP, et al. (2006) Naive bayes classifiers. University of British Columbia 18(60):1–8
Mustafa W (2003) Optimization of production systems using genetic algorithms. Int J Comput Intell Appl 3(03):233–248
Parikh SB, Patil V, Atrey PK (2019) On the origin, proliferation and tone of fake news. In: 2019 IEEE Conference on multimedia information processing and retrieval (MIPR). IEEE, pp 135–140
Posetti J, Matthews A (2018) A short guide to the history of ‘fake news’ and disinformation. International Center for Journalists, 7
Qazvinian V, Rosengren E, Radev D, Mei Q (2011) Rumor has it: Identifying misinformation in microblogs. In: Proceedings of the 2011 conference on empirical methods in natural language processing, pp 1589–1599
Riedel B, Augenstein I, Spithourakis GP, Riedel S (2017) A simple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv:1707.03264
Shu K, Sliva A, Wang S, Tang J, Liu H (2017) Fake news detection on social media: a data mining perspective. ACM SIGKDD Explorations Newsletter 19(1):22–36
Soll J (2016) The long and brutal history of fake news. Politico Magazine 18(12):2016
Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification and regression tool for compound classification and qsar modeling. Journal of Chemical Information and Computer Sciences 43(6):1947–1958
Tacchini E, Ballarin G, Della Vedova ML, Moret S, De Alfaro L (2017) Some like it hoax: Automated fake news detection in social networks. arXiv:1704.07506
Thota A, Tilak P, Ahluwalia S, Lohia N (2018) Fake news detection: a deep learning approach. SMU Data Science Review 1(3):10
Wang WY (2017) “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv:1705.00648
Wendling M (2018) The (almost) complete history of fake news. BBC News, 22
Whitley D (1994) A genetic algorithm tutorial. Statistics and Computing 4(2):65–85
Wu L, Li J, Hu X, Liu H (2017) Gleaning wisdom from the past: Early detection of emerging rumors in social media. In: Proceedings of the 2017 SIAM international conference on data mining. SIAM, pp 99–107
Yang F, Liu Y, Yu X, Yang M (2012) Automatic detection of rumor on sina weibo. In: Proceedings of the ACM SIGKDD workshop on mining data semantics, pp 1–7
Zahedi FM, Abbasi A, Chen Y (2015) Fake-website detection tools: Identifying elements that promote individuals’ use and enhance their performance. J Assoc Inf Syst 16(6):2
Zhang J, Dong B, Yu PS (2020) Fakedetector: Effective fake news detection with deep diffusive neural network. In: 2020 IEEE 36th international conference on data engineering (ICDE). IEEE, pp 1826–1829
Zhong J, Hu X, Zhang J, Gu M (2005) Comparison of performance between different selection strategies on simple genetic algorithms. In: International conference on computational intelligence for modelling, control and automation and international conference on intelligent agents, web technologies and internet commerce (CIMCA-IAWTIC’06), vol 2. IEEE, pp 1115–1121
Zubiaga A, Aker A, Bontcheva K, Liakata M, Procter R (2018) Detection and resolution of rumours in social media: a survey. ACM Computing Surveys (CSUR) 51(2):1–36
Cite this article
Choudhury, D., Acharjee, T. A novel approach to fake news detection in social networks using genetic algorithm applying machine learning classifiers. Multimed Tools Appl 82, 9029–9045 (2023). https://doi.org/10.1007/s11042-022-12788-1