1 Introduction

In the contemporary world, fake news is considered one of the major threats to the economy, democratic institutions and journalism. Social network platforms provide a convenient channel for consumers to access, create and share diverse data (Monti et al. 2019). The usage of social media networks has increased, since numerous people seek and receive recent updates in a timely manner. On the other hand, social media gives users an opportunity to spread countless pieces of misleading and fake information, and such extensive spreading of fake information has deleterious consequences for the community (Zhou et al. 2020). First, the spreading of false information diminishes public faith in journalism and government departments; the fake information broadcast across the US during the 2016 presidential election was, ironically, spread more widely than accurate news. Second, fake information changes the way people react to legitimate and justifiable news; a survey demonstrated that public confidence in social media has degraded dramatically across various political parties and age categories. Third, unrestrained web-based fake information results in offline social incidents (Zervopoulos et al. 2020). Hence, it is necessary to restrict the spreading of fake information on mass media and thus promote confidence all over the world (Oliveira et al. 2020).

Wardle (2017), in her initial findings, classifies fake news into seven categories: parody or satire (the intention is not to harm, but it has the potential to fool), misleading content (misleading use of an individual or of information), manipulated content (genuine information is manipulated to deceive), false context (genuine content is shared with false contextual information), false connection (the headline and content do not support each other), impostor content (genuine sources are impersonated), and fabricated content (the content is one hundred percent false and designed to harm). In addition, the detection of fake news is obviously a major concern for journalists, news reporters and news industries, and the tools employed in detecting false news have turned out to be a dire requirement (Ruchansky et al. 2017). Investigating fake news manually is a challenging task. Hence, automatic detection of false information has attracted a great deal of attention in the natural language processing community, since it helps alleviate the time consumption and troublesome human endeavour of examining veracity. Even so, determining the reliability of news is a thorny issue for an autonomous system (Kaur et al. 2020). First, for better identification of false news, it is necessary to understand what other news organisations are reporting on the same topic, which is termed "stance detection". Stance detection has consistently been regarded as a significant basis for numerous other tasks, including online controversy examination and the determination of rumours on social media (Le et al. 2020).

Stances are categorized as agree, disagree, discuss and unrelated. This categorization is made in accordance with the level of agreement between a headline and the content allocated to it (Okoro et al. 2018). Furthermore, current approaches to autonomous fake news detection are generally categorized into propagation-based and content-based approaches. In the present scenario, the most common sources of information are media outlets. Individual sharing of news has grown significantly in recent years, and it has become difficult to distinguish news emanating from the dependable source where the original news is generated (Kaliyar et al. 2020). As a consequence, false news receives plenty of investigation on platforms such as Twitter, Google and Facebook. To handle the issue of spreading fake news, numerous statistical and machine learning approaches are employed. Statistical approaches model the relationships among numerous characteristics of the data and analyse patterns, whereas the classification of uncertain content uses machine learning approaches (Lara-Navarra et al. 2020; Sundararaj 2016, 2019; Sundararaj et al. 2018, 2020; Ravikumar and Kavitha 2020; Rejeesh and Thejaswini 2020; Kavitha and Ravikumar 2021; Hassan and Rashid 2020; Hassan 2020; Hassan et al. 2021; Haseena et al. 2014; Gowthul Alam and Baulkani 2019a, 2019b; Nirmal Kumar et al. 2020; Nisha and Madheswari 2016).

This paper proposes an approach comprising four different phases, namely the data pre-processing phase, the feature reduction and extraction phase, and the classification phase. For better reduction of high-dimensional features, PPCA is employed. The LSTM-LF model is employed in feature extraction and classification for optimal detection of fake news. The major contributions of the paper are as follows.

  • Utilizing four different phases, namely the data pre-processing phase, the feature reduction and extraction phase, and the classification phase, for fake news detection.

  • Employing PPCA for feature reduction and extraction, thus reducing high-dimensional features.

  • Proposing LSTM-LF for optimal detection of fake news with a high rate of accuracy.

  • Comparing our proposed approach with other existing approaches to evaluate the effectiveness of the system.

The rest of the paper is organized in the following manner. Prior literature regarding the detection of fake news is discussed in Sect. 2. In Sect. 3, the four phases, namely the data pre-processing phase, feature reduction phase, feature extraction phase and classification phase, for optimal fake news detection are described. The performance analysis and the comparative results of our proposed approach are discussed in Sect. 4. Section 5 concludes the article.

2 Review of related works

In the past few years, numerous researchers have proposed various machine learning approaches and mining techniques to determine and detect the fake news that spreads through social media. To gain better knowledge regarding the detection of fake news, various research articles are summarized in the following section.

Cui et al. (2019) proposed an explainable fake news detection (dEFEND) system to determine and detect fake news. The performance measures employed in evaluating this approach were precision, F-measure, accuracy and mean average precision. GossipCop and PolitiFact were the two datasets used in this approach. The experimental analysis revealed that the detection performance was high, but this approach failed to consider the posts and explainable comments.

The early detection of fake news using a Structure-aware Multi-head Attention Network (SMAN) was demonstrated by Yuan et al. (2020). Three datasets, namely Weibo, Twitter15 and Twitter16, were employed to evaluate fake news detection. The performance measures involved in this approach were accuracy, precision, recall and F1-measure. Moreover, the accuracy and precision values were very high when compared with other approaches. The very high execution time of this approach is considered its major drawback.

Duan et al. (2020) developed an online incremental log keyword extraction technique by employing a multi-layer dynamic particle swarm optimization algorithm along with deep LSTM networks. RMSE, MAPE, MSE and MAE were the metrics employed in simulation with respect to four datasets, namely HDFS, Hadoop, Spark and OpenStack. The rates of robustness and accuracy were high, but this approach failed to integrate log keywords in the LSTM.

The detection of fake news in social media using supervised artificial intelligence algorithms was proposed by Ozbay and Alatas (2020). BuzzFeed, random political news and ISOT fake news were the three datasets employed in this approach. In addition, accuracy, precision, recall and F1-measure were the simulation parameters employed, and a very high rate of accuracy was obtained. However, this approach failed to integrate ensemble approaches to detect fake news.

Kesarwani et al. (2020) utilized a k-nearest neighbour classifier to detect fake news on social media. True label, accuracy and F1-measure were the performance metrics employed in this approach. The dataset used here was the BuzzFeed dataset. From the experimental analysis, the results revealed that the classification accuracy was high. The very complex implementation is considered the major drawback of this approach.

A hierarchical propagation network was proposed by Shu et al. (2020) to detect and determine fake news. Accuracy, precision, recall and F1-measure were the simulation measures evaluated for the GossipCop and PolitiFact datasets. The experimental analysis revealed that the robustness of the proposed approach was high, but unsupervised fake news detection was left unaddressed.

Wang et al. (2020) developed SemSeq4FD, which integrates global semantic relationships and local sequential order to enhance text representation for fake news detection. The paper utilized datasets in two languages, English and Chinese (LUN and SLN; Weibo and RCED). Average and maximum length, accuracy, precision and recall were the performance measures employed for evaluation. This approach recognized multi-view fake news, but its flexibility was poor.

A dual-stage transformer model for COVID-19 fake news detection and fact-checking was demonstrated by Vijjali et al. (2020). Accuracy, precision and MAP were evaluated for a COVID-19 dataset. The efficiency rate was very high, but poor overall effectiveness is the major disadvantage of this approach.

Zhang et al. (2020) proposed a BERT-based domain adaptation neural network for multi-modal fake news detection. The evaluation measures employed in this approach were accuracy, precision, recall and F1-measure, and the datasets utilized were Twitter and Weibo. The fake news detection was enhanced, but this approach failed to design a probabilistic model.

A multimodal variational autoencoder for fake news detection was developed by Khattar et al. (2019). Accuracy, precision, recall and F1-measure were the performance measures employed in this approach for the Twitter and Weibo datasets. During analysis, fake news was detected accurately, but this approach failed to model the propagation of the Twitter data. A summary of the existing literature is provided in Table 1.

Table 1 Review of prior literature works

3 Proposed methodology

A news item is said to be fake if its content is verified to be false. Let us assume \(y = \left\{ {y_{1} ,y_{2} ,...,y_{N} } \right\}\) indicates a dataset containing N news items. Every news item \(k \in [1,N]\) containing i data sources is represented by \(y_{k} = \left\{ {y_{1k} ,y_{2k} ,...,y_{ik} } \right\}\). Furthermore, a class label set is closely associated with the dataset y: every news item \(y_{k} \in y\) obtains a label from the label set \(L = \left\{ {l_{1} ,l_{2} ,...,l_{M} } \right\}\), where M denotes the total number of degrees of fakeness recognized. The block diagram of the proposed approach for fake news detection is represented in Fig. 1. The proposed approach comprises four different phases, namely the data pre-processing phase, the feature reduction and extraction phase, and the classification phase. During data pre-processing, the input data is pre-processed by employing tokenization, stop-word deletion and stemming. In the second phase, the features are reduced by employing PPCA to enhance accuracy and to reduce high-dimensional data. Then the extracted features are provided to the classification phase, where the LSTM-LF algorithm is utilized to classify the news as fake or real. A detailed description of each phase is given in the following sections.

Fig. 1
figure 1

Architecture of the PPCA and Levy Flight based LSTM fake news detection system

3.1 Data pre-processing phase

Data pre-processing, in other terms, is a data mining process capable of transforming unstructured, variable, inconsistent and incomplete data into a machine-understandable pattern. Numerous tasks, namely the conversion of normal text to lowercase, deletion of stop words, stemming and tokenization, are performed in the data pre-processing phase. The following section provides a detailed description of each task involved in the pre-processing phase.


(a) Tokenization


Tokenization is the process of dividing original texts into smaller segments referred to as tokens. Punctuation is also removed from the text data by means of tokenization. To remove number terms from a particular sentence, number filters are employed. The transformation of textual data to upper and lower case utilizes case converters. Finally, words containing fewer than a given number of characters are removed using N-char filters (Mullen et al. 2018).
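As an illustration, a minimal tokenization sketch in Python is given below, using NLTK's word_tokenize (an assumed toolkit; the paper does not name one). It combines the case converter, the punctuation and number filters, and the N-char filter described above.

```python
# A minimal sketch of the tokenization step; NLTK is an assumption,
# any tokenizer with the same filters would do.
from nltk.tokenize import word_tokenize  # pip install nltk; nltk.download('punkt')

def tokenize(text, min_chars=3):
    tokens = word_tokenize(text.lower())                # case converter
    tokens = [t for t in tokens if t.isalpha()]         # punctuation/number filters
    return [t for t in tokens if len(t) >= min_chars]   # N-char filter

print(tokenize("Breaking: 3 reasons THIS headline is fake!"))
# ['breaking', 'reasons', 'this', 'headline', 'fake']
```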


(b) Deletion of stop words


Stop words are insignificant and not essential words that are nevertheless used frequently to connect expressions and complete sentences. Stop words are quite common and occur in almost every sentence while carrying little information. There are approximately five hundred stop words in English; prepositions, conjunctions and pronouns account for most of them. Examples of stop words are what, on, am, under, that, when, and, against, by, a, above, an, once, too, where, any, again, the, etc. Therefore, by deleting the stop words, space and processing time are saved (Umer et al. 2020).
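A hedged sketch of stop-word deletion follows, using NLTK's English stop-word list as a stand-in for the roughly five hundred words mentioned above.

```python
# A minimal sketch of stop-word deletion; NLTK's list is an assumption,
# any fixed stop-word list works the same way.
from nltk.corpus import stopwords  # nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(['breaking', 'reasons', 'this', 'headline', 'fake']))
# ['breaking', 'reasons', 'headline', 'fake']
```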


(c) Stemming process


The main intention of the stemming process is to obtain the fundamental form of words, so that diverse words with identical meaning are mapped together. During this process, various grammatical forms (noun, adjective, verb, adverb, etc.) are converted to their source form. As an illustration, the words consultant, consulting, consultative, consultants and consult are all stemmed to the word "consult". Thus the reduction of words (i.e. stemming) to a regular fundamental form is considered an effective approach (Dharmendra and Suresh 2015).
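A brief illustration of stemming with NLTK's PorterStemmer (an assumption; the paper does not specify which stemmer is used) reproduces the "consult" example:

```python
# A minimal sketch of the stemming step.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]

print(stem(['consultant', 'consulting', 'consultative', 'consultants', 'consult']))
# ['consult', 'consult', 'consult', 'consult', 'consult']
```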

Thus the redundant characters and terms, namely the punctuation, stop words and numbers, are filtered out in the data pre-processing phase.

3.2 Feature reduction using PPCA

Following the data pre-processing phase, the feature reduction phase is employed to reduce the dimensionality of the features. High dimensionality is a major issue arising from the data pre-processing phase, and it is necessary to eliminate redundant and unrelated features to enhance accuracy. By reducing the features, the processing time is minimized, which results in enhanced performance. On the other hand, feature reduction has substantial consequences for the result of the textual classification. Hence, to reduce the dimensionality of the features and to enhance the rate of accuracy, this paper proposes probabilistic principal component analysis (PPCA), which is discussed in the following section.


Probabilistic principal component analysis (PPCA)


The mathematical formulation, respective definitions and derivations of probabilistic principal component analysis are discussed in the subsequent section. Let us assume \(Y_{J}\) is a latent variable with a standard normal distribution (Li et al. 2020). Thus,

$$ Y_{J} \sim n\,(0,\,I_{M} ) $$
(1)

In Eq. (1), the normal distribution function and the identity matrix are represented by n and \(I_{M}\) respectively. The projection residuals \(\omega_{J}\) are likewise normally distributed. Thus,

$$ \omega_{J} \sim \,n\,(0,\,\delta^{2} I_{M} ) $$
(2)

The expectations of the latent variables conditioned on the observations, in accordance with Eqs. (1) and (2), are formulated in the subsequent equations.

$$ E\left[ {Y_{J} } \right] = \left( {w^{T} w + \delta^{2} I_{M} } \right)^{ - 1} w^{T} (t_{J} - \phi );\quad {\text{where}}\;w = \left( {w_{1} ,w_{2} ,...,w_{K} } \right) $$
(3)
$$ E\left[ {Y_{J} Y_{J}^{T} } \right] = \delta^{2} \left( {w^{T} w + \delta^{2} I_{M} } \right)^{ - 1} + E\left[ {Y_{J} } \right]\,E\left[ {Y_{J} } \right]^{T} $$
(4)

In the above equations, the expectations are taken with respect to the conditional distribution \(P\left[ {Y_{J} |t_{J} ,\,w,\,\delta^{2} } \right]\). The sample mean and the transpose operator are denoted by \(\phi\) and \(T\) respectively. The log-likelihood is then obtained by summing over all observations. Hence,

$$ L_{L} = \sum\limits_{J = 1}^{N} {\ln P\left[ {Y_{J} ,t_{J} } \right]} $$
(5)

The model parameters of the PPCA are obtained by employing the EM algorithm (Liu et al. 2020). Thus, by utilizing PPCA, the dimensionality of the features is reduced and the processing time is minimized. Feature selection is considered one of the most effective techniques for reducing high-dimensional data, and the classification process is enhanced by it. In addition, it can eliminate noisy, inappropriate and irrelevant data. It also selects a representative subset of all the data to reduce complications during the classification process. Thus, the elimination of unwanted features minimizes the computation time, thereby attaining high performance with an enhanced accuracy rate (Menaga and Revathi 2020).
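To make the EM-fitted PPCA concrete, the following numpy sketch implements Eqs. (1)-(5); the variable names W, sigma2 and mu stand in for w, \(\delta^{2}\) and \(\phi\), and the routine is an illustrative sketch, not the authors' implementation.

```python
# A minimal sketch of PPCA fitted with EM, assuming rows of t are samples.
import numpy as np

def ppca_em(t, M, n_iter=100, seed=0):
    """Reduce D-dimensional rows of t (N x D) to M latent dimensions."""
    rng = np.random.default_rng(seed)
    N, D = t.shape
    mu = t.mean(axis=0)                      # sample mean (phi)
    Tc = t - mu
    W = rng.standard_normal((D, M))          # projection matrix w
    sigma2 = 1.0                             # residual variance delta^2
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables, Eqs. (3)-(4)
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Tc @ W @ Minv                   # E[Y_J], one row per sample
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez
        # M-step: re-estimate W and sigma2 to increase the likelihood, Eq. (5)
        W = Tc.T @ Ez @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Tc**2) - np.sum((Tc @ W) * Ez)) / (N * D)
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
    return Tc @ W @ Minv                     # reduced features

X_reduced = ppca_em(np.random.rand(200, 50), M=10)
print(X_reduced.shape)  # (200, 10)
```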

3.3 Classification using LSTM-LF

The classification phase is one of the most significant processes involved in the detection of fake news. Fake news is detected by identifying whether the data displayed in an article is real: the bias of the written article is enumerated, and the interrelation between the headline and the body of the article is evaluated. For optimal classification of fake and real news, this paper proposes an LSTM-based LF approach, which is discussed as follows.


(a) Long Short Term Memory (LSTM)


The most prominent variant of the recurrent neural network (RNN) is the LSTM, which has accomplished successful outcomes in recent years. In the LSTM, the memory cell is the central part and consists of a gating system. This gating system is capable of judging whether information is beneficial or not. In general, every LSTM cell is composed of three major gates: the input gate, the forget gate and the output gate, as shown in Fig. 2. To recognize long-term dependencies, the LSTM utilizes a separate cell state through which the current input value is updated (Duan et al. 2020).

Fig. 2
figure 2

LSTM Architecture

The numerical expressions for the three gates are derived as follows.

$$ i(g) = \sigma \left( {w_{i} \cdot \left( {y_{T - 1} ,\,h_{T} } \right) + b_{i} } \right) $$
(6)
$$ f(g) = \sigma \left( {w_{f} \cdot \left( {y_{T - 1} ,\,h_{T} } \right) + b_{f} } \right) $$
(7)
$$ o(g) = \sigma \left( {w_{o} \cdot \left( {y_{T - 1} ,\,h_{T} } \right) + b_{o} } \right) $$
(8)

In the above equations, the input, forget and output gates are represented by \(i(g)\), \(f(g)\) and \(o(g)\) respectively. The sigmoid activation function is represented by \(\sigma\). The bias and weight terms of the three respective gates are represented by \(b_{i}\), \(b_{f}\), \(b_{o}\) and \(w_{i}\), \(w_{f}\), \(w_{o}\). The previous hidden state and the current input are represented by \(y_{T - 1}\) and \(h_{T}\). In addition, the equations for the cell state and hidden state are derived as follows.

$$ c_{i} = {\text{Tan}}H\left( {w_{c} \cdot \left( {y_{T - 1} ,\,h_{T} } \right) + b_{c} } \right) $$
(9)
$$ c_{T} = f(g) \circ c_{T - 1} + i(g) \circ c_{i} $$
(10)
$$ y_{T} = o(g) \circ {\text{Tan}}H(c_{T} ) $$
(11)

In Eqs. (9), (10) and (11), the hyperbolic tangent activation function and the weight and bias terms of the cell state are denoted by \({\text{Tan}}H\), \(w_{c}\) and \(b_{c}\) respectively; \(c_{i}\) is the candidate cell state and \(c_{T}\) the updated cell state.
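The following numpy sketch runs one LSTM cell step exactly as in Eqs. (6)-(11); the weights are random placeholders, purely for illustration.

```python
# A minimal sketch of a single LSTM cell step, Eqs. (6)-(11).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_t, y_prev, c_prev, W, b):
    """One time step: h_t is the input, y_prev/c_prev the previous states."""
    z = np.concatenate([y_prev, h_t])         # (y_{T-1}, h_T) concatenated
    i = sigmoid(W['i'] @ z + b['i'])          # input gate,   Eq. (6)
    f = sigmoid(W['f'] @ z + b['f'])          # forget gate,  Eq. (7)
    o = sigmoid(W['o'] @ z + b['o'])          # output gate,  Eq. (8)
    c_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate cell, Eq. (9)
    c = f * c_prev + i * c_tilde              # cell state,   Eq. (10)
    y = o * np.tanh(c)                        # hidden state, Eq. (11)
    return y, c

rng = np.random.default_rng(0)
d_in, d_hid = 8, 4
W = {k: rng.standard_normal((d_hid, d_hid + d_in)) for k in 'ifoc'}
b = {k: np.zeros(d_hid) for k in 'ifoc'}
y, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid), W, b)
print(y.shape, c.shape)  # (4,) (4,)
```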


(b) Levy Flight (LF) Distribution


In general, the LF distribution is inspired by numerous physical and natural phenomena in the environment. LF demonstrates enhanced performance in searching for resources, even under uncertain conditions. Living species such as monkeys, humans, fruit flies and spiders trail their paths in a Levy-flight style. The mathematical model involved in determining the Levy flight distribution is described in the subsequent section (Houssein et al. 2020).

To generate a random walk, it is necessary to determine two different features: the step length and the direction. The step length is drawn from the Levy distribution, while the direction moves towards the target. The step length, in accordance with the Mantegna algorithm, is determined in Eq. (12).

$$ SL = \frac{A}{{\left| B \right|^{1/\delta } }};\quad {\text{where}}\;0 < \delta \le 2 $$
(12)

In the above equation, \(SL\) and \(\delta\) signify the step length and the Levy distribution index respectively.

In Eq. (12), A and B are drawn from zero-mean normal distributions:

$$ A\sim P(0,\sigma_{A}^{2} ),\,\,\,B\sim P(0,\sigma_{B}^{2} ) $$
(13)

The standard deviations of A and B are formulated in Eqs. (14) and (15).

$$ \sigma_{A} = \left\{ {\frac{{G(1 + \delta )\,\sin (\pi \delta /2)}}{{G\left[ {(1 + \delta )/2} \right]\,\delta \,2^{(\delta - 1)/2} }}} \right\}^{1/\delta } $$
(14)
$$ \sigma_{B} = 1 $$
(15)

Here, the gamma function of Z is defined in Eq. (16) as,

$$ G(Z) = \int\limits_{0}^{\infty } {T^{Z - 1} {\text{e}}^{ - T} {\text{d}}T} $$
(16)
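A small sketch of Mantegna's step-length generator, implementing Eqs. (12)-(16), is given below; delta is the Levy index \(\delta\).

```python
# A minimal sketch of Mantegna's algorithm for Levy-flight step lengths.
import numpy as np
from math import gamma, sin, pi

def levy_step(delta=1.5, size=1, rng=None):
    rng = rng or np.random.default_rng()
    sigma_A = (gamma(1 + delta) * sin(pi * delta / 2)
               / (gamma((1 + delta) / 2) * delta
                  * 2 ** ((delta - 1) / 2))) ** (1 / delta)   # Eq. (14)
    A = rng.normal(0, sigma_A, size)   # Eq. (13)
    B = rng.normal(0, 1, size)         # sigma_B = 1, Eq. (15)
    return A / np.abs(B) ** (1 / delta)  # Eq. (12)

print(levy_step(size=5))  # mostly small steps with occasional large jumps
```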

In addition to this, the LFD computes the Euclidean distance between two neighbouring search agents. Thus,

$$ E_{D} (X_{k} ,X_{l} ) = \sqrt {(x_{k} - x_{l} )^{2} + (z_{l} - z_{k} )^{2} } $$
(17)

The position coordinates of \(X_{k}\) and \(X_{l}\) are \((x_{k} ,z_{k} )\) and \((x_{l} ,z_{l} )\) respectively. Then \(E_{D}\) is compared with a threshold for each agent until all search agents have been processed. The algorithm adjusts an agent's position if the corresponding distance is less than the threshold value. Hence,

$$ X_{l} (T + 1) = L_{F} (X_{k} (T),\,X_{Lead} ,\,up_{L} ,\,lo_{L} ) $$
(18)

In Eq. (18), the index of the current iteration is represented by T. The Levy flight function, operating in accordance with the step length and direction, is represented by \(L_{F}\); \(up_{L}\) and \(lo_{L}\) signify the upper and lower limits of the search space (Zhao et al. 2020). The position of the agent with the lowest number of neighbours is represented by \(X_{Lead}\), and \(X_{l}\) moves towards this position. Otherwise, the agent is repositioned randomly within the search space:

$$ X_{l} (T + 1) = lo_{L} + (up_{L} - lo_{L} )\,R\,(\,);\,\,{\text{where}}\,\,R \to \left[ {0,1} \right] $$
(19)

The random number r and the comparative scalar value \(C_{V}\) used to decide between these updates are given by,

$$ r = R\,(\,),\,\,\,C_{V} = 0.5 $$
(20)

The exploration capability and the performance of the algorithm are enhanced by varying the solutions. Therefore, the solution update equation is obtained as Eq. (21),

$$ X_{l} (T + 1) = B_{S} + \beta_{1} \times T_{FN} + R(\,) \times \beta_{2} \times \left( {(B_{S} + \beta_{3} X_{Lead} )/2 - X_{l} (T)} \right) $$
(21)

Then the new position is computed as,

$$ X_{l}^{New} (T + 1) = L_{F} \,(X_{l} (T + 1),\,B_{S} ,\,up_{L} ,\,lo_{L} ) $$
(22)

In Eqs. (21) and (22), \(\beta_{1}\), \(\beta_{2}\) and \(\beta_{3}\) represent random numbers where \(0 < \beta_{1} ,\beta_{2} ,\beta_{3} \le 10\). The best solution found so far and the total target fitness function are represented by \(B_{S}\) and \(T_{FN}\) respectively, where

$$ T_{FN} = \sum\limits_{T = 1}^{nn} {\frac{{F_{D} \times X_{T} }}{nn}} $$
(23)

In Eq. (23), \(nn\) and \(F_{D}\) signify the total number of neighbours and the fitness degree respectively. The neighbouring positions of \(X_{l} (T)\) are represented by \(X_{T}\). The fitness degree for every neighbouring solution is derived as follows.

$$ F_{D} = \frac{{\partial_{1} (B - Min(B))}}{Max(B) - Min(B)} + \partial_{2} $$
(24)

Here B is given by Eq. (25),

$$ B = \frac{{fit(X_{T} )}}{{fit(X_{l} (T))}},\quad {\text{where}}\;\partial_{1} > 0\;{\text{and}}\;\partial_{2} \le 1 $$
(25)

The iteration process is repeated until the algorithm attains its best optimal solution. Figure 3 describes the structure of the proposed LSTM-LF approach for fake news detection.
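To summarise the search procedure, the hedged sketch below strings the neighbour-based updates of Eqs. (17)-(20) into one routine; the crowding threshold, the 0.01 step scale and the fitness function are placeholder assumptions, and levy_step is the Mantegna generator sketched earlier.

```python
# A simplified sketch of one LFD position-update pass, Eqs. (17)-(20).
import numpy as np

def lfd_update(X, fitness, threshold=0.5, lo=0.0, up=1.0, rng=None):
    rng = rng or np.random.default_rng()
    n, d = X.shape
    fits = np.array([fitness(x) for x in X])
    X_lead = X[np.argmin(fits)]                     # leader position X_Lead
    X_new = X.copy()
    for l in range(n):
        dists = np.linalg.norm(X - X[l], axis=1)    # Euclidean distances, Eq. (17)
        dists[l] = np.inf
        if dists.min() < threshold:                 # crowded: Levy-flight move
            step = levy_step(delta=1.5, size=d, rng=rng)
            X_new[l] = X[l] + 0.01 * step * (X_lead - X[l])    # Eq. (18)
        elif rng.random() < 0.5:                    # C_V = 0.5, Eq. (20)
            X_new[l] = lo + (up - lo) * rng.random(d)          # Eq. (19)
        X_new[l] = np.clip(X_new[l], lo, up)        # keep within [lo_L, up_L]
    return X_new

X = lfd_update(np.random.rand(10, 3), fitness=lambda x: np.sum(x**2))
print(X.shape)  # (10, 3)
```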

Fig. 3
figure 3

Proposed LSTM-LF approach for fake news detection

4 Experiments and discussions

This section depicts the outcomes of the experimentation and simulation of the proposed approach for fake news detection. Various experimental analyses conducted on four different datasets, namely the BuzzFeed, GossipCop, ISOT and PolitiFact datasets, are discussed. Finally, comparative analyses against various approaches are presented to determine the effectiveness of the system.

4.1 Experimental configuration

The simulation experiments for the proposed fake news detection approach were implemented on the MATLAB 2016a platform, on a machine with an Intel® Xeon® E5-2690 v2 CPU running at 3 GHz, 20 GB RAM and an NVIDIA GPU.

4.2 Parameter specifications

The parameters of the LSTM and Levy flight distribution algorithms are set as follows. The specifications and their respective value ranges are given in Table 2.

Table 2 Parameter specifications

4.3 Simulation metrics

The performance of the proposed approach is evaluated by employing various evaluation measures, namely accuracy, precision, specificity and recall. The mathematical expressions for each respective measure, in terms of fake news detection, are given in the subsequent section.

$$ {\text{Accuracy}} = \frac{{\left| {T_{N} } \right| + \left| {T_{P} } \right|}}{{\left| {T_{N} } \right| + \left| {T_{P} } \right| + \left| {F_{N} } \right| + \left| {F_{P} } \right|}} $$
(26)
$$ {\text{Precision}} = \frac{{\left| {T_{P} } \right|}}{{\left| {T_{P} } \right| + \left| {F_{P} } \right|}} $$
(27)
$$ {\text{Specificity}} = \frac{{\left| {T_{N} } \right|}}{{\left| {T_{N} } \right| + \left| {F_{P} } \right|}} $$
(28)
$$ {\text{Recall}} = \frac{{\left| {T_{P} } \right|}}{{\left| {F_{N} } \right| + \left| {T_{P} } \right|}} $$
(29)

From Eqs. (26) to (29),

True positive \(\left| {T_{P} } \right|\): The prediction is true positive, if the predicted fake news is practically counterfeit.

True negative \(\left| {T_{N} } \right|\): The prediction is true negative, if the predicted real news is practically authentic.

False positive \(\left| {F_{P} } \right|\): The prediction is false positive, if the predicted fake news is practically authentic.

False negative \(\left| {F_{N} } \right|\): The prediction is false negative, if the predicted real news is practically counterfeit.
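For completeness, a one-function sketch computing Eqs. (26)-(29) from the four counts:

```python
# A small sketch of the simulation metrics, Eqs. (26)-(29).
def metrics(tp, tn, fp, fn):
    return {
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),   # Eq. (26)
        'precision':   tp / (tp + fp),                    # Eq. (27)
        'specificity': tn / (tn + fp),                    # Eq. (28)
        'recall':      tp / (tp + fn),                    # Eq. (29)
    }

print(metrics(tp=90, tn=95, fp=5, fn=10))
```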

4.4 Dataset description

This section utilizes four different datasets, namely the BuzzFeed, GossipCop, ISOT and PolitiFact datasets (Ozbay and Alatas 2020; Shu et al. 2020), for the detection of fake news. Here, 80% of each dataset is used for training and the remaining 20% is employed for validation. The training and testing dataset details regarding fake news detection are given in Table 3.

Table 3 Training and testing specifications
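A minimal sketch of the 80/20 split described above, using scikit-learn's train_test_split on placeholder data (the actual feature matrices would come from the PPCA phase):

```python
# An illustrative 80/20 train/validation split; the random arrays are
# placeholders for the reduced feature matrix and fake/real labels.
import numpy as np
from sklearn.model_selection import train_test_split

features = np.random.rand(100, 10)        # placeholder feature matrix
labels = np.random.randint(0, 2, 100)     # placeholder fake/real labels
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)
print(len(X_train), len(X_val))  # 80 20
```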

4.5 Dataset 1: BuzzFeed

BuzzFeed comprises two different news sets, namely fake news and real news. The BuzzFeed dataset was collected from false news articles regarding the 2016 US presidential election. The dataset comprises 1700 news articles collected from Facebook. Selected terms of the BuzzFeed dataset include political, nation, bill, party, country, democrat, etc.

4.6 Dataset 2: GossipCop

The GossipCop dataset consists of news content discussed by various specialized and proficient journalists who are experts in collecting temporal information and social content. The GossipCop dataset comprises 17,520 news items, among which 5500 are fake. The tweets, retweets and replies in this dataset number 1,060,000; 555,550 and 235,750 respectively.

4.7 Dataset 3: ISOT

The ISOT dataset consists of real and fake news acquired from various real-world sources. The dataset comprises 44,900 news items, among which 21,578 are real and the remaining 23,322 are fake.

4.8 Dataset 4: PolitiFact

Like GossipCop, the PolitiFact dataset consists of news content discussed by various specialized and proficient journalists who are experts in collecting temporal information and social content. The PolitiFact dataset comprises 700 news items, among which 450 are fake. The tweets, retweets and replies in this dataset number 278,075; 295,530 and 127,426 respectively. The detailed description of the testing and training data is given in Table 3.

4.9 Performance evaluations

This section depicts the evaluation of the performance of the proposed approach for fake news detection. In addition, graphical analyses of various performance measures, such as accuracy, precision, specificity and recall, are presented for the four datasets, namely the BuzzFeed, GossipCop, ISOT and PolitiFact datasets.

4.9.1 Confusion matrix

Table 4 describes the confusion matrix for fake news with respect to the four values, namely true positive, false positive, true negative and false negative. In the confusion matrix, each news sample is categorized as fake news or real news.

Table 4 Confusion matrix regarding fake news

4.9.2 Dataset analysis of various metrics

This section provides the results on the four datasets, namely the BuzzFeed, GossipCop, ISOT and PolitiFact datasets, for the four metrics accuracy, precision, specificity and recall. Figure 4a provides the graphical representation of the accuracy rate for the four datasets. The experimental analysis reveals that the accuracy rates obtained for the BuzzFeed, GossipCop, ISOT and PolitiFact datasets are 95%, 96%, 94% and 93% respectively.

Fig. 4
figure 4

Analysis of datasets for a accuracy, b specificity, c precision and d recall

In Fig. 4c, the graph is plotted between the precision rate and the respective datasets. The precision rates obtained for the BuzzFeed, GossipCop, ISOT and PolitiFact datasets are 91%, 94%, 89% and 90% respectively; the precision value of the ISOT dataset is comparatively low when compared with the other three datasets. Furthermore, the specificity results with respect to the four datasets are given in Fig. 4b; the specificity values obtained are 94%, 95%, 91% and 97% correspondingly, and again the value for ISOT is a bit lower than for the other datasets. Figure 4d depicts the graphical representation of the recall values for the BuzzFeed, GossipCop, ISOT and PolitiFact datasets; the recall values obtained are 89%, 88%, 87% and 90% respectively.

Figure 5 describes the overall performance for the four datasets, namely the BuzzFeed, GossipCop, ISOT and PolitiFact datasets. By employing our proposed approach, the overall performance rates obtained are 93%, 91%, 89% and 90% respectively. From the discussion, it is clear that the performance rates obtained for the BuzzFeed, GossipCop and PolitiFact datasets are almost similar, but the performance rate obtained for the ISOT dataset is comparatively low when compared with the other three datasets.

Fig. 5
figure 5

Performance rate analysis

4.9.3 Evaluation results

The evaluation results of the proposed approach with respect to four simulation metrics, namely accuracy, precision, specificity and recall, are plotted in Fig. 6. The overall accuracy of our proposed approach is 98.5%, and the specificity achieved is 97.3%. In the case of precision and recall, the rates obtained are 98% and 95% respectively. From the above analysis, it is noted that the proposed approach provides a better performance rate for all simulation metrics.

Fig. 6
figure 6

Overall performance analysis of the proposed approach

4.10 Comparative analysis

This section portrays the state-of-the-art comparison of various performance measures, namely accuracy, precision, specificity and recall, for fake news detection. Figure 7a-d provides the graphical analysis of the various simulation metrics, comparing our proposed approach with other approaches such as CNN-LSTM (Umer et al. 2020), CNN-Bidirectional LSTM (Kumar et al. 2020) and FNDNet (Kaliyar et al. 2020). The analysis reveals that the proposed approach provides accuracy, precision, specificity and recall rates of about 98.5%, 98%, 97.3% and 95% respectively. This shows that the proposed approach provides better performance when compared with other fake news detection approaches.

Fig. 7
figure 7

Comparative analysis for a accuracy, b specificity, c precision and d recall

5 Conclusion

Investigating fake news manually is a challenging task; hence, automatic detection of false information has attracted a great deal of attention in the natural language processing community, as it alleviates the time consumption and troublesome human endeavour of examining veracity. To address this shortcoming, this paper proposed an approach comprising four different phases, namely the data pre-processing phase, the feature reduction phase, the feature extraction phase and the classification phase. During data pre-processing, the input data is pre-processed by employing tokenization, stop-word deletion and stemming. In the second phase, the features are reduced by employing PPCA to enhance accuracy and to handle high-dimensional data. Then the extracted features are provided to the classification phase, where the LSTM-LF algorithm is utilized to classify the news as fake or real optimally. In addition, this paper utilized four different datasets, namely the BuzzFeed, GossipCop, ISOT and PolitiFact datasets, for evaluating our proposed approach. The evaluation of the proposed approach was conducted with respect to four simulation metrics: accuracy, precision, specificity and recall. Finally, our proposed approach was compared with other approaches such as CNN-LSTM, CNN-Bidirectional LSTM and FNDNet, and the analysis reveals that the proposed approach provides accuracy, precision, specificity and recall rates of about 98.5%, 98%, 97.3% and 95% respectively. In future work, various ensemble approaches could be integrated with optimization algorithms to boost performance and detect fake news more optimally.