1 Introduction

Millions of people use the internet daily and publish news content on social media platforms such as Twitter and Facebook. With so many online sources of information, it can be difficult to determine which content is based on facts and which is misleading. False information spread through digital platforms can have a powerful and far-reaching impact, leading others to accept it as fact. Fake news can also be used to provoke and exacerbate social conflict, affecting all areas of society. Its impact is particularly significant when it relates to the health of individuals, as during the COVID-19 pandemic, in which the virus affected almost 10 million people worldwide [1, 2]. Machine learning (ML) and deep learning (DL) methods can significantly aid in detecting fake news content on social media platforms. These methods have proven valuable in addressing various real-world challenges such as sentiment analysis [3, 4] and sarcasm detection [5]. They are trained to assign text to predefined labels, such as “positive” or “negative” in the case of sentiment analysis. Natural Language Processing (NLP) is a subfield of artificial intelligence concerned with enabling machines to understand and process human language. To interpret the meaning of a text, it is necessary to understand its context. Using domain knowledge to extract useful, meaningful, and high-quality features from the text can improve its representation and lead to more accurate models.

Researchers are trying to find the best ML classifier for detecting fake news. Model accuracy is essential because a model that fails to detect fake news can harm individuals [6]. The performance of these models depends mainly on the data preprocessing [7] and on the quality of the features used in the training phase [8]. It has been shown that incorporating feature engineering into ML classifiers can enhance their performance and increase their accuracy [9]. Thus, this work focuses on increasing the performance of traditional ML models, i.e., Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), and Gradient Boosting Decision Tree (GBDT), in detecting fake news by enriching their input features, extracted by Term Frequency-Inverse Document Frequency (TF-IDF), with more text-representative features. This approach is evaluated on the COVID-19 benchmark dataset and shows the impact of adding extra context features on the models’ overall performance. The main contributions of this paper are as follows:

  • Extracting eleven context features from each tweet and investigating and analyzing their impact on the performance of the ML models.

  • Evaluating four ML classifiers with each of the eleven extra features to identify fake news on the publicly available COVID-19 dataset.

  • Comparing the performance of this approach to the baseline models (without extra features).

The remainder of this paper is organized as follows: related work is explored in Sect. 2. The proposed work methodology is described in Sect. 3. The experimental setup is discussed in Sect. 4. Section 5 presents the results and discussion, and Sect. 6 gives the conclusion and future work, followed by the references.

2 Related work

The baseline study for this work [10] collected COVID-19 tweets from different online sources and applied ML classifiers such as SVM, LR, DT, and GBDT. The results revealed that SVM achieved superior results on the validation and test datasets. In [11], several ML and DL models were compared for automatically identifying disinformation about COVID-19. The experiments were conducted on two datasets, and the results were evaluated using various metrics. The results showed that traditional ML models performed better than DL models in predicting fake news. Specifically, both Random Forest (RF) and LR had superior results compared to other models. Additionally, Long Short-Term Memory (LSTM) performed better than Convolutional Neural Network (CNN). Similarly, [1] experimented with various ML and transformer models, including Naïve Bayes (NB) and SVM, as well as Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, and RoBERTa, with TF-IDF and Word2vec representation methods. They found that SVM performed best among the models when used with TF-IDF, whereas using Word2vec decreased the performance of the models. Additionally, the transformer models showed the best accuracy and f1-score results. In [12], a variety of traditional ML and DL classifiers, along with different feature extraction techniques such as TF-IDF with n-grams, were evaluated on four COVID-19 fake news datasets. The results demonstrated significant achievement by the baseline compared to the existing state of the art. Other works emphasized the need for feature engineering to address fake news detection efficiently, such as [9], which used five DL models with engineered features including emotion, term frequency, stop word count ratio, and average sentence length.
Also, [13] used ML classifiers with linguistic features, such as n-grams, readability, emotional tone, and punctuation, and found that linear SVM performed best, with an f1-score of 95.19% on the unseen set. In a different study [14], various experiments were conducted using ML and DL models to detect fake content. NLP features such as the number of mentions, hashtags, and tweet length were extracted from tweets and used as metadata in the models. The performance of the models was evaluated with each feature, and the approach was found to slightly improve results on the English dataset, reaching an f1-score of 0.93.

3 Proposed work methodology

Figure 1 presents the steps followed in conducting fake news detection, which are explained in detail in the subsections below:

Fig. 1 Proposed work methodology steps

3.1 Proposed work components

3.1.1 Dataset description

The COVID-19 dataset [10] consists of tweets regarding the COVID-19 pandemic. Each tweet has a label indicating whether it is real or fake. The dataset contains three CSV files (train, validation, and test) that include 6420, 2140, and 2140 samples, respectively.

3.1.2 Preprocessing

Here, NLP techniques are used to minimize noise by removing data that is irrelevant to fake news classification. The steps include lowercasing, removing URLs, replacing symbols and tags, and removing stop words [15]. This ensures that the data is properly prepared for feature extraction and further analysis.
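
The preprocessing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the exact pipeline of [10] or [15]: the function name `preprocess`, the regular expressions, the tiny stop-word list, and the choice to replace symbols and tags with spaces are all assumptions made for the example.

```python
import re

# Small illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_PATTERN = re.compile(r"[^a-z0-9\s]")


def preprocess(tweet: str) -> str:
    """Lowercase, drop URLs, replace symbols/tags with spaces, remove stop words."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)
    # Characters such as '#', '@', and punctuation become spaces (assumed behavior).
    text = NON_WORD_PATTERN.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    raw = "The #COVID19 vaccine is SAFE! Read more: https://example.org/info @WHO"
    print(preprocess(raw))  # covid19 vaccine safe read more who
```

Lowercasing is done first so that the URL and symbol patterns only need to handle lowercase input.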

3.1.3 Feature engineering and extraction

Feature engineering is considered the most essential part of text classification. The different types of features that can be engineered from the given dataset and used in the classification models are described in Table 1.

Table 1 Feature list description
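
As a rough illustration of how such context features can be derived from a single tweet, the sketch below computes four count-based features whose names appear in the results of this paper. The exact definitions used in Table 1 are not reproduced here, so the function `context_features` and the way each count is computed (whitespace tokenization, case-insensitive uniqueness) are assumptions. Sentiment-based features such as “polarity” and “subjectivity” require a sentiment library and are left out.

```python
def context_features(tweet: str) -> dict:
    """Compute simple count-based features from the raw (unprocessed) tweet text."""
    words = tweet.split()  # whitespace tokenization (an assumption)
    return {
        "char_count": len(tweet),
        "word_count": len(words),
        # Uniqueness is case-insensitive here (an assumption).
        "unique_word_count": len({w.lower() for w in words}),
        "capitalchar_count": sum(1 for ch in tweet if ch.isupper()),
    }


if __name__ == "__main__":
    print(context_features("Masks DO NOT cause harm, masks help"))
```

The features are computed on the raw text because lowercasing during preprocessing would otherwise erase the capital-letter count.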

Term frequency-inverse document frequency (TF-IDF) encodes text as numerical statistics that reflect how frequent each word or phrase is in a document relative to the whole collection of documents [16]. It is a text vectorizer that converts the provided text into a numerical vector. Each value in this vector is calculated with the following formula, which multiplies two terms, TF and IDF [17]:

$${w}_{i,j}={tf}_{i,j}\times \mathrm{log}\left(N/{df}_{i}\right)$$
(1)

where TF is the number of times a given term appears in a document divided by the total number of terms in that document, and IDF is the logarithm of the total number of documents N divided by the number of documents that contain the term. IDF assigns a higher weight to terms that are rare across the dataset, as in the following formulas [18, 19].

$${tf}_{i,j}={n}_{i,j}/{\sum }_{k}{n}_{k,j}$$
(2)
$${idf}_{i}=\mathrm{log}\left(N/{df}_{i}\right)$$
(3)
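
To make the weighting concrete, the following sketch computes TF-IDF weights by hand for a toy corpus, following Eqs. (1) to (3) as written (natural logarithm, no smoothing). The function name `tf_idf` and the toy documents are illustrative assumptions. Scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values will differ from this plain version.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # df_i: number of documents that contain term i.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)          # n_{i,j}: raw count of term i in document j
        total = sum(counts.values())   # sum over k of n_{k,j}
        weights.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [
        ["covid", "vaccine", "safe"],
        ["covid", "cure", "hoax"],
        ["covid", "vaccine", "hoax", "hoax"],
    ]
    for row in tf_idf(corpus):
        print({term: round(w, 3) for term, w in row.items()})
```

In this toy corpus the term "covid" occurs in every document, so its IDF is log(3/3) = 0 and it receives zero weight, while the rarer term "safe" receives the largest weight in the first document.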

3.1.4 Model building

Logistic regression (LR) is a probability-based predictive analytic algorithm that uses a statistical model based on the sigmoid, or logistic, function. Given a real-valued input, the S-shaped curve maps it to an output between 0 and 1. In the model, β0 is the bias or intercept term, and β1 is the coefficient for the independent variable [20].
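
The mapping performed by the logistic function can be sketched as follows for a single independent variable. The names `sigmoid`, `predict_proba`, `beta0`, and `beta1` are illustrative assumptions (the intercept and coefficient are written here as `beta0` and `beta1`), and the parameter values in the usage example are arbitrary rather than fitted.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real value to the interval [0, 1] in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def predict_proba(x: float, beta0: float, beta1: float) -> float:
    """Probability of the positive class for one feature value x."""
    return sigmoid(beta0 + beta1 * x)


if __name__ == "__main__":
    # Arbitrary example parameters: intercept -1.0, coefficient 0.5.
    for x in (0.0, 2.0, 6.0):
        print(x, round(predict_proba(x, beta0=-1.0, beta1=0.5), 3))
```

A class label is then typically obtained by thresholding the probability, for example at 0.5.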

Support vector machine (SVM) is a supervised ML method [21] that finds a hyperplane separating the samples of two classes with the largest possible margin. It works in an N-dimensional space, placing the hyperplane as far away from the closest data points as possible. It is suitable for both regression and classification tasks [22, 23].

Decision tree (DT) is a robust and popular supervised learning method because it is easy to understand [24] and implement. Like SVM, DT can be used for regression and classification tasks, and it works well with numeric and categorical data. It works by splitting the given dataset into smaller subsets according to certain criteria, and the tree is built incrementally. The leaf nodes of a decision tree represent the classification results [25].

Gradient boosting decision tree (GBDT) combines gradient boosting with decision trees, so that each new tree corrects the errors made by its predecessors. The primary function of gradient boosting is to reduce residuals, that is, to generate each decision tree in the direction of the negative gradient so as to minimize the final residuals. The fundamental principle of boosting theory is to continuously decrease the loss function as the model is built, meaning that the model is continually being optimized [26].

3.1.5 Performance evaluation

The objective of the performance evaluation step is to evaluate the performance of the generated models on unseen data. For this purpose, we used the accuracy, precision, recall, and f1-score metrics, which are calculated using the functions available in the Python Scikit-learn metrics module [3, 4, 27].
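
For reference, the four metrics can be written out directly from the counts of true and false positives and negatives. The sketch below is a plain-Python illustration for the binary case, treating “fake” as the positive class (an assumption); the function name `classification_metrics` is hypothetical. In practice, the Scikit-learn functions `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` provide the same quantities along with averaging options.

```python
def classification_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall, and f1-score for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "fake"]
    y_pred = ["fake", "fake", "real", "real", "real", "fake", "real", "fake"]
    print(classification_metrics(y_true, y_pred))
```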

3.2 Proposed work algorithm

figure a

4 Implementation

Initially, we collected the COVID-19 fake news dataset [10] from Kaggle and applied various feature engineering techniques to construct additional features that help ML models identify different patterns. We applied the MinMax scaler, a normalization technique from the Scikit-learn library, to all of the extracted features to ensure that they were in the same range of values, which makes the ML models more efficient. The text data was preprocessed to remove irrelevant words and characters, following the steps in [10], and the TF-IDF technique was used for feature extraction. The preprocessing and feature extraction steps discussed earlier were applied to both the training and validation data. TF-IDF is applied to each tweet to compute its vector, resulting in a matrix in which each tweet is represented as a vector with the same length as the vocabulary. We experimented with several ML models, including those described in the model building section. We first applied these models to the text data alone to form baseline results. Then, the additional features were added one at a time, and the impact of each on model performance was evaluated. To evaluate and compare the models, they were tested on a separate validation set to estimate how well they generalize to unseen data according to the chosen evaluation metrics.
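
The scaling and feature-enrichment step can be illustrated with a small plain-Python sketch. It mirrors what Scikit-learn's `MinMaxScaler` computes for one feature, then appends the scaled value to each TF-IDF vector. The helper names (`min_max_fit`, `min_max_transform`, `append_feature`), the toy numbers, and the choice to fit the scaler on the training split only are assumptions for illustration, not a description of the exact implementation used here.

```python
def min_max_fit(values):
    """Learn the minimum and maximum of one feature from the training split."""
    return min(values), max(values)


def min_max_transform(value, lo, hi):
    """Rescale a value so that the training range [lo, hi] maps to [0, 1]."""
    if hi == lo:  # constant feature: avoid division by zero
        return 0.0
    return (value - lo) / (hi - lo)


def append_feature(tfidf_rows, feature_values, lo, hi):
    """Return new rows: each TF-IDF vector followed by the scaled extra feature."""
    return [
        row + [min_max_transform(v, lo, hi)]
        for row, v in zip(tfidf_rows, feature_values)
    ]


if __name__ == "__main__":
    # Toy TF-IDF rows (vocabulary of size 3) and a toy "char count" feature.
    train_tfidf = [[0.0, 0.4, 0.1], [0.3, 0.0, 0.2], [0.0, 0.0, 0.5]]
    train_char_count = [40, 120, 280]
    lo, hi = min_max_fit(train_char_count)

    val_tfidf = [[0.1, 0.1, 0.0]]
    val_char_count = [160]

    print(append_feature(train_tfidf, train_char_count, lo, hi))
    print(append_feature(val_tfidf, val_char_count, lo, hi))
```

Without scaling, a raw count such as 280 would dominate TF-IDF values that lie between 0 and 1, which is one motivation for bringing all features into the same range. The enriched rows can then be passed to any of the classifiers in place of the plain TF-IDF rows.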

5 Results analysis and discussion

5.1 Experiments results

Table 2 and Fig. 2 show the results of the SVM classifier on the validation dataset. They show that enriching the model with individual extra features can enhance it slightly. Specifically, the “subjectivity” feature improves the baseline model on all performance evaluation metrics by ~ 0.3%. Similarly, Table 3 and Fig. 3 show the results of the LR classifier, which demonstrate that the “polarity” feature can slightly enhance the model in terms of accuracy. Some other features yield the same accuracy as the LR baseline model but slightly improve precision, by ~ 0.02.

Table 2 The impact of feature engineering on the performance of the SVM classifier

Fig. 2 The performance of SVM on the validation dataset based on the engineered features

Table 3 The impact of feature engineering on the performance of the LR classifier

Fig. 3 The performance of LR on the validation dataset based on the engineered features

Also, Table 4 and Fig. 4 show that enriching the DT model with the “Char count” feature can enhance the model accuracy by a good margin of 1%. Finally, Table 5 and Fig. 5 demonstrate that enriching the GBDT model with most of the features can improve accuracy by a good margin, in the range of 1 to 2%, except for “unique word count” and “subjectivity,” which yield lower accuracy than the baseline.

Table 4 The impact of feature engineering on the performance of the DT classifier

Fig. 4 The performance of DT on the validation dataset based on the engineered features

Table 5 The impact of feature engineering on the performance of the GBDT classifier

Fig. 5 The performance of GBDT on the validation dataset based on the engineered features

5.2 Discussion and comparison

Prior research has explored various approaches to fake news detection, as discussed in the related work section, some of which were applied to the same dataset used in this study [10, 11]. The proposed approach differs from these existing methods [10, 11] at the component level. In [11], several ML models were used with TF-IDF features, together with additional deep learning models (CNN and LSTM) with GloVe; the latter exhibited low performance compared to the traditional ML models. In [10], similar models and TF-IDF features were used. In contrast, this approach additionally introduces thirteen knowledge base features, enhancing the model’s ability to prioritize knowledge base-relevant characteristics extracted from the text. This sets our approach apart from methods that rely primarily on fixed feature extraction techniques and traditional deep learning models, resulting in improved performance and generalization in the detection of COVID-19 fake news.

In the comparative analysis in Table 6, baseline_1 [10] consistently demonstrates strong performance across the various models on all of the metrics used. Baseline_2 [11] exhibits comparable performance for LR but shows a distinct pattern for SVM, with high recall and lower accuracy and precision, which is attributed to its kernel configuration (use of the “gamma” parameter rather than a “linear” kernel). Also, compared to the DT in baseline_2 [11], our approach achieves slightly lower accuracy at 86.21% versus 85.23%, but remains consistent on the other metrics. Conversely, when compared to baseline_1 [10], our approach demonstrates gains of ~ 1% and ~ 2% with DT and GBDT, respectively, on all metrics. Our variations consistently outperform their corresponding baselines, with the introduced features notably enhancing model effectiveness. The careful selection of features, including “subjectivity,” “polarity,” “char_count,” and “capitalchar_count,” contributes significantly to improving baseline model performance. Ultimately, the proposed approach yields superior results compared to methods relying solely on the baseline models baseline_1 [10] and baseline_2 [11]. Performance comparisons among these algorithms are summarized in Table 6.

Table 6 Performance Comparison with the baseline models

6 Conclusion

In summary, this work focused on increasing the performance of traditional ML models in detecting fake news related to the COVID-19 pandemic by enriching their input features, extracted by TF-IDF, with more text-representative features. First, the use of extra context features was investigated for all baseline models, and it was found that different features affect the performance of different classifiers and that applying a scaler to the extracted features can enhance a model’s performance. In addition, SVM and LR were improved slightly, whereas DT and GBDT were improved by a good margin. Moreover, “Char count”, “Word count”, and “unique words” are the most representative features among all those studied. Finally, this approach has consistently outperformed alternative strategies that rely solely on baseline TF-IDF and Word2Vec techniques. This paper opens the way for future research to investigate additional features and more advanced deep learning models for detecting fake news.