1 Introduction

Visual data attracts viewers more quickly than words do. The human brain captures and rapidly analyzes a news item, often flagging it as fake or real at a glance of its title, image, or a small segment of it, mostly without going through the entire textual content. It does so based on pre-existing knowledge. Even when we do read a whole text, there are few references and little time to check the authenticity of the content we come across. Various content creators exploit these limitations of human cognition and behavior. There is a need for state-of-the-art technological methods to assess the credibility of content, textual or visual, and authenticate it as fake or real. Online media emerged as a platform to share ideas, views, and news. With the advancement of mobile devices and the internet, news became easily accessible to people who were either deprived of or uninterested in traditional news sources such as television and newspapers. Long and seemingly tedious texts became easier to understand as images and videos now accompany them. In the same process, it also became challenging to detect the truth in such content.

In the present scenario, online media is losing its charm and credibility as content creators lure users to gain popularity and money with the content they post online. In the process, they pay no heed to the authenticity of the information, ignore verification, and mix misleading or tampered images or clips into the text. Content creators focus on posting catchy, attractive content that earns them likes, comments, and dollars. Sometimes both the text and the graphic content are intentionally made erroneous to spread fake news, making the entire item even more unrealistic. Hence there is an urgent need to design and develop new classification methods that assess the credibility of content, textual or visual, and segregate it as fake or real. When textual and visual factors are taken collectively, fake news detection methods have proved to provide higher accuracies than unimodal detection methods. Machine learning and deep learning-based detection mechanisms label news as fake or real by analyzing features of the text and the visual data. Users consuming information play an essential part in stopping the spread of fake content at the root level, before it circulates to reach a great mass and affects political, social, and economic lives. The algorithms used so far depend upon news data collected from websites and social media platforms, later classified into binary (real and fake) or multiple (ranging according to severity) labels by crowdsourcing or third-party authenticators.

With the advent of massive data and news content online, the intricacies add up when multiple data forms are available. Despite being beneficial for easy transmission and news consumption, multi-modal data also makes detecting fake news a strenuous task. The modalities prevalent on online media include text, image, audio, video, and hyperlinks. When text is accompanied by visual data, the impact of a news item rises. The large amount of visual data makes verification difficult: multi-modal data does not guarantee credibility, yet it attracts more attention than pure text content. Multi-modal features are therefore expected to be more beneficial for detecting fake news than unimodal features. The few high-quality datasets available for scientific research include binary-labeled and multi-label datasets such as Mediaeval, Sina Weibo, PolitiFact, Emergent, and Resized_V2 [1].

Figures 1 and 2 represent the critical knowledge predominantly available in the text and image parts of information circulating online. We propose that online social media images consist of three kinds of features: latent, explicit, and contextual. Latent features are extracted using layers of convolutions; deep convolutional networks learn kernel values that are used to extract them. According to Yang et al. [1], explicit features are hand-crafted features such as the resolution of an image or the number of faces in the picture. Apart from these two intrinsic feature types, contextual features are based on semantic relationships between the text and the image. We employ convolutional neural networks for both text and image classification. CNNs offer the advantage of extracting features directly from raw input without hand-crafted pre-processing, and they reduce the input across layers so that only the required information is preserved and used to make predictions. In this work, we propose a novel fake news detection framework based on two-stream convolutional neural networks for text and image input streams. This architecture consists of individual text and image classification modules, which are fused after the convolutional models are trained. Our experiments yielded scores ~3–6% higher than established state-of-the-art methods. The proposed architecture detects fake news based on both textual and visual information: the Text-CNN increases the overall efficiency of the architecture, while the Image-CNN contributes additive accuracy to the detection task. Together, the proposed Text-CNN and Image-CNN models outperform the existing state of the art.

Fig. 1 Text features

Fig. 2 Image features

The contributions of this work include:

  • Web scraping to create clean image datasets from two previously available datasets that contained news URLs.

  • We have proposed a new Coupled ConvNet architecture that comprises the proposed Text-CNN and Image-CNN modules for multi-modal fake news detection.

  • We have implemented CNN models on the textual and visual data of the TI-CNN, Emergent, and MICC-F220 datasets.

  • We have performed a comparative analysis of various CNN models’ efficiencies on real-world datasets for fake news detection.

  • We have analyzed the performance of deep learning on latent textual and visual features for fake news detection.

  • We have provided new deep learning pathways to better fake news detection.

The paper is organized as follows: Section 1 introduces the problem statement, the need for multi-modal fake news detection, the available datasets, and modality features. Section 2 discusses previous work on fake news classification and detection using various CNN and RNN models. Section 3 presents the mathematical background of the CNN architectures we utilize. Section 4 explains the proposed Coupled ConvNet architecture, its constituent Text-CNN and Image-CNN modules, and our methodology. Section 5 describes the datasets, experiments, result analysis, and baseline comparisons. Section 6 concludes and discusses potential research directions.

2 Related works

Fake news detection challenges include using multi-modal data to classify news as real or fake. Present methodologies include fake news detection on textual content [2,3,4]. Research shows that incorporating visual data improves fake news detection. With the rise of multi-modal content in users’ posts and news items, studies involving detection using visual data have rapidly increased.

Previous research [2, 5] includes studying image features of visual data such as accompanying images and image type. Other investigations focus on learning forensic features [6, 7]. Jin et al. [8] fused text information for better detection, applying an attention mechanism with an RNN over images and an LSTM over text and social context to obtain features and perform rumor detection on microblogs. Qi et al. [9] combined convolutional and recurrent neural networks to semantically detect and interpret real and fake photos. They introduced a novel approach called the Multi-domain Visual Neural Network (MVNN), which uses a CNN to extract frequency-domain patterns and a CNN-RNN to extract pixel-domain patterns, fusing them with an attention mechanism and outperforming state-of-the-art methods by 9.2%. Researchers have also provided various forensics tools and techniques to identify image manipulations, most of which detect physical cues within the image.

Recent works have moved towards deep learning techniques rather than relying on available prior knowledge about the data; learning from labeled training data has proved especially effective for fake news detection. Earlier studies focused on linguistic and textual data to study the characteristics and semantics of fake news. Deep neural networks have been used to check tweets for temporal-linguistic traits [3]. Attention mechanisms have also been used with RNNs for fusion [10]. Liu and Wu [11] modeled the classification with a combination of CNNs and RNNs. Less focus has been given to the credibility of multi-modal data on the web. Text and images can be well represented using deep neural networks, which Jin et al. [8] and Wang et al. [12] applied to fake news detection.

To overcome the limitation of learning a shared representation of multi-modal data, Khattar et al. [13] proposed a Multi-modal Variational AutoEncoder coupled with a binary classifier over text and image features; the model has three components: an encoder, a decoder, and a fake news detection module, and it outperforms state-of-the-art techniques by ~6% in accuracy. Ajao et al. [14] used a hybrid of CNNs and LSTM-RNNs to identify fake-news-related features without prior knowledge, achieving 82% accuracy. Jindal et al. [15] presented two novel datasets containing fake news text and images, using data augmentation to increase the amount of fake news data. Singhal et al. [16] performed fake news detection by introducing the SpotFake framework, which exploits the textual and visual features of news posts without considering subtasks such as event discrimination or modality correlations; the model improves accuracy over previous approaches by 3.27% and 6.83% on the Twitter and Sina Weibo datasets, respectively. TI-CNN was proposed for fake news detection by Yang et al. [1] using convolutional networks on both textual and visual data, incorporating both explicit and latent features extracted for both modalities using CNN layers.

A new challenge has emerged with technological advancement in Generative Adversarial Networks: detecting fake or computer-generated images. GANs pose a threat by allowing the creation of fake images and the manipulation of existing ones. Marra et al. [17] studied the performance of existing detectors that use conventional and deep learning methods, finding that deep learning detectors achieve higher efficiencies, with 89% accuracy. They compared traditional and deep learning image forgery detectors on a dataset of 36,302 images, with and without compression, concluding that deep networks such as XceptionNet, InceptionV3, and DenseNet obtain high accuracies even on compressed data. The yearly trend of published articles using deep networks for credibility analysis in recent years is represented in Fig. 3, and Fig. 4 shows the proportions of fine-tuned CNN models used in similar tasks.

Fig. 3 Yearly trend of research works

Fig. 4 CNN architectures used in previous research

By extracting event-invariant features and proposing event adversarial neural networks, Wang et al. [12] performed fake news detection on newly arrived events. Three tasks are completed: feature extraction, detection, and event discrimination. The study ignores event-specific features and considers only the shared features, providing accuracies of 71.5% and 82.7% on Twitter and Weibo, respectively. Sabir et al. [18] detected image repurposing, i.e., manipulations in image metadata, on the self-proposed MEIR dataset of real-world Flickr data, proposing a multi-modal deep learning method that utilizes metadata and image information to identify modifications.

Pomari et al. [19] used CNNs and illumination maps to detect splicing in fake images with an accuracy of more than 96%. Another approach used diverse modalities, including text, image, and source, to detect hoaxes [20]. Bayar and Stamm [21] developed a new convolutional layer that learns features from training data, suppressing image content and highlighting manipulation features; this approach detects image manipulations with an accuracy of 99.10%. Lago et al. [22] combined image forensics algorithms for detecting tampered images with a verification mechanism that checks whether images are correctly mapped to the textual news. In 2019, Cui et al. [23] proposed a detection framework named SAME that exploits user comments and latent sentiments using an adversarial mechanism. Volkova et al. [24] performed a qualitative and quantitative analysis of fake news classification models, proposing the qualitative analysis tool ERRFILTER; the modalities analyzed are text, lexical, and image inputs, and their combinations.

In image classification, Tariq et al. [27] detected fake face images generated by humans and machines using CNN-based models including VGG16, VGG19, ResNet, DenseNet, NASNet, XceptionNet, ShallowNet, and their ensembles. These networks detected GAN-generated and human-generated fake face images without using their metadata. The highest accuracies across various image sizes were obtained with the Ensemble ShallowNet (V1 & V3).

Sabir, Cheng, et al. [25] detected face manipulations in videos using recurrent convolutional models, which proved beneficial in exploiting temporal information across frames and improved existing accuracies by up to 4.55%. Fake video detection was performed by Guera and Delp [26] using a convolutional LSTM model on a large dataset of deepfake videos in which face swaps were done. Papadopoulou et al. [30] verified user-generated online videos, including YouTube videos, in real time, taking their context into account; the information exploited includes video comments as textual data and metadata such as video description, likes, dislikes, and uploader information.

3 Methodology

3.1 Overview

This section elaborates on the architectures of the classification models utilized in this task. The proposed Coupled ConvNet is composed of a Text-CNN module for textual fake news classification and an Image-CNN module for visual fake news classification. In both modules, we pre-process the input data at the early stages and feed it to convolutional neural networks. This section explains the architectures and mathematical background of Text-CNN and the other CNN models utilized in this work. Table 1 summarizes the neural network architectures used for credibility analysis of data in various modalities.

Table 1 ConvNet architectures for credibility analysis of different data modalities

3.2 Text-CNN

CNNs are widely used for visual tasks: for image classification, pixel information extracted from images is propagated as pixel values through consecutive convolutional layers. Words, however, need to be processed to make them understandable by a machine; a computing machine treats both visual and textual data as numeric data. The idea is to feed the machine text in numeric form, just as visual data is fed as pixel values. This is done by embedding words into vectors. Figure 5 details the layers of the Text-CNN architecture.

Fig. 5 Text-CNN architecture

A fixed-length vector can thus represent each word in a sentence. These embedded vectors are then propagated through convolutional layers in the same way image data moves through a deep network, with subsequent layers incorporating max-pooling, padding, activation functions, fully connected layers, and dropout. Mathematically, the \( i \)th word in a sentence is represented as a \( k \)-dimensional vector \( x_i \in \mathbb{R}^k \).

Then \( x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n \), where \( x_{1:n} \) is a sentence of length \( n \) and \( \oplus \) denotes concatenation; a series of words \( x_i, x_{i+1}, \dots, x_{i+j} \) is written \( x_{i:i+j} \). For a window of \( h \) words, a filter \( w \in \mathbb{R}^{hk} \) applied to the text generates a feature \( c_i \) from the word window \( x_{i:i+h-1} \) as \( c_i = f(w \cdot x_{i:i+h-1} + b) \), where \( b \) is a bias term and \( f \) a non-linear function. The filter \( w \) is applied to every word window, producing a feature map \( c = [c_1, c_2, \dots, c_{n-h+1}] \) with \( c \in \mathbb{R}^{n-h+1} \). A max-pooling layer is applied next; it extracts the feature with the maximum value in the feature map, \( \hat{c} = \max\{c\} \). These pooled features are propagated to fully connected layers and finally to a softmax layer for classification.
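To make the notation concrete, the following minimal NumPy sketch computes the feature map and the max-pooled feature defined above; the toy sizes (n = 7, k = 4, h = 3) and the random filter and embeddings are illustrative assumptions, not values from our experiments.

```python
import numpy as np

n, k, h = 7, 4, 3                  # toy sentence length, embedding size, window
x = np.random.randn(n, k)          # embedded sentence x_{1:n}, one row per word
w = np.random.randn(h, k)          # filter w in R^{hk}
b = 0.1                            # bias term b

def f(z):
    return np.maximum(0.0, z)      # a non-linear function f (ReLU here)

# c_i = f(w . x_{i:i+h-1} + b), giving the feature map c in R^{n-h+1}
c = np.array([f(np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)])

c_hat = c.max()                    # max-over-time pooling: c_hat = max{c}
print(c.shape, c_hat)              # (5,) and the pooled feature
```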

3.3 Image-CNN

Fine-tuned CNN architectures provide good accuracies when it comes to extracting hidden image features and patterns. We implemented eight different CNN architectures for visual fake news detection: AlexNet, Xception, VGG16, VGG19, ResNet50, MobileNetV2, InceptionV3, and DenseNet. The designs of all fine-tuned image CNN models used in this work are described in the following subsections and represented in Fig. 6.

Fig. 6 CNN model architectures: (a) AlexNet, (b) Xception, (c) VGG16, (d) VGG19, (e) ResNet50, (f) MobileNetV2, (g) InceptionV3, (h) DenseNet

AlexNet

AlexNet is a convolutional neural network designed by Alex Krizhevsky in 2012; it won that year’s ILSVRC challenge. The model demonstrated that network depth is necessary for efficient applications: depth contributed to its high performance but made training computationally costly, which was addressed by training on multiple GPUs. The AlexNet architecture consists of 8 layers: the first five are convolutional layers, each optionally followed by a pooling layer, and the last three are fully connected layers. The model uses the ReLU activation function, owing to its training-time advantage over tanh or sigmoid functions. Overfitting in AlexNet was reduced by data augmentation and by dropout layers that turn off neurons with a probability of 0.5.

VGG

The Visual Geometry Group (VGG) won the ILSVRC 2014 competition. The group members, Karen Simonyan and Andrew Zisserman, experimented with different numbers of layers in the deep network and released two versions of their model, VGG16 and VGG19, with 16 and 19 layers, respectively. They showed that deeper networks with more layers yield higher accuracy for image classification tasks. They replaced large kernel-sized filters of sizes 11 × 11 and 5 × 5 with smaller filters of size 3 × 3. Three fully connected layers follow the convolutional layers, followed by a softmax layer. ReLU is used as the non-linear activation function for the hidden layers. The number of channels doubles stage by stage, from 64 in the first layer to 512 in the last convolutional stage. The increased depth makes VGG a slower network to train.

ResNet

The Residual Neural Network, introduced by Kaiming He et al. in 2015, simplifies training through skip connections. ResNet uses skips of two or three layers that jump across the network. This makes training easier and faster and reduces the vanishing gradient problem, since gradients can bypass layers via the skip connections. It uses the ReLU activation function and Batch Normalization. Activations from a previous layer are reused until the current layer learns its weights.

Layers are indexed as \( l-2 \) to \( l \) for a single skip in backward propagation and as \( l \) to \( l+2 \) in forward propagation. Given \( k-1 \) as the number of skipped layers, this generalizes to \( l-k \) for a backward skip and \( l+k \) for a forward skip. A residual network building block with residual function \( F(x) \) can be defined by the equations:

For equal dimensions of x and F,

$$ y=F\left(x,\left\{{W}_i\right\}\right)+x $$
(1)

and

For unequal dimensions,

$$ y=F\left(x,\left\{{W}_i\right\}\right)+{W}_sx $$
(2)

Here \( x \) is the input vector, \( y \) is the output vector, \( F(x, \{W_i\}) \) is the residual mapping, and \( W_s \) is a linear projection used to match dimensions.
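A minimal Keras sketch of this building block follows; the layer widths, kernel sizes, and the 1 × 1 projection standing in for \( W_s \) are illustrative assumptions rather than the exact ResNet50 configuration.

```python
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    # F(x, {W_i}): two stacked 3x3 convolutions with Batch Normalization, ReLU
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)

    # Eq. (2): project x with W_s (a 1x1 convolution) when dimensions differ;
    # Eq. (1): otherwise add the identity shortcut directly.
    if stride != 1 or x.shape[-1] != filters:
        x = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)

    return layers.ReLU()(layers.Add()([y, x]))   # y = F(x, {W_i}) + x
```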

Inception V3

The Inception V3 model by Google for image classification was presented at ILSVRC 2015, providing a low error rate with a 42-layer deep network. The model uses factorization to split a 5 × 5 convolution into two 3 × 3 convolutions, reducing the parameters by 28%. Similarly, a 3 × 3 convolution can be replaced by a pair of 1 × 3 and 3 × 1 convolutions. The auxiliary loss tower of Inception V1 is used only on top of the last 17 × 17 layer, acting as a regularizer in Inception V3. Inception V3 is observed to be much more efficient than VGGNet in terms of computation cost.
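As a quick check on the quoted 28% figure, assume for illustration that the input and output channel counts are equal (say \( C \)) and ignore biases; a 5 × 5 convolution then has \( 25C^2 \) weights, while two stacked 3 × 3 convolutions have \( 18C^2 \):

$$ \frac{25{C}^2 - 2\left(3\times 3\right){C}^2}{25{C}^2}=\frac{25-18}{25}=0.28 $$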

Xception

Xception stands for “Extreme Inception,” and its architecture is based entirely on depthwise separable convolutional layers. It consists of 36 convolutional layers (arranged as 14 modules) followed by fully connected layers and a logistic regression layer. Except for the first and last modules, all convolutional modules have residual connections. Relative to Inception V3, the weight decay (L2 regularization) rate was changed to 1e−5, and the dropout layer uses a probability of 0.5. The model does not incorporate the auxiliary loss tower that is optionally used in the Inception V3 architecture.

DenseNet

DenseNets, introduced in 2018, are residual networks with many parallel skips: each layer in a DenseNet is connected in a feed-forward manner to every other layer. The total number of direct connections between the layers is \( \frac{L\left(L+1\right)}{2} \), where \( L \) is the number of layers. DenseNets do not need to relearn redundant feature maps and therefore require fewer parameters. They concatenate feature maps instead of summing them. This can be stated as:

$$ {x}_l={H}_l\left(\left[{x}_0,{x}_1,\dots, {x}_{l-1}\right]\right) $$
(3)

Here \( x_l \) is the output of the \( l \)th layer and \( H_l \) is a non-linear transformation.
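A minimal Keras sketch of a dense block implementing Eq. (3) follows; the number of layers and the growth rate are illustrative assumptions.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        # H_l: BN -> ReLU -> 3x3 convolution producing growth_rate feature maps
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        # x_l = H_l([x_0, x_1, ..., x_{l-1}]): concatenate feature maps, not sum
        x = layers.Concatenate()([x, y])
    return x
```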

MobileNet V2

MobileNet V2, a CNN specially designed in 2019 for mobile devices, is based on inverted residual connections and lightweight bottleneck depthwise separable convolution layers. The first layer of MobileNet V2 is a convolutional layer with 32 filters, followed by nineteen residual bottleneck layers. The kernel size is 3 × 3, and the non-linear activation function is ReLU6. The residual layers make the model memory efficient. The bottleneck block operator can be expressed as:

$$ F(x)={\sum}_{i=1}^t\left({A}_i\circ N\circ {B}_i\right)(x) $$
(4)

where \( A_i \) is a linear transformation, \( N \) is a non-linear transformation, and \( B_i \) is a linear transformation to the output domain.

4 Proposed Coupled ConvNet

The proposed approach to fake news detection extends the use of convolutional neural networks to a broader scale to automate fraudulent content detection on the web. Most of the existing literature addresses single-modality tasks in which only one of the available features is exploited. Many approaches are based on machine learning algorithms, while others use deep learning models such as GRUs, LSTMs, Bi-LSTMs, and other RNNs for text classification. We leverage this task by introducing a new text classification model based on a convolutional neural network. With the onset of visual features, pre-trained CNN networks are in wide use; the proposed image classification model is therefore based on a pre-trained model with fine-tuning. Since fake news detection tasks can be combined across data modalities, the Coupled ConvNet introduced in this work is a hybrid two-stream convolutional architecture (a text stream and an image stream) whose outputs are combined using a late fusion technique. The architecture comprises two streams (modules): a Text Module for textual classification and an Image Module for visual classification, explained in Sections 4.1 and 4.2. The combination mechanism used in the proposed Coupled ConvNet is described in Section 4.3. The sequence of operations performed in both modules is depicted in Fig. 7, and Fig. 8 represents the proposed Coupled ConvNet architecture.

Fig. 7 Sequence of operations performed

Fig. 8 Proposed Coupled ConvNet architecture

4.1 Text module

A raw text dataset undergoes several refinement and analysis procedures before its veracity is affirmed. The first of these is pre-processing the text; word embeddings are then generated for the textual content. The embedded vectors are fed to a one-dimensional convolutional model, which applies convolutions over the text vectors. A series of convolution and pooling layers analyzes the data features, and these layers finally produce a binary output indicating the authenticity of the item’s information. Results are obtained after training the proposed Text-CNN model over multiple iterations.

We use only the ‘title,’ ‘text,’ and ‘label’ columns from the versatile information present in the datasets. Textual pre-processing involves the following steps: lowercase conversion, punctuation removal, URL removal, numeric value removal, tokenization, stop-word removal, and stemming/lemmatization. In the next step, we perform array padding: the maximum length is calculated from the longest news item in the data, and any text shorter than this length is padded with zeroes. The data is then split into train, test, and validation sets. The processed data is encoded, and the text and title inputs are embedded using GloVe embeddings; these embeddings are added after the 1-D input layer. We then feed this data to the proposed CNN model, which consists of three one-dimensional convolutional layers with the ReLU activation function, each followed by a max-pooling layer, and subsequent fully connected Dense and Dropout layers. After experimenting with dropout values ranging from 0.2 to 0.8, the best results were obtained by setting both dropout layers to 0.4. A binary sigmoid classifier generates the predictions; a minimal sketch of this module is given below.
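The sketch below is a minimal single-input Keras version of this module; the vocabulary size, sequence length, filter counts, kernel sizes, and dense widths are illustrative assumptions (the full model also processes title and text as separate concatenated streams, see Section 5.2).

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 1000, 100   # assumed sizes

def build_text_cnn(embedding_matrix=None):
    inp = layers.Input(shape=(MAX_LEN,))
    # GloVe embeddings: pass a pre-built matrix via weights=[embedding_matrix]
    emb = layers.Embedding(
        VOCAB_SIZE, EMB_DIM,
        weights=None if embedding_matrix is None else [embedding_matrix])(inp)
    x = emb
    for filters in (128, 64, 32):          # three Conv1D + max-pooling stages
        x = layers.Conv1D(filters, 5, activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.4)(x)             # best dropout value found: 0.4
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # binary sigmoid classifier
    return models.Model(inp, out)
```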

4.2 Image module

CNNs have shown considerable performance on various image classification tasks. They identify latent features without demanding any extra information; these latent features are present inside an image and include resolution, objects, pixel parameters, image size, etc. When the image data under examination is combined with other modalities such as text, it helps classify real and fake news. For image analysis, the image datasets are created as follows. The datasets consist of URLs of news pages; we use these URLs to scrape the URLs of images present on those pages, using BeautifulSoup. We then download the fake and real photos from the newly obtained URLs into separate local directories, and the image URLs are also added to the datasets alongside their respective news items. The data folders are uploaded to Google Drive, and the drive is mounted in Google Colab. We use the split-folders module to divide the TI-CNN and EMERGENT datasets into train, test, and validation sets with 80%, 10%, and 10% of the fake and real images, respectively. The MICC-F220 dataset is instead split 60%, 20%, and 20% into training, validation, and testing sets: this dataset consists of only 220 images (110 real and 110 fake), and splitting it 8:1:1 would leave very few images in the validation and test sets, biasing the classification results. To avoid this bias and generate normalized results, MICC-F220 is split in a proportion that keeps a reasonable number of images for validation and testing. After this, we perform image augmentation using ImageDataGenerator; the operations performed include rescaling, rotation, shear, zoom, and flipping of images, which improves the quality of the datasets (a sketch of this step follows below). The image data is then fed to the CNN models for classification. The CNN training sequence is similar to the text convolution sequence, except that two-dimensional convolutions are performed on the visual data. We feed the visual data to the various CNN models separately; the models experimented with include AlexNet, ResNet50, MobileNet, DenseNet, XceptionNet, InceptionV3, VGG16, and VGG19 [31,32,33,34]. Accuracy is determined after training the models for a specified number of epochs, and the result trends for training, test, and validation are observed.
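A minimal sketch of the augmentation step follows, assuming the directory layout produced by split-folders; the parameter values and path are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,        # rescaling
    rotation_range=20,        # rotation
    shear_range=0.2,          # shear
    zoom_range=0.2,           # zoom
    horizontal_flip=True,     # flipping
)

train_flow = train_gen.flow_from_directory(
    "output/train",           # assumed path from the split-folders step
    target_size=(224, 224),   # input size used by the Image-CNN models
    batch_size=64,
    class_mode="binary",      # real vs. fake
)
```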

The proposed Image-CNN module uses one of the above-mentioned pre-trained models in each experiment. On top of the pre-trained model, a Dense layer of width 512 with ReLU activation is added, followed by a Dropout layer with probability 0.4, another Dense layer of width 256, a Dropout layer with probability 0.2, and a binary classification layer with a sigmoid activation function. The Dense and Dropout widths decrease to avoid an abrupt transition to the final classification layer, allowing the input to pass smoothly through the fully connected layers rather than jumping directly to the last layer. As observed during the experiments, using the two dropout layers with values 0.4 and 0.2 considerably reduces overfitting and the training loss, thereby increasing accuracy. A sketch of this head is given below.
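A minimal Keras sketch of this head follows, using VGG16 as an example base; freezing the base and the ReLU activation on the 256-unit layer are our assumptions for the sketch.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False            # assumed fine-tuning choice for the sketch

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),   # Dense layer of width 512, ReLU
    layers.Dropout(0.4),                    # Dropout with probability 0.4
    layers.Dense(256, activation="relu"),   # Dense layer of width 256
    layers.Dropout(0.2),                    # Dropout with probability 0.2
    layers.Dense(1, activation="sigmoid"),  # binary classification layer
])
```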

4.3 Text-image fusion

After the text and image classification modules are implemented separately, this stage fuses their outputs. The prediction probabilities from both modules are forwarded to a late fusion operation. Late fusion, a scalable and straightforward method, combines the outputs of multiple streams after the training phase: the decision vectors from each stream are combined using a suitable combinatorial operation. The proposed method uses a weighted fusion approach in which each modality is assigned a weight that determines its contribution to the final classification decision; the weights are chosen to maximize classification accuracy. For a fusion function \( f : P_t, P_i \rightarrow P_c \), where \( P_t \) and \( P_i \) are the sets of prediction probabilities denoting the decisions of the text and image streams, the combined probabilities \( P_c \) give the output decisions after late fusion. \( P_c \) is calculated by adding the products of the text and image prediction probabilities with their assigned weights \( W_t \) (text weight) and \( W_i \) (image weight). It is expressed as:

$$ {P}_c={P}_t\ast {W}_t+{P}_i\ast {W}_i $$
(5)

The weights are chosen by experimenting with all possible combinations, varying the weight values between 0.1 and 0.9 in steps of 0.1; the text and image weights vary inversely. The combination of weights that produces the best result is used for each experiment. These weights are listed in Table 2 in Section 5.2, and a sketch of the search is given below.
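A minimal sketch of Eq. (5) and the weight search follows; encoding the inverse relation as \( W_i = 1 - W_t \) is our reading of “text and image weights vary inversely,” and the variable names are assumptions.

```python
import numpy as np

def fuse(p_text, p_image, w_text):
    # Eq. (5): P_c = P_t * W_t + P_i * W_i, with W_i = 1 - W_t (assumed)
    return p_text * w_text + p_image * (1.0 - w_text)

def best_weights(p_text, p_image, labels):
    best_w, best_acc = None, -1.0
    for w_text in np.arange(0.1, 1.0, 0.1):          # 0.1 .. 0.9, step 0.1
        preds = (fuse(p_text, p_image, w_text) >= 0.5).astype(int)
        acc = (preds == labels).mean()
        if acc > best_acc:
            best_w, best_acc = w_text, acc
    return best_w, best_acc
```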

Table 2 Fusion weights that provided maximum classification accuracies

5 Experimental result analysis

This section discusses the datasets utilized in each of the performed experiments. We also describe the results obtained for the various experiments on different models and compare their efficiencies on the datasets we have used. Results are reported in terms of accuracy, precision, recall, and F1-score. For an efficient baseline comparison of our model’s performance on the MICC-F220 dataset, we also calculate the TPR (True Positive Rate) and FPR (False Positive Rate), as sketched below. We further compare our work with existing fake news classification results on the datasets we use and demonstrate that our model beats all established baselines in textual and visual fake news detection.
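For reference, the following minimal sketch shows how TPR and FPR are computed from binary predictions (treating label 1 as the positive class); the function and variable names are our own.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
    return tp / (tp + fn), fp / (fp + tn)        # TPR, FPR
```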

5.1 Datasets

TI-CNN

With only a few good-quality multi-modal datasets available, we utilize the dataset available online,Footnote 1 used by Yang et al. [1] for a similar fake news classification task. This dataset contains 20,015 news items from websites, 11,941 fake and 8074 real, and is rich in the wide range of details it covers. We use all of these news items for the Text-CNN module, using their title, text, and label information. For the Image-CNN module, we use the image URLs in the ‘main_img_url’ column to scrape images from the web. The total number of images extracted from TI-CNN is 5733, constituting 2612 real news images and 3121 fake news images; the remaining URLs redirected to corrupted or removed web pages, leaving an image dataset smaller than its corresponding text items. The TI-CNN dataset is used for experimentation in both the Text-CNN and Image-CNN modules and later in the proposed Coupled ConvNet architecture.

Emergent

Another dataset experimented with is the EMERGENT (FNC) dataset created by Ferreira et al. [35], consisting of 300 claims and 2595 associated articles. We polish this dataset by discarding duplicate news items and removing blank entries. For the Image-CNN module, we use the post URLs to extract image URLs and then scrape images from the EMERGENT web pages, yielding a clean dataset of 1338 fake and 791 real images. We have made both of these image datasets publicly available. This dataset is also used in both of the proposed individual modules and then in the proposed Coupled ConvNet architecture.

MICC-F220

Further, we used the MICC-F220 dataset by Amerini et al. [36], which consists only of real and tampered images, with no other form of data present. We use it with the CNN models to identify whether an image is tampered or original, in short, fake or real. Due to the lack of textual information, this dataset is employed solely in the proposed Image-CNN module, where it is used to compare the efficiencies of the pre-trained CNN models within the proposed architecture.

5.2 Implementation settings

All experiments were performed on Google Colab, which provides up to 13.53 GB of RAM, a 12 GB NVIDIA Tesla K80 GPU hardware accelerator, and Python 3. In Text-CNN, we employed a RegexTokenizer to extract tokens from news titles and texts, and the Porter Stemmer and WordNet Lemmatizer to reduce words to their root forms. We used GloVe representations for the word embeddings in Text-CNN, applied one-dimensional convolutions to the title and text, and concatenated their layers. We used 0.4 and 0.8 as subsequent dropout values in the experiments, a batch size of 64, and trained the model for 250 epochs. For Image-CNN, images are input at size 224 × 224; setting the dropout value to 0.2 produced a considerable increase in training accuracy. We used the Adam optimizer for all models, with the batch size set to 64 instances. The batch size affects the training time of the model, and the aim is to maximize classification performance while minimizing computation time: a batch size below 64 resulted in higher training time, making the process slower, whereas Google Colab did not accommodate a value greater than 64, so 64 is used for both the text and image modules. We used binary cross-entropy loss for classifying each item into the two categories, real and fake. In the final merging phase, the text and image feature weights detailed in Table 2 were used to provide the best precision and classification accuracy.
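The training settings above translate into Keras roughly as in this minimal sketch; the tiny stand-in model and random data are our own assumptions, kept only to make the snippet runnable.

```python
import numpy as np
from tensorflow.keras import layers, models

model = models.Sequential([                   # stand-in model for illustration
    layers.Input(shape=(100,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",               # Adam optimizer for all models
              loss="binary_crossentropy",     # two categories: real and fake
              metrics=["accuracy"])

x_demo = np.random.randn(256, 100)            # random stand-in data
y_demo = np.random.randint(0, 2, size=(256, 1))
model.fit(x_demo, y_demo, batch_size=64, epochs=2)   # paper trains 250 epochs
```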

5.3 Result analysis

This section presents the performance comparisons of all models used in our work for fake news classification on each of the three datasets. The scores are presented as accuracy, precision, recall, and F1-scores. The compared results are shown in Tables 3, 4, 5, 6 and 7.

Table 3 Performance of Text-CNN module on TI-CNN and EMERGENT
Table 4 Performance of Image-CNN module on TI-CNN and EMERGENT
Table 5 Performance of Image-CNN module on MICC-F220
Table 6 Performance of Coupled ConvNet model on TI-CNN and EMERGENT
Table 7 Accuracy comparison of Image-CNN models on all datasets

Table 3 provides comparison values for the Text-CNN module on the TI-CNN and EMERGENT datasets. The values indicate that CNNs exhibit outstanding performance for classifying text-based fake news, with 96.26% accuracy on TI-CNN and 93.56% on EMERGENT. Better scores were obtained on TI-CNN than on EMERGENT across all Text-CNN performance measures, which is attributable to the larger size of the TI-CNN data: more data aids training and hence produces better results. Table 4 reports the performance of the eight Image-CNN models on TI-CNN and EMERGENT. VGG16 and VGG19 performed best on the TI-CNN dataset, with 82.72% and 81.04% accuracy, respectively, followed by ResNet50 and MobileNet with 77.54% and 73.37%; the other Image-CNN models scored below 63% accuracy on TI-CNN. The top four models in precision and F1-score on TI-CNN follow the same order as accuracy: VGG16, VGG19, ResNet50, and MobileNet. In recall, Inception V3 achieved 100% on TI-CNN, followed by DenseNet, VGG16, and Xception. On the EMERGENT dataset, ResNet50 and Xception secured the highest accuracy at 51.26% each, followed by DenseNet and MobileNet with 48.65% and 46.93%, respectively. That VGG16 performed best on TI-CNN while ResNet50 and Xception led on EMERGENT indicates that the relative strengths of the Image-CNN models vary with the dataset. Table 5 shows the performance of the eight Image-CNN models on the image-only MICC-F220 dataset, led by Xception with 100% accuracy, followed by VGG16 with 95.05%, VGG19 with 91.97%, and AlexNet with 91.54%.

Table 6 provides the final output performance figures of the proposed Coupled ConvNet framework on the two datasets; comparisons based on accuracy, precision, recall, and F1-score can be inferred from the table. To identify the best model for text-and-image multi-modal fake news detection, consider the accuracy comparisons on the TI-CNN and EMERGENT datasets. The combination of Text-CNN with VGG16 performed best on both datasets, with 98.93% and 94.05% accuracy, respectively. The Text-CNN and VGG19 combination was second best on TI-CNN with 98.4% accuracy, while the Text-CNN and MobileNet Coupled ConvNet was second best on EMERGENT with 93.98%. The third- and fourth-best performances on TI-CNN were observed with DenseNet and InceptionV3 at 97.86% and 97.65% accuracy, respectively; on EMERGENT, ResNet50 and Xception produced 91.47% and 90.98% accuracy, respectively.

As inferred from Table 2, the weights producing the best classification results are generally 0.5 for both text and image: the two modalities contribute equally to detecting fake news efficiently. In some cases, the contribution is found to be 7:3 for the text and image modalities, highlighting text as a necessary component for fake news detection. It is also evident that exploiting the visual modality is equally essential.

The MICC-F220 dataset consists of tampered and unaltered images. Images in the unaltered category have not been edited in any form, so the dataset serves the purpose of efficiently distinguishing real from fake images. We deduce that CNN models are highly accurate at detecting fake news where text is classified based on its vector embeddings and images have been tampered with or edited. We propose using combinations of text and image CNN models to detect fake news across textual and visual modalities, and we provide performance comparisons of these models to enable an informed selection for fake news detection tasks. The accuracy obtained on MICC-F220 is as high as 100% with XceptionNet; the lowest is 59.52% with ResNet50. The other models also demonstrated strong performance with high accuracy values. This performance highlights the need for larger visual and multi-modal datasets with distinguishable latent features.

Table 7 is provided for ease of comparison of the accuracy scores of the Image-CNN models across the three datasets. VGG16 emerges as a consistent performer, with Xception and MobileNet the next best. Despite achieving a 100% result on MICC-F220, Xception displays only average performance on the other two datasets and can be regarded as slightly inconsistent across datasets. Figures 9, 10, and 11 graphically represent the comparative accuracies achieved with the proposed architectures to aid visual understanding.

Fig. 9 Accuracy comparison on TI-CNN dataset

Fig. 10 Accuracy comparison on Emergent dataset

Fig. 11 Accuracy comparison of Image-CNNs on three datasets

We conclude that CNNs perform better when the dataset is composed entirely of tampered images. Datasets whose fake images span false, tampered, old, misleading, and unrelated pictures yield somewhat lower performance, as CNNs can detect only latent features. To exploit the cues contained in all types of fake photos, multi-modal frameworks are needed. The best-performing models above are likely to perform even better on larger training datasets.

5.4 Baseline comparisons

We validate our results against both single-modality (textual and visual) methods and multi-modal methods for a fair comparison of our proposed work with established baselines; the results for each dataset are compared separately in Tables 8, 9, and 10. As the first work to examine Emergent on a visual basis, we establish the baseline for visual and multi-modal fake news detection on this dataset, since no prior work exists in its visual domain. The methods used for comparison were re-implemented by reproducing the works of the established baselines, with all experiments performed in an environment similar to that described in the existing research. For unimodal tasks, results are compared using a single-stream model individually; existing works that propose a combination of two-stream networks are compared with the results of the proposed Coupled ConvNet.

Table 8 Baseline comparison of TI-CNN dataset
Table 9 Baseline comparison on EMERGENT (FNC) dataset
Table 10 Baseline comparison of MICC-F220 dataset

TI-CNN

On this dataset, Yang et al. [1] experimented with multiple text classification methods: Logistic Regression, GRU, LSTM, and Text-CNN. For the visual domain, they used an image CNN with a proposed architecture of convolutional layers. They created the TI-CNN dataset and performed text classification using an embedding layer and a one-dimensional convolutional layer; image convolution is achieved by a model containing three convolutional layers with thirty-two 3 × 3 filters each and the ReLU activation function. All of our text and image models surpass the scores obtained by Yang et al. [1]: the individual text and image models we propose provide higher accuracies than theirs. In the multi-modal setting, our approach obtains the highest F1-score of 98.71% using the combination of Text-CNN and VGG16, which outperforms the state-of-the-art result by ~6% and establishes the proposed work as a new baseline for multi-modal fake news detection.

Emergent

Experiments previously performed by researchers used the FNC (FakeNewsChallenge) dataset, which was derived from Emergent. We compare the text classification results of our model with the LSTM model used by Conforti et al. [37], the Logistic Regression applied by Bourgonje et al. [38], and the ensemble of multiple methods deployed by Thorne et al. [39]. The Text-CNN classification model beats these established baselines, providing an accuracy of 93.56%. Visual fake news detection on this dataset has not been performed previously, as the dataset was limited to textual information; we extend the task to visual analysis by adding images extracted from the page websites and achieve a maximum of 51.26% accuracy using the ResNet50 and Xception models.

MICC-F220

Earlier work on this dataset incorporated image forgery detection techniques, with Amerini et al. [36] demonstrating 100% TPR and 8% FPR. Most of our proposed methods display a 0% False Positive Rate, and XceptionNet provides a 100% True Positive Rate, outperforming all other baselines. A 0% FPR demonstrates that no fake samples were wrongly classified as real during the testing phase, and a 100% TPR shows that all unaltered samples in the test set were classified into the correct class. A model that achieves 0% FPR and 100% TPR is a perfect classifier; with the proposed approach, the Xception model is such a classifier for this dataset, classifying all test samples into the correct classes.

6 Conclusion and future work

A novel Coupled ConvNet architecture is proposed, comprising Text-CNN and Image-CNN modules. This work accomplishes fake news detection using several convolutional models on text and image data. Our first contribution provides image datasets for fake news detection, which we have made publicly available on Kaggle.Footnote 2,Footnote 3 We compare the performance of the image classification models AlexNet, ResNet50, DenseNet, MobileNet, Xception, InceptionV3, VGG16, and VGG19 on three real-world datasets: TI-CNN, EMERGENT, and MICC-F220. The Text-CNN module is used on TI-CNN and EMERGENT, and the Image-CNN module on all of the above datasets. We trained these models and obtained their training, validation, and testing accuracy scores, utilized latent features for fake image classification, and analyzed how well classification can be performed by comparing the models’ efficiencies. All of our models surpass the fake news detection baselines with high scores. The proposed architecture provides a new fake news detection method using convolutional neural networks and establishes a new baseline in this domain. The proposed model is expected to function even more efficiently on larger datasets, and we intend to apply these models to larger datasets in the future. We are also motivated to further tune the parameters used in these models to enhance classification accuracy, and to develop an efficient CNN-based classification model with fine-tuned hyperparameters serving greater accuracies and better fake news detection.