1 Introduction

Along with the advancement of new technology and proliferation of internet, various types of information are being disseminated with an unprecedented speed (Kumar et al., 2021; Rubin, 2017). Traditional media are interwoven with online social media, and patterns of information dissemination have been completely changed (Zhou et al., 2021). The masses do not only receive information passively, but also actively participate in producing and sharing information (Fast et al., 2018; Patacconi & Vikander, 2015). However, unverified information that may be false spreads just like accurate information through the internet, thus possibly going viral and having a significant influence on public opinions and decisions (Bondielli & Marcelloni, 2019; Yuan et al., 2018; Zhang & Ghorbani, 2020). Misinformation (Fallis, 2015) generated and disseminated on social network platforms is called misinformation in social networks. Fake news is one of the representatives. Rumor is a social phenomenon in which news content or subjective opinions and comments are spread through social networks (Bondielli & Marcelloni, 2019; Kumar & Shah, 2018). The authenticity of the rumor content is unproven, but when it is proved to be wrong, the rumor becomes misinformation. Fake news is deemed one of the most popular forms of false or unverified information, and it is necessary to determine fake news as early as possible for decreasing its detrimental impacts on the society as a whole.

One potential reason for the broad dissemination of fake news is a lack of adequate knowledge and competency among the public. Readers are often unaware of the credibility of information sources and the authenticity of the news they read (Castillo et al., 2011; Gupta et al., 2014). Another reason is that automated fact-checking methods are not available. Although multiple websites, including Politifact and AltNews, have made great efforts in fake news detection, most of them depend on manual approaches that are time consuming and too slow to prevent the initial dissemination of fake news (Singhal et al., 2019). Previous studies on fake news detection mainly concentrated on extracting false information from the text content of news (Bondielli & Marcelloni, 2019; Faustini & Covões, 2020; Reddy et al., 2020; Reis et al., 2019). However, fake news is commonly produced in diverse forms, such as image and text, so multimodal approaches are needed to detect it. After multimodal features have been extracted from text, image, audio, video and so on, feature fusion becomes crucial for the effective detection of false information in news.

Fake news detection has concentrated on determining whether a piece of news is true or false (Goldani et al., 2021; Helmstetter & Paulheim, 2018; Singhal et al., 2019), which is a binary classification problem. In practice, false information within news is more complicated. There are multiple categories of fake news, such as misleading content, imposter content, fabricated content, false connection, false context, satire or parody, and manipulated content (Nakamura et al., 2019). Fine-grained detection of fake news is conducive to a better understanding of the degree of fakeness (Fung et al., 2021; Goldani et al., 2021). It also helps researchers to further explore and identify the characteristics of a specific category of fake news. Pieces of news with varying degrees of fakeness may be created with distinct motivations and may disseminate through different paths or patterns.

Therefore, this research was motivated to establish a multimodal approach integrating RoBERTa with DenseNet through a co-attention mechanism (MRDCA) for fake news detection based upon fine-grained categorization. The MRDCA model took into account the two modalities of text and image. Within MRDCA, RoBERTa was employed for extracting text features of fake news, and DenseNet was employed for extracting image features of fake news. On the basis of the co-attention mechanism, text and image features were fused for detecting and classifying various forms of fake news. The main contributions of this paper are as follows. Firstly, based upon the 6-way classification labels in Fakeddit, the multimodal model MRDCA was designed for fine-grained fake news detection. Secondly, the co-attention mechanism was utilized for dynamically learning and capturing the information interaction between text and image features, so as to better accomplish the fine-grained fake news classification task. Thirdly, multiple experiments on the benchmark dataset Fakeddit validated the effectiveness of MRDCA.

The remainder of this paper is organized as follows. Section 2 reviews relevant studies that have been conducted on fake news detection. Section 3 presents the methodology, which consists of the three modules of RoBERTa, DenseNet, and the co-attention mechanism. Section 4 presents the dataset, experimental settings, performance metrics, and comparison experiments. Section 5 describes the experimental results and discusses the performance of MRDCA relative to other models with unimodal and multimodal features. Section 6 summarizes major contributions and research limitations.

2 Related work

A news article involves information from multiple perspectives such as headline, content, image, and metadata. Any alteration made to these perspectives gives rise to deceptive behavior that is commonly termed fake news (Lazer et al., 2018; Singhal et al., 2019). Different categories of fake news have been explored with the objective of obtaining insights into how fake news can be identified efficiently and quickly (Bondielli & Marcelloni, 2019; Davoudi et al., 2022; Kürüm et al., 2018), in order to alleviate its negative influence on society. Kirchknopf et al. (2021) summarized various approaches for detecting fake news and divided them into two major groups: unimodal approaches and multimodal approaches. The first group of approaches depended strongly on unimodal features such as text or image. For the text modality, fake news detection mainly focused on statistical and semantic features of text content (Braşoveanu & Andonie, 2019; Liao et al., 2021). Based upon counting diverse symbols (e.g. punctuation, emotion, and hyperlink) in news texts, Castillo et al. (2011) developed a model for determining the authenticity of news. Rashkin et al. (2017) incorporated semantic information and features into a fake news detection model that was combined with a long short-term memory (LSTM) network. Aiming to enhance fact analysis in news content, Pan et al. (2018) proposed novel approaches, including the B-TransE model, to detect fake news through knowledge graphs. The results indicated that some approaches achieved F1-scores over 0.80. The aforementioned approaches relied heavily on hand-crafted features, which was inefficient and consumed considerable resources.

With the continuous development of deep learning, researchers have been inclined to construct detection models based on deep learning techniques for automatic end-to-end detection of fake news. Wang (2017) developed a hybrid convolutional neural network (CNN) model and categorized fake news into six classes according to various integrations of metadata. Embedding LSTM, depth LSTM, linguistic inquiry and word count (LIWC) CNN, and n-gram CNN were incorporated into an ensemble learning framework for discerning fake news (Huang & Chen, 2020). The self-adaptive harmony search algorithm was utilized to determine the weights of the ensemble models. On the basis of a set of explicit and latent features extracted from textual information, Zhang et al. (2019) designed an automatic fake news credibility inference model, FAKEDETECTOR, which was a deep diffusive network model that learned the representations of news articles, creators and subjects simultaneously. The results of extensive experiments demonstrated that FAKEDETECTOR outperformed other approaches. Another deep attention model based upon a recurrent neural network (RNN) was constructed to identify textual rumors on the social media platform Twitter (Chen et al., 2018). Samadi et al. (2021) combined contextualized text representation with deep neural classification for fake news detection. Comparative experiments were implemented to evaluate the performance of different combinations of pre-trained models and neural classifiers. Dai et al. (2021) addressed the aspect-level sentiment analysis task by combining syntactic information with the RoBERTa model. The results indicated that the induced tree from fine-tuned RoBERTa (FT-RoBERTa) outperformed the parser-provided tree.

A variety of commercial tools have been designed and created for editing images, making it extremely convenient to forge fake images. Academics and practitioners have proposed multiple approaches to detecting malicious image manipulation (Fast et al., 2018; Mangal & Sharma, 2020). Owing to the advantage of interpretability, domain-specific approaches paid attention to isolating physical cues within an image (Huh et al., 2018), which had proven to be very powerful in identifying resampling artifacts, misaligned blocks and other cues (Huang et al., 2010; Liu, 2011). More recent studies moved away from domain-specific approaches to machine or deep learning approaches that were concentrated on employing end-to-end learning techniques to discern false information from images (Huh et al., 2018). In order for the detection of image-to-image translation, both state-of-the-art methods and deep CNN model were used for developing image forgery detectors (Marra et al., 2018). A fully convolutional network (FCN)-based approach was utilized for localizing image splicing attacks. Salloum et al. (2018) evaluated the single-task FCN (SFCN) trained on the surface label, and the multi-task FCN (MFCN) which adopted two output branches for multi-task learning. Zhang et al. (2019) proposed a novel multiple feature reweight DenseNet (MFR-DenseNet) architecture to complete the image classification task. The MFR-DenseNet improves the representation power of the DenseNet by adaptively recalibrating the channel-wise feature responses and explicitly modeling the interdependencies between the features of different convolutional layers.

Although these approaches based upon unimodal features performed well in fake news detection, the short and informal nature of social media data poses a challenge for extracting false information (Singhal et al., 2019). False information can span multiple modalities, including text, image, video, and so on. Recently there has been growing interest in using multimodal information for identifying fake news. Given that CNNs are good at image forensics, Simonyan and Zisserman (2014) adopted pre-trained deep CNN models for feature extraction of fake news images, which were fused with textual modal information. Jin et al. (2017) established a recurrent neural network with attention mechanism (att-RNN) to fuse multimodal features of fake news. Wang et al. (2018) proposed an event adversarial network-based multimodal fake news detection model. Within the model, VGG19 and TextCNN were used to extract image and text features, respectively. An event discriminator was added to learn event-independent features and thereby promote the generalization performance of the model. To overcome the limitation of existing approaches in learning a shared representation of multimodal features, an end-to-end multimodal variational autoencoder (MVAE) network with a binary classifier was established for the task of fake news detection (Khattar et al., 2019).

The aforementioned multimodal approaches offered strong performance, which indicates that visual and other modal information contributes to the semantic enhancement of textual information. However, interrelationships between features across modalities were overlooked, and most attention was paid to the binary classification of fake news. Because the forms and degrees of fake news are complicated, binary classification can lose much of the information contained in fake news. In most existing approaches, multimodal features were fused by concatenation, which cannot capture the information interaction among text, image and other modalities. To overcome these shortcomings, the MRDCA model was proposed for fine-grained classification of fake news into six classes: true, satire/parody, misleading content, imposter content, false connection, and manipulated content. The two modalities of text and image were taken into consideration through a multimodal feature fusion mechanism based on co-attention.

3 Methodology

The developed multimodal model MRDCA for fine-grained fake news detection was composed of three modules. The first module utilized the language model of the robustly optimized BERT pre-training approach (RoBERTa) for extracting contextual text features. The second employed a dense convolutional network (DenseNet) as the image feature extraction module. The third was a multimodal fusion module that combined text features with image features to obtain multimodal feature vectors on the basis of the co-attention mechanism. Figure 1 illustrates the framework of the developed multimodal model integrating RoBERTa with DenseNet through co-attention.

Fig. 1 Framework of MRDCA

3.1 RoBERTa model

As a language representation model, the BERT has achieved state-of-the-art accuracy on a variety of natural language processing and understanding tasks (Devlin et al., 2018; Islam et al., 2022). Previously language representation models were only able to read text input sequentially from left to right or from right to left. However, they were not able to conduct both simultaneously. The model of BERT is distinguished, since it is designed with the objective of reading from both directions at the same time (Lin et al., 2022; Song et al., 2021). Based upon this bidirectional capability, BERT is pre-trained on two different, but related tasks of masked language modeling (MLM) and next sentence prediction (NSP) (Devlin et al., 2018). MLM training aims to hide a word in a sentence, and then have the program predict what word has been masked according to the context of the masked word. NSP training is to have the program predict whether two given sentences have a logical, sequential connection or whether the connection is simply random.

BERT is based on the transformer, a deep learning model in which each output element is connected to each input element, and the weightings between them are dynamically computed based upon their connections. The transformer architecture consists of two parts, an encoder and a decoder. It is an encoder-decoder network which deploys self-attention on the encoder side and attention on the decoder side (Van Aken et al., 2019; Croce et al., 2020). BERT is basically an encoder stack of the transformer architecture. Each encoder layer in the BERT model is shown in Fig. 2. Every text sequence was prepended with the special token “CLS”. The final representation for each input sequence was obtained by summing its token embedding, segment embedding and position embedding (Devlin et al., 2018). Rather than the static sinusoidal function used in the transformer, the position embedding in BERT was learned. This increased the learning effort in the pre-training stage, but the extra effort is negligible compared with the number of trainable parameters in the transformer encoder.

Fig. 2 Each encoder layer in BERT

The multi-head self-attention is an ensemble of multiple attention modules sharing the same formulation (Devlin et al., 2018; Si et al., 2020). Given a text sequence represented as the embedding matrix \(Y \in {\mathbb {R}}^{(L+1) \times D}\), L denotes the length of the text sequence, and D denotes the token and position embedding dimension. The first row in Y corresponds to the special token ‘CLS’, so there are \(L+1\) rows in Y. For a single attention head, the ‘CLS’ token and the input sequence are mapped into key, query and value triplets, represented as matrices \(K \in {\mathbb {R}}^{(L+1) \times d}\), \(Q \in {\mathbb {R}}^{(L+1) \times d}\), and \(V \in {\mathbb {R}}^{(L+1) \times d}\), where d denotes the dimension of a single head.

$$\begin{aligned} K=Y W_K, Q=Y W_Q, V=Y W_V. \end{aligned}$$
(1)

where \(\left\{ W_K, W_Q, W_V\right\} \in {\mathbb {R}}^{D \times d}\) are learnable parameters for the key, query and value of self-attention. On the basis of the three matrices K, Q and V, the attention mechanism can be calculated as

$$\begin{aligned} O_i={\text {attention}}_i(K, Q, V)={\text {softmax}}\left( \frac{Q K^T}{\sqrt{d}}\right) V \in {\mathbb {R}}^{(L+1) \times d} \end{aligned}$$
(2)

where \(i=1,2, \ldots , h\), \(h\) denotes the number of attention heads, and softmax\((\cdot )\) denotes the softmax function applied row-wise. The multi-head self-attention is defined by concatenating and projecting the representation of each head as

$$\begin{aligned} O=\left[ O_1, O_2, \ldots , O_h\right] W \in {\mathbb {R}}^{(L+1) \times D} \end{aligned}$$
(3)

where \([\cdot , \cdot ]\) denotes column-wise concatenation, and W denotes a learnable projection matrix. Based upon the multi-head self-attention, the position-wise feed forward network in Fig. 2 consisting of two fully connected layers is represented as

$$\begin{aligned} F F N(y)=\max \left( 0, u W_{1}+b_{1}\right) W_{2}+b_{2} \end{aligned}$$
(4)

where max\((0,\cdot )\) denotes the standard ReLU activation function, {\(W_1\), \(W_2\), \(b_1\), \(b_2\)} are learnable parameters, and u is the layer-normalized residual block u = LayerNorm(y + o), in which y (rows of Y) and o (rows of O) are the inputs and outputs of the multi-head self-attention based on Equations (1)–(3). The LayerNorm(·) operator is applied according to Ba et al. (2016).
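As a minimal sketch of Eqs. (1)–(4), the following NumPy code computes one encoder layer for a single sequence. It assumes that each head's projection matrices have shape D × d with d = D/h, that LayerNorm is applied without learnable gain and bias, and that the hidden width of the feed-forward network is 4D; these choices, the random weights, and the function names are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize over the feature axis (no learnable gain/bias in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(Y, heads, W, W1, b1, W2, b2):
    """Y: (L+1, D). heads: list of (W_K, W_Q, W_V), each of shape (D, d)."""
    outputs = []
    for W_K, W_Q, W_V in heads:
        K, Q, V = Y @ W_K, Y @ W_Q, Y @ W_V           # Eq. (1)
        d = K.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d), axis=-1)     # attention weights
        outputs.append(A @ V)                          # Eq. (2): (L+1, d)
    O = np.concatenate(outputs, axis=-1) @ W           # Eq. (3): (L+1, D)
    U = layer_norm(Y + O)                              # u = LayerNorm(y + o)
    return np.maximum(0.0, U @ W1 + b1) @ W2 + b2      # Eq. (4)

# Illustrative sizes and random weights (assumptions, not the paper's values).
rng = np.random.default_rng(0)
L, D, h = 5, 16, 4
d = D // h
Y = rng.normal(size=(L + 1, D))
heads = [tuple(rng.normal(size=(D, d)) * 0.1 for _ in range(3)) for _ in range(h)]
W = rng.normal(size=(h * d, D)) * 0.1
W1, b1 = rng.normal(size=(D, 4 * D)) * 0.1, np.zeros(4 * D)
W2, b2 = rng.normal(size=(4 * D, D)) * 0.1, np.zeros(D)

out = encoder_layer(Y, heads, W, W1, b1, W2, b2)
print(out.shape)  # (6, 16)
```

The output keeps the input shape of (L+1) × D, which is what allows encoder layers to be stacked.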

Although the model of BERT has been shown to be a promising language model, it also received scrutiny on its training and pre-processing (Delobelle et al., 2020). Liu et al. (2019) developed the model of RoBERTa as an improved recipe for training BERT models, but RoBERTa maintains the same model architecture as BERT. The first modification is the removal of the objective of NSP that was designed for performance promotion on the downstream tasks in BERT. The BERT consists of two tasks of MLM and NSP. Comparing to BERT, RoBERTa trains the model on longer sequences and makes the masked language modeling more difficult. Therefore, the task of MLM overlaps the topic prediction task, and hence the task of NSP becomes redundant. The second modification is to pre-train with sequences of at most 512 tokens. The RoBERTa does not randomly inject short sequences, and it does not train with a reduced sequence length for the first 90% of updates. The full-length sequences are merely trained in RoBERTa.

The third modification of RoBERTa is to train the model longer, with a larger batch size, over more data. The next modification is to dynamically change the masking pattern. BERT relies on randomly masking and predicting tokens. The original BERT implementation performs masking once during data preprocessing, which results in a single static mask. In RoBERTa, the masking is performed during training. Each time a sentence is incorporated in a mini-batch, a new mask is generated, so the number of potentially different masked versions of each sentence is not bounded as it is in BERT. This becomes crucial when pre-training for more steps or with larger datasets. Another modification is that RoBERTa adopts a larger byte-level BPE (Byte-Pair Encoding) vocabulary. BPE is a hybrid between character- and word-level representations that allows handling the large vocabularies common in natural language corpora. Based on these modifications, the RoBERTa model obtains better text representation capability than BERT (Liu et al., 2019). Therefore, this study employed RoBERTa for extracting text features from fine-grained fake news, and the output vector corresponding to ‘CLS’ was used as the feature vector of the news text. The final text feature extracted by the RoBERTa model is denoted \({R}_{T}\); its dimension is batch-size × d_model, where d_model is 768.
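The difference between static and dynamic masking can be illustrated with a small sketch. It is a simplification under stated assumptions: every token is masked independently with probability 0.15, the 80/10/10 replacement rule of BERT is omitted, and the `mask_tokens` helper and the example sentence are hypothetical.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, prob=0.15):
    """Replace each token with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else tok for tok in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()

# Static masking (original BERT preprocessing): the pattern is drawn once,
# and the same masked sentence is reused in every epoch.
static_rng = random.Random(0)
static_version = mask_tokens(sentence, static_rng)
static_epochs = [static_version for _ in range(4)]

# Dynamic masking (RoBERTa): a fresh pattern is drawn every time the
# sentence is placed in a mini-batch.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(sentence, dynamic_rng) for _ in range(4)]

for epoch, masked in enumerate(dynamic_epochs):
    print(epoch, " ".join(masked))
```

With static masking the model sees one masked version of a sentence regardless of how long it trains, whereas with dynamic masking each pass can expose a different set of masked positions.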

3.2 DenseNet model

Convolutional neural networks (CNNs) such as GoogLeNet, VGG-19, Inception and ResNet have become the dominant machine learning approaches in the field of computer vision [56] (Szegedy et al., 2017). In a standard CNN model, an image is taken as the input and passed through the network in a straightforward forward pass to obtain a predicted output label. Each convolutional layer, except the first one (which takes the input image), takes the output of the previous convolutional layer and produces an output feature map that is passed to the next convolutional layer. For L layers, there are L direct connections, one between each layer and its subsequent layer.

DenseNet modifies the architecture of a standard CNN, as depicted in Fig. 3. Each layer in DenseNet is connected to every other layer, hence the name densely connected convolutional network. For M layers, there are \(M(M+1)/2\) direct connections. For each layer, the feature maps of all the preceding layers are used as inputs, and its own feature maps are used as inputs for all subsequent layers. Given a DenseNet architecture with M layers, each layer performs a non-linear transformation \(H_i\). The output of the \(i\)th layer of the architecture is represented as \(x_i\), and the input image is represented as \(x_0\). The corresponding dense connectivity can be represented as

$$\begin{aligned} {\varvec{x}}_{i}=H_{i}\left( \left[ \textrm{x}_{0}, \textrm{x}_{1}, \textrm{x}_{2}, \ldots , \textrm{x}_{\textrm{i}-1}\right] \right) \end{aligned}$$
(5)
Fig. 3 Architecture of DenseNet

In a DenseNet architecture, every layer is essentially connected to every other layer. This is the main idea, and it is extremely powerful. The input of a layer inside DenseNet is the concatenation of the feature maps from previous layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters (Huang et al., 2017). DenseNet was originally designed for the 1000-class classification task on the ImageNet dataset. In this study, a fully connected layer was attached after the DenseNet network to extract the image features. Firstly, the picture is pre-processed with operations such as cropping, resizing, and rotation. The picture is then converted into an RGB three-channel tensor V for computational processing, with dimension batch-size × 3 × 224 × 224. Secondly, the tensor V is processed by the DenseNet network. Finally, a linear layer maps the result to 768 dimensions, so that the output \({R}_{V}\) has dimension batch-size × d_model. The model weights adopted in this study were initialized from DenseNet161 pre-trained on the ImageNet dataset and were further trained on the news dataset for parameter update.
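The connection count and the size of the feature vector fed to the final linear layer can be checked with a short calculation. The block configuration (6, 12, 36, 24), growth rate 48, initial width 96, and 0.5 compression in the transition layers are taken from the published DenseNet-161 configuration (Huang et al., 2017), not from this paper, and the helper names are hypothetical.

```python
def direct_connections(num_layers):
    """Direct connections among M densely connected layers: M(M+1)/2."""
    return num_layers * (num_layers + 1) // 2

def densenet_channels(block_config, growth_rate, init_features):
    """Track the channel count through dense blocks and transition layers."""
    channels = init_features
    for i, num_layers in enumerate(block_config):
        # Each layer in a dense block appends `growth_rate` feature maps.
        channels += num_layers * growth_rate
        # Every block except the last is followed by a transition layer
        # that halves the number of channels.
        if i != len(block_config) - 1:
            channels //= 2
    return channels

final = densenet_channels((6, 12, 36, 24), growth_rate=48, init_features=96)
print(direct_connections(5))  # 15
print(final)                  # 2208
```

Under these assumptions the linear layer described above would map a 2208-dimensional pooled feature vector to the 768 dimensions of \({R}_{V}\).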

3.3 Co-attention mechanism

The co-attention mechanism shown in Fig. 4 is used for fusing text features and image features from the news dataset. In the detection of multimodal fine-grained fake news, there are certain relationships between texts and images. Rather than directly concatenating text and image features, the co-attention mechanism was deployed to model dense interactions between text and image features by exchanging their information. It produced an attention-pooled feature for one modality (e.g. text) conditioned on another modality (e.g. image). Texts and images were connected by calculating the similarity between text-image pairs of features. If \({R}_{T}\) came from the text and \({R}_{V}\) came from the attached image, the attention value calculated using \({R}_{T}\) and \({R}_{V}\) could be used as a measure of the similarity between the text and the image, and then used to weight the image features. Given the text features \({R}_{T} \in {\mathbb {R}}^{d \times T} \) and the image features \({R}_{V} \in {\mathbb {R}}^{d \times N}\), the affinity matrix \(C \in {\mathbb {R}}^{T \times N} \) was calculated as

$$\begin{aligned} C=\tanh \left( {R}_{T}^{T} W_{b} {R}_{V}\right) \end{aligned}$$
(6)

where \(W_{b} \in {\mathbb {R}}^{d \times d}\) is the parameter matrix. After the calculation of affinity matrix, it could be considered as a feature to learn as well as predict the attention weights of text and image by using Eqs. (7) and (8).

$$\begin{aligned} H^{t}= & {} \tanh \left( W_{t} {R}_{T}+\left( W_{v} {R}_{V}\right) C^{T}\right) , H^{v}=\tanh \left( W_{v} {R}_{V}+\left( W_{t} {R}_{T}\right) C\right) \end{aligned}$$
(7)
$$\begin{aligned} \alpha ^{t}= & {} \text {softmax}\left( W_{h t}^{T} H^{t}\right) , \alpha ^{v}=\text {softmax}\left( W_{h v}^{T} H^{v}\right) \end{aligned}$$
(8)

where \(W_{t} \in {\mathbb {R}}^{k \times d}\), \(W_{v} \in {\mathbb {R}}^{k \times d}\), \(W_{h t} \in {\mathbb {R}}^{k}\), and \(W_{h v} \in {\mathbb {R}}^{k}\) are parameter matrices, \(\alpha ^{t} \in {\mathbb {R}}^{T}\) denotes the attention weight of each text word, and \(\alpha ^{v} \in {\mathbb {R}}^{N}\) denotes the attention weight of each image region. On the basis of the attention weights, the attended text and image vectors were represented as

$$\begin{aligned} {\widehat{t}}=\sum _{i=1}^{T} \alpha _{i}^{t} t_{i}, \quad {\widehat{v}}=\sum _{n=1}^{N} \alpha _{n}^{v} v_{n} \end{aligned}$$
(9)

Later the two vectors were spliced (see Eq. 10) for obtaining \({R}_{F}\), which was finally input into the classifier of softmax for the classification of fine-grained fake news.

$$\begin{aligned} {R}_{F}={\widehat{v}} \oplus \widehat{{t}} \end{aligned}$$
(10)
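Equations (6)–(10) can be sketched in NumPy as follows. The sketch assumes that \(t_i\) and \(v_n\) are the columns of \({R}_{T}\) and \({R}_{V}\), that ⊕ denotes concatenation, and that the sizes d, T, N and k and the random weights are arbitrary illustrative values; the function name `co_attention` is hypothetical.

```python
import numpy as np

def softmax(x):
    """Softmax over a 1-D vector with max-subtraction for stability."""
    e = np.exp(x - x.max())
    return e / e.sum()

def co_attention(R_T, R_V, W_b, W_t, W_v, w_ht, w_hv):
    """R_T: (d, T) text features; R_V: (d, N) image features."""
    C = np.tanh(R_T.T @ W_b @ R_V)                  # Eq. (6): (T, N)
    H_t = np.tanh(W_t @ R_T + (W_v @ R_V) @ C.T)    # Eq. (7): (k, T)
    H_v = np.tanh(W_v @ R_V + (W_t @ R_T) @ C)      # Eq. (7): (k, N)
    alpha_t = softmax(w_ht @ H_t)                   # Eq. (8): (T,)
    alpha_v = softmax(w_hv @ H_v)                   # Eq. (8): (N,)
    t_hat = R_T @ alpha_t                           # Eq. (9): (d,)
    v_hat = R_V @ alpha_v                           # Eq. (9): (d,)
    R_F = np.concatenate([v_hat, t_hat])            # Eq. (10): (2d,)
    return R_F, alpha_t, alpha_v

# Illustrative sizes and random weights (assumptions, not the paper's values).
rng = np.random.default_rng(0)
d, T, N, k = 8, 5, 7, 4
R_T = rng.normal(size=(d, T))
R_V = rng.normal(size=(d, N))
W_b = rng.normal(size=(d, d)) * 0.1
W_t = rng.normal(size=(k, d)) * 0.1
W_v = rng.normal(size=(k, d)) * 0.1
w_ht = rng.normal(size=k)
w_hv = rng.normal(size=k)

R_F, alpha_t, alpha_v = co_attention(R_T, R_V, W_b, W_t, W_v, w_ht, w_hv)
print(R_F.shape)  # (16,)
```

Because each modality's attention weights depend on the affinity matrix C, the pooled text vector changes when the image changes and vice versa, which plain concatenation of unimodal features cannot express.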
Fig. 4 Architecture of co-attention

4 Empirical study

4.1 Dataset

The collection of adequate data for fake news analysis from the internet is deemed one of the major problems in the field (Bondielli & Marcelloni, 2019; Nakamura et al., 2019). Due to multiple difficulties in gathering relevant data, there are not many datasets which are publicly available for fake news research and detection. Comparing to other existing datasets, the Fakeddit is a multimodal benchmark dataset providing a large number of multimodal samples with multiple labels for various levels of fine-grained classification (Nakamura et al., 2019). A total of 1,063,106 samples have been gathered in the dataset of Fakeddit that incorporates 2-way, 3-way, and 6-way classification labels with comment data and metadata. It offers a large breadth of novel features which can be utilized for a variety of applications. Therefore, the Fakeddit was adopted as the source of data for this research.

All the samples in Fakeddit were obtained from Reddit, which claims to be the front page of the internet (https://www.redditinc.com/). It is a website where a community of registered users submits content. Whether users follow breaking news, sports, TV fan theories, or a never-ending stream of the internet’s cutest animals, there is likely a community on Reddit for them. Reddit is basically a large group of forums where users can post submissions on various specialized forums, often called “subreddits” (Anderson, 2015). Its format resembles a traditional bulletin board system, allowing users to post messages and links to other websites and to comment on each other’s posts. Reddit is one of the top 20 websites in the world by traffic (Nakamura et al., 2019). The samples in Fakeddit were collected from 22 different subreddits. Three labels are provided for every sample, which allows training for 2-way, 3-way, and 6-way classification (Kirchknopf et al., 2021). Rather than a simple binary or ternary classification, the 6-way classification was created to categorize fake news into different types, which is beneficial for demonstrating the degree and variation of fake news. Therefore, it was employed for fine-grained fake news detection. Table 1 explains the 6-way classification labels (i.e. true, satire/parody, misleading content, imposter content, false connection, and manipulated content) in Fakeddit. Some examples with 6-way classification labels are provided in Fig. 5.

Table 1 Labels of 6-way classification in the Fakeddit
Fig. 5 Examples with classification labels

Owing to the restriction of experimental conditions, 30,053 samples that included both text and image were randomly selected from Fakeddit. This subset maintained a distribution of 6-way classification labels similar to that of the original Fakeddit dataset. Table 2 shows the data distribution of 6-way classification labels in the training set, validation set, and test set, respectively.
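The paper does not state how the label distribution was preserved during sampling. One common way to obtain such a subset is stratified sampling, sketched below under that assumption; the `stratified_sample` helper and the toy three-class label list are hypothetical and do not reflect the actual Fakeddit counts.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return indices sampled so that each class keeps roughly its original share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Sample the same fraction from every class (at least one item).
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy, imbalanced label list (illustrative only).
labels = [0] * 600 + [1] * 300 + [2] * 100
subset = stratified_sample(labels, fraction=0.1)
counts = Counter(labels[i] for i in subset)
print(dict(counts))  # {0: 60, 1: 30, 2: 10}
```

The class proportions in the subset (60%, 30%, 10%) match those of the full list, which is the property described above for the sampled dataset.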

Table 2 Data distribution in datasets

4.2 Experimental settings

PyTorch is a free and open-source library that is mainly adopted for computer vision, deep learning, and natural language processing applications. Unlike other popular deep learning frameworks that utilize static computation graphs, PyTorch employs dynamic computation, which allows greater flexibility in building complex architectures (Chen et al., 2019; Subramanian, 2018). This research thus used the modules in PyTorch to specify the MRDCA model. Considering that there are a large number of parameters in both the RoBERTa and DenseNet modules, a cloud server was utilized for the experiments. The training of the MRDCA model was implemented on a Windows machine with an Intel Xeon Gold 6240C processor (36 cores, 72 threads, and 24.75 MB cache). The graphics card was an NVIDIA GeForce RTX 3090 with about 24 GB of graphics memory. Table 3 shows the configuration parameters of the machine.

Table 3 Configuration parameters of the machine

The MRDCA model integrated RoBERTa with DenseNet on the basis of the co-attention feature fusion mechanism. Given that DenseNet161 is the largest model in the DenseNet group, with a size of around 100 MB (Mai et al., 2020), we adopted DenseNet161, which includes four dense blocks. Early stopping is a form of regularization based on choosing when to stop running an iterative algorithm (Caponnetto & Yao, 2010; Raskutti et al., 2014). The strategy of early stopping was applied to improve model accuracy. Accordingly, the model was set to train for 40 epochs, and training was stopped at 5000 steps, when the results were no longer improving. The AdamW optimizer was used with a learning rate of 3e−5 and a batch size of 32. The loss function used by the model was the cross-entropy loss. As can be seen from the loss curve in Fig. 6, the model reached its best performance after about 4000 steps (with further training, the loss on the validation set no longer decreased). Table 4 lists the corresponding hyperparameters of the model.

Table 4 Hyperparameters of the model
Fig. 6 Loss curve
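The early stopping strategy described above can be sketched as a patience-based rule on the validation loss. The paper does not report a patience value or the evaluation interval, so the patience of 3, the `EarlyStopping` class, and the loss values below are hypothetical and serve only to show the mechanism.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` checks."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation result; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation losses, one per evaluation point (illustrative only).
val_losses = [1.20, 0.95, 0.80, 0.78, 0.79, 0.81, 0.80, 0.82]
stopper = EarlyStopping(patience=3)
stopped_at = None
for check, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = check
        break

print(stopped_at, stopper.best)  # 6 0.78
```

In a training loop, the weights saved at the best validation loss would normally be the ones kept for evaluation.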

4.3 Performance metrics

Based on the confusion matrix, four common indicators, namely accuracy, precision, recall, and \(F_1\) score, were employed for evaluating the performance of the MRDCA model. As the most intuitive performance indicator (Zhou et al., 2021), accuracy is defined as the ratio of correct predictions to all observations. It shows how often the model can be expected to predict an outcome correctly out of the total number of predictions it makes. Precision measures the proportion of positively predicted labels that are actually correct. It is a useful indicator of the success of prediction when the classes are very imbalanced. Also known as sensitivity, recall represents the model’s ability to correctly predict the positives out of the actual positives. Precision is usually adopted in conjunction with recall to trade off false positives against false negatives. The \(F_1\) score represents the model’s performance as a function of both precision and recall. It is a well-established classification performance indicator that conveys a balance between precision and recall. In comparison to accuracy, the \(F_1\) score is more informative and transparent in a problem that exhibits class imbalance (Hunt et al., 2022).

For a 2-way classification problem, the confusion matrix consists of four outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN) (Raschka, 2014; Veropoulos et al., 1999). True positives are the observations that the model correctly predicts as positive, and true negatives are the observations that the model correctly predicts as negative. False positives are the observations predicted as positive whose actual class is negative, and false negatives are the observations predicted as negative whose actual class is positive (Zhou et al., 2021). Using the values of TP, FP, TN and FN, the four indicators of accuracy, precision, recall, and \(F_1\) score can be calculated as in Eqs. (11) to (14) (Grandini et al., 2020).

$$\begin{aligned} \text { Accuracy }= & {} \frac{T P+T N}{T P+F P+F N+T N} \end{aligned}$$
(11)
$$\begin{aligned} \text { Precision }= & {} \frac{T P}{T P+F P} \end{aligned}$$
(12)
$$\begin{aligned} \text { Recall }= & {} \frac{T P}{T P+F N} \end{aligned}$$
(13)
$$\begin{aligned} F_{1}= & {} \frac{2 \times \text { Precision } \times \text { Recall }}{\text { Precision }+\text { Recall }}=\frac{2 \times T P}{2 \times T P+F N+F P} \end{aligned}$$
(14)
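
As a small worked illustration of Eqs. (11) to (14), the following sketch computes the four indicators from the confusion matrix counts. The function name `binary_metrics` and the example counts are assumptions made for illustration only; they are not values from the experiments.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are returned as 0.0, a common convention
    that is assumed here rather than taken from the paper.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0          # Eq. (11)
    precision = tp / (tp + fp) if (tp + fp) else 0.0        # Eq. (12)
    recall = tp / (tp + fn) if (tp + fn) else 0.0           # Eq. (13)
    denom = 2 * tp + fn + fp
    f1 = 2 * tp / denom if denom else 0.0                   # Eq. (14), count form
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 80 TP, 20 FP, 90 TN, 10 FN.
m = binary_metrics(tp=80, fp=20, tn=90, fn=10)
```

For these counts the accuracy is 170/200, the precision is 80/100, the recall is 80/90, and the \(F_1\) score is 160/190, which equals the harmonic mean of precision and recall as stated in Eq. (14).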

The 6-way classification in this research can be decomposed into distinct 2-way classification problems, so the precision defined in Eq. (12) can be calculated separately for each class. The macro average precision (\(Macro\_Avg\_Precision\)) is simply the arithmetic mean of the precisions of the individual classes, as shown in Eq. (15) (Grandini et al., 2020). Similarly, the macro average recall (\(Macro\_Avg\_Recall\)) and the macro average \(F_1\) score (\(Macro\_Avg\_F_1\)) can be calculated as in Eqs. (16) and (17), respectively.

$$\begin{aligned} \text {Macro}\_\text {Avg}\_\text {Precision}= & {} \frac{\sum _{i=1}^{n} \text { Precision }_{i}}{n} \end{aligned}$$
(15)
$$\begin{aligned} \text { Macro}\_\text {Avg}\_\text {Recall }= & {} \frac{\sum _{i=1}^{n} \text { Recall }_{i}}{n} \end{aligned}$$
(16)
$$\begin{aligned} \text {Macro}\_\text {Avg}\_{F}_{1}= & {} \frac{\sum _{i=1}^{n} F_{1 i}}{n} \end{aligned}$$
(17)

where n denotes the number of classes, \(Precision_i\) denotes the value of precision for the class \(i \in \{1,2, \ldots , n\}\), \(Recall_i\) denotes the value of recall for the class \(i \in \{1,2, \ldots , n\}\), and \({F_{1i}}\) denotes the value of \(F_1\) for the class \(i \in \{1,2, \ldots , n\}\). The macro average approaches compute an overall mean of the per-class indicators of precision, recall and \(F_1\) score. They do not depend on class size, since classes of different sizes are equally weighted in the numerator. This means that the largest class has the same influence on these indicators as the small classes. The resulting indicators evaluate the model from a class standpoint. High values demonstrate that the model performs well on all the classes, whereas low values indicate that some classes are poorly predicted (Grandini et al., 2020; Zubiaga et al., 2018).
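
The macro averages in Eqs. (15) to (17) can be illustrated with a short sketch that treats each class as its own 2-way problem. The function names and the 3-class confusion matrix below are hypothetical; the matrix is not data from Fakeddit, and rows are assumed to hold actual classes while columns hold predicted classes.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) for each class of a square confusion
    matrix whose rows are actual classes and columns are predicted classes."""
    n = len(confusion)
    results = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                        # actual i, predicted otherwise
        fp = sum(confusion[r][i] for r in range(n)) - tp   # predicted i, actually otherwise
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = 2 * tp + fn + fp
        f1 = 2 * tp / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def macro_averages(confusion):
    """Arithmetic means of the per-class precision, recall and F1 scores."""
    metrics = per_class_metrics(confusion)
    n = len(metrics)
    macro_p = sum(p for p, _, _ in metrics) / n   # Eq. (15)
    macro_r = sum(r for _, r, _ in metrics) / n   # Eq. (16)
    macro_f1 = sum(f for _, _, f in metrics) / n  # Eq. (17)
    return macro_p, macro_r, macro_f1


# Hypothetical 3-class confusion matrix with 10 samples per actual class.
example = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
macro_p, macro_r, macro_f1 = macro_averages(example)
```

Because every class contributes one term to each sum, a class with few samples influences the macro averages exactly as much as a class with many samples.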

4.4 Comparison experiments

To verify the effectiveness of the MRDCA model for fake news detection, multiple experiments were designed for performance comparison. Two models, BERT and RoBERTa, were used to extract features from textual data. Three models, VGG19, ResNet50, and DenseNet161, were used to extract features from image data. Concatenation was employed as the fusion mechanism in the multimodal comparison experiments. The first group of comparison experiments adopted a unimodal approach for fake news detection on the basis of text or image samples, as indicated in Table 5. Table 6 lists the second group of comparison experiments, which adopted a multimodal approach based on both text and image samples.

Table 5 Comparison experiments with a unimodal approach
Table 6 Comparison experiments with a multimodal approach
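
The concatenation fusion used in the multimodal comparison experiments can be sketched as follows. This is a toy illustration only: the function names, the tiny feature vectors, and the weights of the linear layer are assumptions introduced for the example, and real text and image feature vectors are far longer than those shown here.

```python
def concat_fusion(text_features, image_features):
    """Fuse two modality feature vectors by placing them end to end."""
    return list(text_features) + list(image_features)


def linear_logits(fused, weights, bias):
    """Apply a linear layer (one weight row per class) to the fused vector."""
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]


# Toy feature vectors standing in for the outputs of a text encoder and an
# image encoder (hypothetical values and dimensions).
text_vec = [0.2, -0.1, 0.5]
image_vec = [0.7, 0.3]
fused = concat_fusion(text_vec, image_vec)

# Hypothetical 2-class linear classifier over the 5-dimensional fused vector.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
]
bias = [0.0, 0.5]
logits = linear_logits(fused, weights, bias)
```

Concatenation keeps the two modalities side by side and leaves any interaction between them to the layers that follow, whereas the co-attention mechanism in MRDCA models the interaction between text and image features explicitly.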

5 Results and analysis

Table 7 presents the experimental results of fake news detection for various selections of modalities. With respect to the single text modality, RoBERTa had higher values than BERT for all four indicators: accuracy (83.63%), macro average precision (84.89%), macro average recall (82.60%), and macro average \(F_1\) score (83.63%). This shows that RoBERTa performed better than BERT in detecting fake news based on textual feature extraction, which is further evidence of RoBERTa’s strong ability in semantic processing and comprehension. As to the single image modality, the three models of VGG19, ResNet50, and DenseNet161 did not perform well, since the values of all four indicators were lower than 65%. Among the three models, ResNet50 achieved the highest accuracy at 63.37%. The values of macro average recall and macro average \(F_1\) score were close to 50%. This indicated that the three CNN models were not able to extract adequate features from image samples for the identification of fake news. A possible explanation is that registered users of Reddit usually submitted or forwarded images irrelevant to the individual topic, so these images did not provide enough valid information for detecting fake news.

Table 7 Performance of different models in detecting fake news

The experimental results showed that some multimodal models had lower values of the four indicators (accuracy, macro average precision, macro average recall, and macro average \(F_1\) score) than the models with the single text modality. For instance, Fig. 7 compares the performance of BERT and BERT + VGG19 in fake news detection. As displayed, the values of all four indicators for the single-modality model BERT were larger than those for the multimodal model BERT + VGG19, so the former performed better than the latter. This indicated that the image features weakened the contribution of the text features in discerning fake news. Other comparisons, such as that between RoBERTa and RoBERTa + ResNet50, presented similar results. Although the multimodal model RoBERTa + ResNet50 had slightly higher values of accuracy and macro average recall, the single-modality model RoBERTa had slightly higher values of macro average precision and macro average \(F_1\) score. These comparisons demonstrated that image features did not always improve the performance of fake news detection.

Fig. 7 Performance comparison between BERT and BERT + VGG19

In contrast, text features did reinforce the role of image features in determining fake news. Compared with the single image modality, all multimodal models had larger values of the four indicators (i.e. accuracy, macro average precision, macro average recall, and macro average \(F_1\) score). Figure 8 gives an example that compares the performance of ResNet50 and RoBERTa + ResNet50 in fake news detection. After text features were added to image features, the performance in determining fake news improved significantly. As shown in Fig. 8, the value of every indicator for the multimodal model RoBERTa + ResNet50 increased by approximately 20% compared with the unimodal ResNet50. Therefore, text features had a consistently positive influence on the performance of fake news detection. One possible explanation is that texts convey information more clearly and exactly than images.

Fig. 8 Performance comparison between ResNet50 and RoBERTa + ResNet50

Among the multimodal models based on the concatenation fusion mechanism, RoBERTa + DenseNet161 had larger values of accuracy, macro average precision, and macro average \(F_1\) score than the other models. This indicated that the image features extracted by DenseNet161 strengthened the role of the text features extracted by RoBERTa in the identification of fake news. Therefore, RoBERTa and DenseNet161 were selected for developing the integrated multimodal approach in this research. Comparing RoBERTa + DenseNet161 with MRDCA, we found that the latter performed better on all four indicators, which suggests that co-attention played a more important role than concatenation in fusing text features with image features for fake news detection. Relative to RoBERTa + DenseNet161, MRDCA achieved an increase of 2.59% in accuracy, 0.17% in macro average precision, 6.17% in macro average recall, and 3.16% in macro average \(F_1\) score. Among all the models in Table 7, MRDCA performed best in detecting fake news with 6-way classification.

Table 8 presents the performance of the MRDCA model in discerning fake news for each individual class. As indicated, MRDCA had a high accuracy of 88.14%, meaning that only 11.86% of test samples were not classified correctly. Among all six classes, MRDCA performed best in identifying manipulated content, with the highest values of precision (94.90%), recall (94.69%), and \(F_1\) score (94.80%). MRDCA also performed well in detecting false connection and true, with relatively high values of precision, recall and \(F_1\) score for these two classes. Precision was higher than recall for these three classes. This meant that there were fewer false positives than false negatives, and that MRDCA was strict in its criteria for classifying samples as manipulated content, false connection or true.

Table 8 Performance of MRDCA in fake news detection for individual class

In contrast, MRDCA had the poorest performance in identifying misleading content, with the lowest values of precision (76.52%), recall (83.56%), and \(F_1\) score (79.88%). A possible explanation is that this category of fake news consists of misleading information created with the intention to deceive the audience, and this intention makes misleading content harder to detect. MRDCA did not perform well in classifying imposter content and satire/parody either, since their values of precision, recall and \(F_1\) score were smaller than the corresponding macro average values. Therefore, categorizing samples into the three classes of misleading content, imposter content and satire/parody was extremely challenging, and there is much room for improvement. As shown in Table 8, recall was higher than precision for these three classes, especially for misleading content, and the relatively low \(F_1\) scores can be attributed to the correspondingly low values of precision. Recall exceeding precision indicated that there were fewer false negatives than false positives, and that MRDCA was more permissive in its criteria for classifying samples as misleading content, imposter content or satire/parody.
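
As a consistency check on the per-class values quoted above, the \(F_1\) score for misleading content can be recomputed from the reported precision and recall with Eq. (14). This is a worked illustration based only on the rounded percentages given in the text:

$$\begin{aligned} F_{1}= & {} \frac{2 \times 76.52 \times 83.56}{76.52+83.56}=\frac{12788.02}{160.08} \approx 79.89 \end{aligned}$$

The result agrees with the reported value of 79.88% to within the rounding of the inputs, and it shows how the lower precision pulls the \(F_1\) score below the recall.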

6 Conclusions

Considering the multiple aspects of misinformation in fake news, this research developed a multimodal model, MRDCA, for detecting fake news. The main contributions to the body of knowledge are threefold. Firstly, based upon the 6-way classification labels in Fakeddit, the multimodal model MRDCA was designed for fine-grained fake news detection. Within MRDCA, RoBERTa and DenseNet161 were combined through a co-attention feature fusion mechanism, serving as the text feature extractor and the image feature extractor, respectively. Secondly, the co-attention mechanism was used to dynamically learn and capture the information interaction between text and image features. It supports the fusion of text and image modalities and thereby helps to accomplish the fine-grained fake news classification task. Thirdly, multiple experiments on the benchmark dataset Fakeddit validated the strength of MRDCA. The experimental results demonstrated that the multimodal model MRDCA outperformed unimodal approaches and other multimodal approaches. Fine-grained fake news detection contributes to a better understanding of the degree of fakeness, which is the foundation for further investigating the characteristics, motivations, and spreading patterns of an individual class of fake news.

In spite of these contributions, some limitations of this research ought to be acknowledged and addressed in future work. Firstly, online news articles are time-sensitive, and fake news can be created in real time. The benchmark dataset Fakeddit spans more than a decade, from 2008 to 2019 (Nakamura et al., 2019). There ought to be trending topics or events of fake news in which the masses were interested during this period. The samples from Fakeddit can be separated into multiple periods for fake news detection with MRDCA, so that distinct patterns of fake news across these periods can be determined. Secondly, within the multimodal approach of MRDCA, only two modalities, text and image, were used, with features extracted by RoBERTa and DenseNet161, respectively. Features from other modalities such as audio, video, and metadata could contribute to the semantic enhancement of news content. Therefore, fusing these features with text and image features may improve the effectiveness of fake news detection in the future. Thirdly, although MRDCA outperformed unimodal approaches and other multimodal approaches in fake news detection with 6-way classification, its performance was unbalanced across the different classes of fake news. The experimental results showed that MRDCA performed better in detecting manipulated content, false connection and true than in detecting imposter content, misleading content, and satire/parody. More attention should be paid to the detection of imposter content, misleading content, and satire/parody in future studies.