MRDCA: a multimodal approach for fine-grained fake news detection through integration of RoBERTa and DenseNet based upon fusion mechanism of co-attention

Qian, Lingfei; Xu, Ruipeng; Zhou, Zhipeng

doi:10.1007/s10479-022-05154-9

Original Research
Published: 26 December 2022

Annals of Operations Research

Abstract

Fake news, widely produced to mislead and persuade the public with biased information, has a significantly negative influence on society as a whole. For effective detection of fine-grained fake news, this study developed a multimodal approach integrating RoBERTa with DenseNet through a co-attention fusion mechanism (MRDCA). RoBERTa was employed for extracting text features, and DenseNet was employed for extracting image features. The co-attention mechanism had the advantage of dynamically learning and capturing information interaction between text and image modal features. On the multimodal fine-grained fake news dataset, MRDCA achieved higher values on all indicators, namely accuracy (88.14%), macro average precision (87.16%), macro average recall (87.94%), and macro average F1 score (87.51%), than unimodal approaches and other multimodal approaches that fuse features by concatenation. However, the performance of MRDCA was unbalanced across the different classes of fake news. Experimental results demonstrated that MRDCA performed better in identifying manipulated content, false connection and true than in identifying imposter content, misleading content and satire/parody. Therefore, the task of classifying samples into misleading content, imposter content and satire/parody was extremely challenging, and there is much room for improving performance on these three classes in future work.

1 Introduction

Along with the advancement of new technology and proliferation of internet, various types of information are being disseminated with an unprecedented speed (Kumar et al., 2021; Rubin, 2017). Traditional media are interwoven with online social media, and patterns of information dissemination have been completely changed (Zhou et al., 2021). The masses do not only receive information passively, but also actively participate in producing and sharing information (Fast et al., 2018; Patacconi & Vikander, 2015). However, unverified information that may be false spreads just like accurate information through the internet, thus possibly going viral and having a significant influence on public opinions and decisions (Bondielli & Marcelloni, 2019; Yuan et al., 2018; Zhang & Ghorbani, 2020). Misinformation (Fallis, 2015) generated and disseminated on social network platforms is called misinformation in social networks. Fake news is one of the representatives. Rumor is a social phenomenon in which news content or subjective opinions and comments are spread through social networks (Bondielli & Marcelloni, 2019; Kumar & Shah, 2018). The authenticity of the rumor content is unproven, but when it is proved to be wrong, the rumor becomes misinformation. Fake news is deemed one of the most popular forms of false or unverified information, and it is necessary to determine fake news as early as possible for decreasing its detrimental impacts on the society as a whole.

One potential reason for the broad dissemination of fake news is a lack of adequate knowledge and competency among the masses. The public is often unaware of the credibility of information sources and the authenticity of the news it reads (Castillo et al., 2011; Gupta et al., 2014). Another reason is that automated fact-checking methods are not available. Although multiple websites, including Politifact and AltNews, have made great efforts on fake news detection, most of them depend on manual approaches that are time consuming and too slow to prevent the initial dissemination of fake news (Singhal et al., 2019). Previous studies on fake news detection mainly concentrated on extracting false information from the text content of news (Bondielli & Marcelloni, 2019; Faustini & Covões, 2020; Reddy et al., 2020; Reis et al., 2019). However, fake news is commonly produced in diverse forms such as image and text, so multimodal approaches are needed for detecting it. After the extraction of multimodal features from text, image, audio, video and so on, feature fusion becomes crucial for effective detection of false information in news.

Fake news detection was concentrated on determining whether a piece of news was true or false (Goldani et al., 2021; Helmstetter & Paulheim, 2018; Singhal et al., 2019), which was considered to be a binary classification problem. As a matter of fact, false information within various news is complicated. There are multiple categories of fake news such as misleading content, imposter content, fabricated content, false connection, false context, satire or parody, and manipulated content (Nakamura et al., 2019). Fine-grained object detection of fake news is conducive to better understanding of the degree of fakeness (Fung et al., 2021; Goldani et al., 2021). It is helpful for researchers to further explore and identify characteristics of a specific category of fake news. Many pieces of news with varying degree of fakeness may be created with distinct motivations, and will disseminate through different paths or patterns.

Therefore, this research was motivated to establish a multimodal approach integrating RoBERTa with DenseNet through a co-attention mechanism (MRDCA) for fake news detection based upon fine-grained categorization. The model of MRDCA took into account the two modalities of text and image. Within the MRDCA, RoBERTa was employed for extracting text features of fake news, and DenseNet was employed for extracting image features of fake news. On the basis of the co-attention mechanism, text and image features were fused together for detecting and classifying various forms of fake news. The main contributions of this paper are as follows. Firstly, based upon the 6-way classification labels in the Fakeddit, the multimodal model of MRDCA was designed for fine-grained fake news detection. Secondly, the co-attention mechanism was utilized for dynamically learning and capturing information interaction between text and image modal features for better accomplishment of the fine-grained fake news classification task. Thirdly, multiple experiments on the benchmark dataset of Fakeddit validated the effectiveness of MRDCA.

The remainder of this paper is organized as follows. Section 2 reviews relevant studies that have been conducted on fake news detection. Section 3 presents the methodology consisting of the three modules of RoBERTa, DenseNet, and the co-attention mechanism. Section 4 presents the dataset, experimental settings, performance metrics, and comparison experiments. Section 5 describes the experimental results and further discusses performance between MRDCA and other models with unimodal and multimodal features. Section 6 summarizes major contributions and research limitations.

2 Related work

A news article involves information from multiple perspectives such as headline, content, image, and metadata. Any alterations made to these different perspectives will give rise to deceptive behavior that is commonly termed fake news (Lazer et al., 2018; Singhal et al., 2019). Different categories of fake news have been explored with the objective of obtaining insights on how fake news can be efficiently and quickly identified (Bondielli & Marcelloni, 2019; Davoudi et al., 2022; Kürüm et al., 2018), in order to alleviate its negative influences on the entire society. Kirchknopf et al. (2021) summarized various approaches for detecting fake news and divided them into two major groups: unimodal approaches and multimodal approaches. The first group of approaches depended strongly on unimodal features such as text or image. As for the text modality, fake news detection was mainly focused on statistical and semantic features of text content (Braşoveanu & Andonie, 2019; Liao et al., 2021). Based upon counting diverse symbols (e.g. punctuation, emotion, and hyperlink) in news texts, Castillo et al. (2011) developed a model for determining the authenticity of news. Rashkin et al. (2017) incorporated semantic information and features into a fake news detection model that was combined with a long short-term memory (LSTM) network. Aiming to enhance fact analysis in news content, Pan et al. (2018) proposed novel approaches, including the B-TransE model, for detecting fake news through knowledge graphs. The results indicated that some approaches achieved F1 scores over 0.80. However, the aforementioned approaches relied heavily on hand-crafted features, which was inefficient and consumed substantial resources.

With the continuous development of deep learning, researchers were inclined to construct detection models based on deep learning techniques for automatic end-to-end detection of fake news. Wang (2017) developed a hybrid convolutional neural network (CNN) model and categorized fake news into six classes according to various integrations of metadata. Embedding LSTM, depth LSTM, linguistic inquiry and word count (LIWC) CNN, and n-gram CNN were incorporated into an ensemble learning framework for discerning fake news (Huang & Chen, 2020). The self-adaptive harmony search algorithm was utilized to determine the weights of the ensemble models. On the basis of a set of explicit and latent features extracted from textual information, Zhang et al. (2019) designed an automatic fake news credibility inference model, FAKEDETECTOR, which was a deep diffusive network model that learned the representations of news articles, creators and subjects simultaneously. The results of extensive experiments demonstrated that FAKEDETECTOR outperformed other approaches. Another deep attention model based upon a recurrent neural network (RNN) was constructed to identify textual rumors from the social media platform of Twitter (Chen et al., 2018). Samadi et al. (2021) combined contextualized text representation with deep neural classification for fake news detection. Comparative experiments were implemented to evaluate the performance of different combinations of pre-trained models and neural classifiers. Dai et al. (2021) proposed an aspect-level sentiment analysis approach combining syntactic information with the RoBERTa model. The results indicated that the tree induced from fine-tuned RoBERTa (FT-RoBERTa) outperformed the parser-provided tree.

A variety of commercial tools have been designed and created for editing images, making it extremely convenient to forge fake images. Academics and practitioners have proposed multiple approaches to detecting malicious image manipulation (Fast et al., 2018; Mangal & Sharma, 2020). Owing to the advantage of interpretability, domain-specific approaches paid attention to isolating physical cues within an image (Huh et al., 2018), which had proven to be very powerful in identifying resampling artifacts, misaligned blocks and other cues (Huang et al., 2010; Liu, 2011). More recent studies moved away from domain-specific approaches to machine or deep learning approaches that were concentrated on employing end-to-end learning techniques to discern false information from images (Huh et al., 2018). In order for the detection of image-to-image translation, both state-of-the-art methods and deep CNN model were used for developing image forgery detectors (Marra et al., 2018). A fully convolutional network (FCN)-based approach was utilized for localizing image splicing attacks. Salloum et al. (2018) evaluated the single-task FCN (SFCN) trained on the surface label, and the multi-task FCN (MFCN) which adopted two output branches for multi-task learning. Zhang et al. (2019) proposed a novel multiple feature reweight DenseNet (MFR-DenseNet) architecture to complete the image classification task. The MFR-DenseNet improves the representation power of the DenseNet by adaptively recalibrating the channel-wise feature responses and explicitly modeling the interdependencies between the features of different convolutional layers.

Despite the fact that these approaches based upon unimodal features performed well in fake news detection, the short and informal nature of social media data becomes a challenge in extracting false information (Singhal et al., 2019). False information can span multiple modalities, including text, image, and video. Recently there has been growing interest in using multimodal information for identifying fake news. Given that CNNs are good at image forensics, Simonyan and Zisserman (2014) adopted pre-trained deep CNN models for feature extraction of fake news images, which were fused with textual modal information. Jin et al. (2017) established a recurrent neural network with attention mechanism (att-RNN) to fuse multimodal features of fake news. Wang et al. (2018) proposed an event adversarial network-based multimodal fake news detection model. Within the model, VGG19 and TextCNN were used to extract image and text features, respectively. An event discriminator was added to fully learn event-independent features for promoting generalization performance of the model. To overcome the limitation of current approaches in learning a shared representation of multimodal features, an end-to-end network of multimodal variational autoencoder (MVAE) with a binary classifier was established for the task of fake news detection (Khattar et al., 2019).

The aforementioned approaches based on multimodal features were able to offer prominent performance, which indicated that visual and other modal information contributed to the semantic enhancement of textual modal information. However, interrelationships between features across multiple modalities were overlooked, and extensive attention was paid to the binary classification of fake news. Because the forms and degrees of fake news are complicated, binary classification can lose much of the information contained in fake news. In most existing approaches, multimodal features were fused by concatenation, which could not capture information interaction among text, image and other modalities. Aiming to overcome these shortcomings, the model of MRDCA was proposed for fine-grained classification of fake news into six classes: true, satire/parody, misleading content, imposter content, false connection, and manipulated content. The two modalities of text and image were taken into consideration through the multimodal feature fusion mechanism of co-attention.

3 Methodology

The developed multimodal model of MRDCA for the detection of fine-grained fake news was composed of three modules. The first module utilized the language model of the robustly optimized BERT pre-training approach (RoBERTa) for extracting contextual text features. The second one employed a dense convolutional network (DenseNet) as the image feature extraction module. The third one was a multimodal fusion module which combined text features with image features to obtain multimodal feature vectors on the basis of the co-attention mechanism. The accompanying framework figure illustrates the developed multimodal model integrating RoBERTa with DenseNet through co-attention.

3.1 RoBERTa model

As a language representation model, the BERT has achieved state-of-the-art accuracy on a variety of natural language processing and understanding tasks (Devlin et al., 2018; Islam et al., 2022). Previously language representation models were only able to read text input sequentially from left to right or from right to left. However, they were not able to conduct both simultaneously. The model of BERT is distinguished, since it is designed with the objective of reading from both directions at the same time (Lin et al., 2022; Song et al., 2021). Based upon this bidirectional capability, BERT is pre-trained on two different, but related tasks of masked language modeling (MLM) and next sentence prediction (NSP) (Devlin et al., 2018). MLM training aims to hide a word in a sentence, and then have the program predict what word has been masked according to the context of the masked word. NSP training is to have the program predict whether two given sentences have a logical, sequential connection or whether the connection is simply random.

BERT is based on the transformer, a deep learning model in which each output element is connected to each input element, and the weightings between them are dynamically computed based upon their connections. The transformer architecture consists of two parts, an encoder and a decoder. It is an encoder-decoder network which deploys self-attention on the encoder side and attention on the decoder side (Van Aken et al., 2019; Croce et al., 2020). BERT is basically an encoder stack of the transformer architecture. Each encoder layer in the BERT model is demonstrated in Fig. 2. Every text sequence was prepended with the special token “CLS”. The final representation for each input sequence was obtained by summing up its token embedding, segment embedding and position embedding (Devlin et al., 2018). Rather than the static sinusoidal function used in the transformer, the position embedding in BERT was learned. This increased the learning effort in the pre-training stage, but the extra effort was almost negligible compared with the number of trainable parameters in the transformer encoder.

The multi-head self-attention is an ensemble of multiple attention modules sharing the same formulation (Devlin et al., 2018; Si et al., 2020). Given a text sequence represented as the embedding matrix $Y \in {\mathbb {R}}^{(L+1) \times D}$, L denotes the length of the text sequence, and D denotes the token and position embedding dimensions. The first row in Y corresponds to the special token ‘CLS’, and there are $L+1$ rows in Y. For a single attention head, the ‘CLS’ token and the tokens of the input sequence are mapped into key, query and value triplets, represented as matrices $K \in {\mathbb {R}}^{(L+1) \times d}$, $Q \in {\mathbb {R}}^{(L+1) \times d}$, and $V \in {\mathbb {R}}^{(L+1) \times d}$, where d denotes the dimension of each attention head.

$$\begin{aligned} K=Y W_K, Q=Y W_Q, V=Y W_V. \end{aligned}$$

(1)

where $\left\{ W_K, W_Q, W_V\right\} \in {\mathbb {R}}^{D \times d}$ are learnable parameters for the key, query and value of self-attention. On the basis of the three matrices K, Q and V, the attention mechanism can be calculated as

$$\begin{aligned} O_i={\text {attention}}_i(K, Q, V)={\text {softmax}}\left( \frac{Q K^T}{\sqrt{d}}\right) V \in {\mathbb {R}}^{(L+1) \times d} \end{aligned}$$

(2)

where $i=1,2, \ldots , h$, $h$ denotes the number of attention heads, and softmax $(\cdot )$ denotes the softmax function applied row-wise. The multi-head self-attention is defined by concatenating and projecting the representation of each head as

$$\begin{aligned} O=\left[ O_1, O_2, \ldots , O_h\right] W \in {\mathbb {R}}^{(L+1) \times D} \end{aligned}$$

(3)

where $[\cdot , \cdot ]$ denotes column-wise concatenation, and W denotes a learnable projection matrix. Based upon the multi-head self-attention, the position-wise feed forward network in Fig. 2 consisting of two fully connected layers is represented as

$$\begin{aligned} F F N(y)=\max \left( 0, u W_{1}+b_{1}\right) W_{2}+b_{2} \end{aligned}$$

(4)

where max$(0,\cdot )$ denotes the standard ReLU activation function, {$W_1$, $W_2$, $b_1$, $b_2$} are learnable parameters, and u is the layer-normalized residual block u=LayerNorm (y+o). Here y (rows of Y) and o (rows of O) are the inputs and outputs of the multi-head self-attention based on Equations (1–3). The LayerNorm(*) operator is applied according to Ba et al. (2016).
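As an illustration of Equations (1–3), the following is a minimal sketch in plain Python, not the implementation used in the study. It assumes projection matrices of shape $D \times d$ with $d = D/h$ so that the products in Eq. (1) are defined; the toy sizes, the random weights and all helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(Y, W_K, W_Q, W_V):
    """Eqs. (1)-(2): one head mapping an (L+1) x D input to an (L+1) x d output."""
    K, Q, V = matmul(Y, W_K), matmul(Y, W_Q), matmul(Y, W_V)
    d = len(W_K[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(scores), V)

def multi_head(Y, head_weights, W):
    """Eq. (3): concatenate the head outputs column-wise, then project with W."""
    outs = [attention_head(Y, *weights) for weights in head_weights]
    concat = [sum((o[r] for o in outs), []) for r in range(len(Y))]
    return matmul(concat, W)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
rows, D, h = 4, 6, 2          # toy sizes (assumption): L + 1 = 4 tokens, D = 6, h = 2
d = D // h                    # per-head dimension
Y = rand_matrix(rows, D, rng)
head_weights = [tuple(rand_matrix(D, d, rng) for _ in range(3)) for _ in range(h)]
W = rand_matrix(h * d, D, rng)
O = multi_head(Y, head_weights, W)
print(len(O), len(O[0]))      # output has the same shape as Y
```

The output O keeps the $(L+1) \times D$ shape of the input, which is what allows the residual connection y+o in the feed forward step.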

Although BERT has been shown to be a promising language model, its training and pre-processing have also received scrutiny (Delobelle et al., 2020). Liu et al. (2019) developed RoBERTa as an improved recipe for training BERT models, while maintaining the same model architecture as BERT. The first modification is the removal of the NSP objective, which was designed to improve performance on downstream tasks in BERT. BERT is pre-trained on the two tasks of MLM and NSP. Compared with BERT, RoBERTa trains the model on longer sequences and makes the masked language modeling more difficult. Therefore, the task of MLM overlaps the topic prediction task, and hence the task of NSP becomes redundant. The second modification is to pre-train with sequences of at most 512 tokens. RoBERTa does not randomly inject short sequences, and it does not train with a reduced sequence length for the first 90% of updates; it trains only on full-length sequences.

The third modification of RoBERTa is to train the model longer, with a larger batch size, over more data. The next modification is to dynamically change the masking pattern. BERT relies on randomly masking and predicting tokens. The original BERT implementation performs masking once during data preprocessing, which results in a single static mask. In RoBERTa, the masking is implemented during training. Each time a sentence is incorporated in a mini-batch, it is masked anew, and therefore the number of potentially different masked versions of each sentence is not bounded as in BERT. This becomes crucial when pre-training for more steps or with larger datasets. Another modification is that RoBERTa adopts a larger byte-level BPE (Byte-Pair Encoding) vocabulary. BPE is a hybrid between character- and word-level representations that allows handling the large vocabularies common in natural language corpora. Based on these modifications, the RoBERTa model obtains better text representation capability than BERT (Liu et al., 2019). Therefore, this study employed RoBERTa for extracting text features from fine-grained fake news, and the vector corresponding to the output ‘CLS’ was used as the feature vector of news text. The final text feature extracted by the RoBERTa model is ${R}_{T}$, with dimension batch-size * d_model, where d_model is 768.
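The difference between static and dynamic masking described above can be sketched as follows. This is a simplified illustration: it masks each token independently at a 15% rate and omits the other details of the BERT masking scheme, and the `<mask>` placeholder, the example sentence and the function name are assumptions made for the example.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumption for this sketch)

def mask_tokens(tokens, rng, prob=0.15):
    """Replace each token with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else tok for tok in tokens]

tokens = "the photo shows a shark swimming on a flooded highway".split()

# Static masking: the mask is drawn once during preprocessing
# and the same masked sentence is reused in every epoch.
static_view = mask_tokens(tokens, random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh mask is drawn every time
# the sentence is placed into a mini-batch.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, view in enumerate(dynamic_epochs):
    print(epoch, " ".join(view))
```

With static masking the model always sees the same masked positions for a given sentence, whereas with dynamic masking the masked positions can change from one pass to the next.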

3.2 DenseNet model

Convolutional neural networks (CNNs) such as GoogLeNet, VGG-19, Inception and ResNet have become the dominant machine learning approaches in the field of computer vision [56] (Szegedy et al., 2017). In a standard CNN model, an image is taken as the input and passed through the network in a straightforward forward pass to obtain a predicted label as the output. Each convolutional layer, except the first one which takes in the input image, takes in the output of the previous convolutional layer and produces an output feature map that is then passed to the next convolutional layer. For L layers, there are L direct connections, one between each layer and its subsequent layer.

The architecture of DenseNet modifies the architecture of a standard CNN, as depicted in the accompanying figure. Each layer in the DenseNet is connected to every other layer, hence the name densely connected convolutional network. For M layers, there are $M(M+1)/2$ direct connections. For each layer, the feature maps of all the preceding layers are used as inputs, and its own feature maps are used as inputs for subsequent layers. Given a DenseNet architecture with M layers, each layer performs a non-linear transformation $H_i$. The output of the $i$th layer of the architecture is represented as $x_i$, and the input image is represented as $x_0$. The corresponding dense connectivity can be represented as

$$\begin{aligned} x_{i}=H_{i}\left( \left[ x_{0}, x_{1}, x_{2}, \ldots , x_{i-1}\right] \right) \end{aligned}$$

(5)
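The connection count and the concatenation in Eq. (5) can be made concrete with a small sketch. Plain lists stand in for feature maps and `toy_layer` is a hypothetical stand-in for $H_i$; the values 96 and 48 used below are the initial feature count and growth rate commonly associated with DenseNet161 and are stated here as an assumption, not taken from the text.

```python
def dense_connections(num_layers):
    """Number of direct connections among M densely connected layers: M(M+1)/2."""
    return num_layers * (num_layers + 1) // 2

def dense_block_channels(in_channels, growth_rate, num_layers):
    """Input channel count seen by each layer when earlier feature maps are concatenated."""
    return [in_channels + i * growth_rate for i in range(num_layers)]

def dense_forward(x0, layers):
    """Eq. (5): x_i = H_i([x_0, ..., x_{i-1}]), with flat lists standing in for feature maps."""
    features = [x0]
    for H in layers:
        concatenated = [value for feature in features for value in feature]
        features.append(H(concatenated))
    return features

def toy_layer(inputs):
    """Hypothetical H_i: emits a single value, the sum of everything it receives."""
    return [sum(inputs)]

print(dense_connections(5))                # 15, versus 5 in a plain chain of 5 layers
print(dense_block_channels(96, 48, 6))     # [96, 144, 192, 240, 288, 336]
print(dense_forward([1.0], [toy_layer] * 3))  # [[1.0], [1.0], [2.0], [4.0]]
```

Because every layer receives all earlier outputs, the input width grows linearly with depth inside a block, which is why each layer only needs to add a small number of new feature maps.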

In a DenseNet architecture, every layer is essentially connected to every other layer. This is the main idea, and it is extremely powerful. The input of a layer inside DenseNet is the concatenation of feature maps from previous layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters (Huang et al., 2017). The DenseNet network was originally designed for the 1000-class classification task on the ImageNet dataset. A fully connected layer was attached after the DenseNet network to extract the image features. Firstly, the picture is pre-processed, including operations such as cropping, resizing, and rotation. The picture is then converted into an RGB three-channel tensor V for computational processing, with dimension batch-size * 3 * 224 * 224. Secondly, V is processed by the DenseNet network. Finally, a linear layer is connected, which maps the output to 768 dimensions, so that the output ${R}_{V}$ has dimension batch-size * d_model. The model weights adopted in this study were based upon DenseNet161 pre-trained on the ImageNet dataset, and were further trained on the news dataset for parameter update.

3.3 Co-attention mechanism

The co-attention mechanism, shown in the accompanying figure, is used for fusing text features and image features from the news dataset. In the detection of multimodal fine-grained fake news, there are certain relationships between texts and images. Rather than directly splicing text and image features, the co-attention mechanism was deployed for modeling dense interactions between text and image features by exchanging their information. It produced an attention-pooled feature for one modality (e.g. text) conditioned on another modality (e.g. image). Texts and images were connected by calculating the similarity between text-image pairs of features. If ${R}_{T}$ came from the text and ${R}_{V}$ came from the attached image, the attention value calculated using ${R}_{T}$ and ${R}_{V}$ could be used as a measure of the similarity between the text and the image, and then used to weight the image. Given the text features as ${R}_{T} \in {\mathbb {R}}^{d \times T} $ and the image features as ${R}_{V} \in {\mathbb {R}}^{d \times N}$, the affinity matrix $C \in {\mathbb {R}}^{T \times N} $ was calculated as

$$\begin{aligned} C=\tanh \left( {R}_{T}^{T} W_{b} {R}_{V}\right) \end{aligned}$$

(6)

where $W_{b} \in {\mathbb {R}}^{d \times d}$ is the parameter matrix. After the calculation of affinity matrix, it could be considered as a feature to learn as well as predict the attention weights of text and image by using Eqs. (7) and (8).

$$\begin{aligned} H^{t}= & {} \tanh \left( W_{t} {R}_{T}+\left( W_{v} {R}_{V}\right) C^{T}\right) , H^{v}=\tanh \left( W_{v} {R}_{V}+\left( W_{t} {R}_{T}\right) C\right) \end{aligned}$$

(7)

$$\begin{aligned} \alpha ^{t}= & {} \text {softmax}\left( W_{h t}^{T} H^{t}\right) , \alpha ^{v}=\text {softmax}\left( W_{h v}^{T} H^{v}\right) \end{aligned}$$

(8)

where $W_{t} \in {\mathbb {R}}^{k \times d}$, $W_{v} \in {\mathbb {R}}^{k \times d}$, $W_{h t} \in {\mathbb {R}}^{k}$, and $W_{h v} \in {\mathbb {R}}^{k}$ are learnable parameters, $\alpha ^{t} \in {\mathbb {R}}^{T}$ denotes the probability of attention weight for each text word, and $\alpha ^{v} \in {\mathbb {R}}^{N}$ denotes the probability of attention weight for each image region. On the basis of attention weights, the two vectors of text and image through attention were represented as

$$\begin{aligned} \widehat{t}=\sum _{i=1}^{T} \alpha _{i}^{t} t_{i}, \quad {\widehat{v}}=\sum _{n=1}^{N} \alpha _{n}^{v} v_{n} \end{aligned}$$

(9)

Later the two vectors were spliced (see Eq. 10) for obtaining ${R}_{F}$, which was finally input into the classifier of softmax for the classification of fine-grained fake news.

$$\begin{aligned} {R}_{F}={\widehat{v}} \oplus \widehat{{t}} \end{aligned}$$

(10)
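Equations (6–10) can be traced step by step with the following dependency-free sketch. It is an illustration under stated assumptions, not the trained model: the sizes d, T, N and k, the random parameters and the helper names are hypothetical, and $t_i$ and $v_n$ are taken to be the columns of ${R}_{T}$ and ${R}_{V}$.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def tanh_sum(a, b):
    """Element-wise tanh(a + b) for two matrices of equal shape."""
    return [[math.tanh(x + y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def co_attention(R_T, R_V, W_b, W_t, W_v, w_ht, w_hv):
    # Eq. (6): affinity matrix C with shape T x N
    C = [[math.tanh(x) for x in row]
         for row in matmul(matmul(transpose(R_T), W_b), R_V)]
    proj_t = matmul(W_t, R_T)                              # k x T
    proj_v = matmul(W_v, R_V)                              # k x N
    # Eq. (7): hidden representations conditioned on the other modality
    H_t = tanh_sum(proj_t, matmul(proj_v, transpose(C)))   # k x T
    H_v = tanh_sum(proj_v, matmul(proj_t, C))              # k x N
    # Eq. (8): attention weights over text tokens and image regions
    alpha_t = softmax(matmul([w_ht], H_t)[0])
    alpha_v = softmax(matmul([w_hv], H_v)[0])
    # Eq. (9): attention-weighted sums of the columns of R_T and R_V
    t_hat = [sum(a * x for a, x in zip(alpha_t, row)) for row in R_T]
    v_hat = [sum(a * x for a, x in zip(alpha_v, row)) for row in R_V]
    # Eq. (10): concatenate image and text vectors
    return v_hat + t_hat, alpha_t, alpha_v

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, T, N, k = 4, 3, 5, 2       # toy sizes (assumption)
R_T, R_V = rand_matrix(d, T, rng), rand_matrix(d, N, rng)
W_b = rand_matrix(d, d, rng)
W_t, W_v = rand_matrix(k, d, rng), rand_matrix(k, d, rng)
w_ht = [rng.uniform(-0.5, 0.5) for _ in range(k)]
w_hv = [rng.uniform(-0.5, 0.5) for _ in range(k)]
R_F, alpha_t, alpha_v = co_attention(R_T, R_V, W_b, W_t, W_v, w_ht, w_hv)
print(len(R_F), len(alpha_t), len(alpha_v))
```

The fused vector has length 2d because the attended image and text vectors, each of length d, are concatenated before classification.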

4 Empirical study

4.1 Dataset

The collection of adequate data for fake news analysis from the internet is deemed one of the major problems in the field (Bondielli & Marcelloni, 2019; Nakamura et al., 2019). Due to multiple difficulties in gathering relevant data, there are not many publicly available datasets for fake news research and detection. Compared with other existing datasets, the Fakeddit is a multimodal benchmark dataset providing a large number of multimodal samples with multiple labels for various levels of fine-grained classification (Nakamura et al., 2019). A total of 1,063,106 samples have been gathered in the Fakeddit dataset, which incorporates 2-way, 3-way, and 6-way classification labels together with comment data and metadata. It offers a large breadth of novel features which can be utilized for a variety of applications. Therefore, the Fakeddit was adopted as the source of data for this research.

All the samples in the Fakeddit were obtained from Reddit, which claims to be the front page of the internet (https://www.redditinc.com/). It is a website where a community of registered users submits content. Whether you pay attention to breaking news, sports, TV fan theories, or a never-ending stream of the internet’s cutest animals, there is a possible community on Reddit for you. Reddit is basically a large group of forums where users are able to post submissions on various specialized forums, often called “subreddits” (Anderson, 2015). Its format resembles a traditional bulletin board system, allowing users to post messages and links to other websites and to comment on each other’s posts. Reddit is one of the top 20 websites in the world by traffic (Nakamura et al., 2019). The samples in the Fakeddit were collected from 22 different subreddits. Three labels are provided for every sample, which allows training for 2-way, 3-way, and 6-way classification (Kirchknopf et al., 2021). Instead of merely doing a simple binary or ternary classification, the 6-way classification was created to categorize fake news into different types, which is beneficial for demonstrating the degree and variation of fake news. Therefore, it was employed for fine-grained fake news detection. Table 1 explains the 6-way classification labels (i.e. true, satire/parody, misleading content, imposter content, false connection, and manipulated content) in the Fakeddit. Some examples with 6-way classification labels are provided in the accompanying figure.

Table 1 Labels of 6-way classification in the Fakeddit

Owing to the restriction of experimental conditions, 30,053 samples including both texts and images were randomly selected from the Fakeddit. This subset maintained a data distribution of 6-way classification labels similar to that of the original Fakeddit dataset. Table 2 illustrates the data distribution of 6-way classification labels in the training set, validation set, and test set, respectively.
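One common way to draw a random subset that preserves the label distribution is stratified sampling. The sketch below is an assumption about how such a subset could be drawn; the text only states that the samples were randomly selected and that the distribution stayed similar. The function name and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Pick roughly `fraction` of the indices from each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    chosen = []
    for indices in by_label.values():
        # Keep at least one sample per class so rare classes are not dropped.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy label list with an imbalanced three-class distribution.
labels = [0] * 60 + [1] * 30 + [2] * 10
subset = stratified_sample(labels, fraction=0.1)
print(Counter(labels[i] for i in subset))
```

Sampling within each class separately keeps rare classes represented in proportion, which matters for the minority classes of a fine-grained task.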

Table 2 Data distribution in datasets

4.2 Experimental settings

PyTorch is a free and open-source library that is mainly adopted for computer vision, deep learning, and natural language processing applications. Unlike popular deep learning frameworks that utilize static computation graphs, PyTorch employs dynamic computation graphs, which allows greater flexibility in establishing complex architectures (Chen et al., 2019; Subramanian, 2018). This research thus deployed the modules in PyTorch for specifying the model of MRDCA. Considering that there are a large number of parameters in both modules of RoBERTa and DenseNet, a cloud server was utilized for experiments. The training of the MRDCA model was implemented on a Windows machine with an Intel Xeon Gold 6240C processor with 36 cores, 72 threads, and 24.75 MB cache. The graphics card adopted was an NVIDIA GeForce RTX 3090 with about 24 GB of graphics memory. Table 3 shows these configuration parameters of the machine.

Table 3 Configuration parameters of the machine


The MRDCA model integrated RoBERTa with DenseNet through a co-attention feature fusion mechanism. Given that DenseNet161 is the largest model in the DenseNet group, with a size of around 100 MB (Mai et al., 2020), we adopted DenseNet161, which comprises four dense blocks. Early stopping is a form of regularization based on choosing when to stop running an iterative algorithm (Caponnetto & Yao, 2010; Raskutti et al., 2014). The early stopping strategy was applied to improve model accuracy. The model was scheduled to train for 40 epochs, and training was stopped once 5000 steps had been reached and the results were no longer improving. The AdamW optimizer was used with a learning rate of 3e−5 and a batch size of 32. The loss function was the cross-entropy loss. Table 4 lists the corresponding hyperparameters of the model. The loss curve during model training is shown in Fig. 6. As can be seen from Fig. 6, the model reached its optimal effect after about 4000 steps: with continued training, the loss on the validation set no longer decreased.
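The early stopping rule described above can be illustrated with a small, framework-free sketch. It assumes a patience-based criterion (stop after a fixed number of validation checks without improvement); the class name `EarlyStopper`, the patience value, and the validation losses are hypothetical, since the text does not specify how the stopping check was implemented.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience):
        self.patience = patience          # hypothetical setting, not from the paper
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            # New best validation loss: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Invented validation losses: they improve, then plateau and drift upward.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.42, 0.43, 0.44, 0.45]

stopper = EarlyStopper(patience=3)
stopped_at = None
for check, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = check
        break

print(stopped_at, stopper.best_loss)
```

In a real training loop the model weights saved at the best validation loss would normally be restored after stopping, so that the reported results correspond to the best checkpoint rather than the last one.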

Table 4 Hyperparameters of the model


4.3 Performance metrics

Based on the confusion matrix, four common indicators, namely accuracy, precision, recall, and $F_1$ score, were employed to evaluate the performance of the MRDCA model. As the most intuitive performance indicator (Zhou et al., 2021), accuracy is defined as the ratio of correct predictions to all predictions made by a model; it shows how often the model can be expected to predict an outcome correctly. Precision measures the proportion of positive predictions that are actually correct. It is a useful indicator of prediction success when the classes are very imbalanced. Recall, also known as sensitivity, represents the model’s ability to correctly identify the positives among all actual positives. Precision is usually considered together with recall to trade off false positives against false negatives. The $F_1$ score expresses the model’s performance as a function of both precision and recall. It is a well-established classification performance indicator that conveys a balance between the two. Compared with accuracy, the $F_1$ score is more informative and transparent in problems that exhibit class imbalance (Hunt et al., 2022).

For a 2-way classification problem, the confusion matrix consists of four outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN) (Raschka, 2014; Veropoulos et al., 1999). True positives are the observations that the model correctly predicts as positive, and true negatives are the observations that the model correctly predicts as negative. False positives are the observations predicted as positive whose actual class is negative, and false negatives are the observations predicted as negative whose actual class is positive (Zhou et al., 2021). Using the values of TP, FP, TN, and FN, the four indicators of accuracy, precision, recall, and $F_1$ score can be calculated with Eqs. (11)–(14) (Grandini et al., 2020).

$$\begin{aligned} \text { Accuracy }= & {} \frac{T P+T N}{T P+F P+F N+T N} \end{aligned}$$

(11)

$$\begin{aligned} \text { Precision }= & {} \frac{T P}{T P+F P} \end{aligned}$$

(12)

$$\begin{aligned} \text { Recall }= & {} \frac{T P}{T P+F N} \end{aligned}$$

(13)

$$\begin{aligned} F_{1}= & {} \frac{2 \times \text { Precision } \times \text { Recall }}{\text { Precision }+\text { Recall }}=\frac{2 \times T P}{2 \times T P+F N+F P} \end{aligned}$$

(14)
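Equations (11)–(14) translate directly into code. The sketch below is only an illustration: the function name `binary_metrics` and the confusion-matrix counts are invented, and the zero-denominator guards are a convention added here rather than something stated in the text.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)            # Eq. (11)
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # Eq. (12)
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # Eq. (13)
    denom = 2 * tp + fn + fp
    f1 = 2 * tp / denom if denom else 0.0                 # Eq. (14), count form
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts for one binary problem (hypothetical data).
m = binary_metrics(tp=80, fp=20, tn=90, fn=10)
print(m)

# The two forms of Eq. (14) give the same value.
harmonic = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
```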

Because the 6-way classification in this research can be decomposed into distinct 2-way classification problems, one per class, the precision defined in Eq. (12) can also be calculated separately for each class. The macro average precision ($Macro\_Avg\_Precision$) is simply the arithmetic mean of the precisions of the single classes, as shown in Eq. (15) (Grandini et al., 2020). The macro average recall ($Macro\_Avg\_Recall$) and the macro average $F_1$ score ($Macro\_Avg\_F_1$) are calculated analogously with Eqs. (16) and (17), respectively.

$$\begin{aligned} \text {Macro}\_\text {Avg}\_\text {Precision}= & {} \frac{\sum _{i=1}^{n} \text { Precision }_{i}}{n} \end{aligned}$$

(15)

$$\begin{aligned} \text { Macro}\_\text {Avg}\_\text {Recall }= & {} \frac{\sum _{i=1}^{n} \text { Recall }_{i}}{n} \end{aligned}$$

(16)

$$\begin{aligned} \text {Macro}\_\text {Avg}\_{F}_{1}= & {} \frac{\sum _{i=1}^{n} F_{1 i}}{n} \end{aligned}$$

(17)

where $n$ denotes the number of classes, $Precision_i$ denotes the value of precision for the class $i \in \{1,2, \ldots , n\}$, $Recall_i$ denotes the value of recall for the class $i \in \{1,2, \ldots , n\}$, and ${F_{1i}}$ denotes the value of $F_1$ for the class $i \in \{1,2, \ldots , n\}$. The macro average approaches compute an overall mean of the per-class indicators of precision, recall, and $F_1$ score. They do not depend on class size, since classes of different sizes are equally weighted in the numerator. In other words, the largest class has the same influence on these indicators as the smallest classes. The resulting indicators evaluate the model from a class standpoint. High values demonstrate that the model performs well on all classes, whereas low values point to classes that are predicted poorly (Grandini et al., 2020; Zubiaga et al., 2018).
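The macro averages can be computed from a multi-class confusion matrix by treating each class in turn as the positive class. The following sketch uses an invented 3-class matrix (rows are actual classes, columns are predicted classes); the function names and the data are assumptions for illustration, and the same code applies unchanged to six classes.

```python
def per_class_scores(confusion):
    """One-vs-rest precision, recall, and F1 for every class."""
    n = len(confusion)
    scores = []
    for i in range(n):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(n)) - tp   # predicted i, actually other
        fn = sum(confusion[i]) - tp                        # actually i, predicted other
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = 2 * tp + fn + fp
        f1 = 2 * tp / denom if denom else 0.0
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    return scores


def macro_average(scores):
    """Unweighted mean of each indicator over the classes, as in Eqs. (15)-(17)."""
    n = len(scores)
    return {key: sum(s[key] for s in scores) / n
            for key in ("precision", "recall", "f1")}


# Invented confusion matrix (hypothetical data): rows = actual, columns = predicted.
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
macro = macro_average(per_class_scores(toy_confusion))
print(macro)
```

Every class contributes one term to each mean regardless of how many samples it contains, which is the class-size independence described above.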

4.4 Comparison experiments

To verify the effectiveness of the MRDCA model for fake news detection, multiple experiments were designed for performance comparison. Two models, BERT and RoBERTa, were used to extract features from textual data. Three models, VGG19, ResNet50, and DenseNet161, were used to extract features from image data. Concatenation was employed as the fusion mechanism in the multimodal comparison experiments. The first group of comparison experiments adopted a unimodal approach for fake news detection on the basis of text or image samples, as indicated in Table 5. Table 6 lists the second group of comparison experiments, which adopted a multimodal approach based on both text and image samples.

Table 5 Comparison experiments with a unimodal approach


Table 6 Comparison experiments with a multimodal approach


5 Results and analysis

Table 7 presents the experimental results of fake news detection for various selections of modalities. With respect to the single text modality, RoBERTa had higher values than BERT for all the indicators: accuracy (83.63%), macro average precision (84.89%), macro average recall (82.60%), and macro average $F_1$ score (83.63%). This indicated that RoBERTa performed better than BERT in detecting fake news based on textual feature extraction, and it provided further evidence of RoBERTa’s strong ability in semantic processing and comprehension. As to the single image modality, the three models of VGG19, ResNet50, and DenseNet161 did not perform well, because the values of all four indicators were lower than 65%. Among the three models, ResNet50 achieved the highest accuracy at 63.37%. The values of macro average recall and macro average $F_1$ score were approximately 50%. This indicated that the three CNN models were not able to extract adequate features from image samples for the identification of fake news. A possible explanation is that registered users of Reddit often submitted or forwarded images that were irrelevant to the topic of the post. These images thus did not provide enough valid information for detecting fake news.

Table 7 Performance of different models in detecting fake news


The experimental results showed that some models based on a multimodal approach had lower values of the four indicators (accuracy, macro average precision, macro average recall, and macro average $F_1$ score) than the models with a single text modality. For instance, Fig. 7 compares the performance of BERT and BERT + VGG19 in fake news detection. As displayed, the values of all four indicators for the single modality model of BERT were larger than those for the multimodal model of BERT + VGG19, so the former performed better than the latter. This indicated that, in this case, image features weakened the contribution of text features in discerning fake news. Other comparisons, such as that between RoBERTa and RoBERTa + ResNet50, presented similar results. Although the multimodal model of RoBERTa + ResNet50 had slightly higher values of accuracy and macro average recall, the single modality model of RoBERTa had slightly higher values of macro average precision and macro average $F_1$ score. These comparisons demonstrated that image modal features did not always have a positive influence on the performance in fake news detection.

In contrast, text modal features did reinforce the role of image modal features in determining fake news. Compared with the single image modality, all models with a multimodal approach had larger values of the four indicators (i.e. accuracy, macro average precision, macro average recall, and macro average $F_1$ score). Figure 8 illustrates an example that compares the performance of ResNet50 and RoBERTa + ResNet50 in fake news detection. After the addition of text features to image features, performance in determining fake news improved significantly. As shown in Fig. 8, the value of every indicator for the multimodal model of RoBERTa + ResNet50 increased by approximately 20% compared with the unimodal approach of ResNet50. Therefore, text modal features had a consistently positive influence on the performance in fake news detection. One possible explanation is that texts convey information more clearly and exactly than images.

Among the multimodal models based on the concatenation fusion mechanism, RoBERTa + DenseNet161 had larger values of accuracy, macro average precision, and macro average $F_1$ score than the other models. This indicated that image features extracted by DenseNet161 strengthened the role of text features extracted by RoBERTa in the identification of fake news. Therefore, RoBERTa and DenseNet161 were selected for developing an integrated approach with multimodal features in this research. Comparing RoBERTa + DenseNet161 with MRDCA, we found that the latter performed better on all four indicators. The co-attention fusion mechanism thus played a more important role than concatenation in fake news detection based on integrating text features with image features. Relative to RoBERTa + DenseNet161, MRDCA achieved an increase of 2.59% in accuracy, 0.17% in macro average precision, 6.17% in macro average recall, and 3.16% in macro average $F_1$ score. Among all the models in Table 7, MRDCA performed best in detecting fake news with 6-way classification.

Table 8 reports the performance of the MRDCA model in discerning fake news for each individual class. As indicated, MRDCA had a high accuracy of 88.14%, meaning that only 11.86% of test samples were misclassified. Among all six classes, MRDCA performed best in identifying manipulated content, with the highest values of precision (94.90%), recall (94.69%), and $F_1$ score (94.80%). MRDCA also performed well in detecting false connection and true, with relatively high values of precision, recall, and $F_1$ score for these two classes. Precision was higher than recall for these three classes. This suggested that there were relatively few false positives and that MRDCA applied strict criteria when classifying samples as manipulated content, false connection, or true.

Table 8 Performance of MRDCA in fake news detection for individual class


In contrast, MRDCA performed worst in identifying misleading content, with the lowest values of precision (76.52%), recall (83.56%), and $F_1$ score (79.88%). A possible explanation is that this category of fake news consists of misleading information created with the intention of deceiving the audience, and this intention makes misleading content harder to detect. MRDCA did not perform well in classifying imposter content and satire/parody either, since their values of precision, recall, and $F_1$ score were smaller than the corresponding macro average values. Therefore, the task of categorizing samples into the three classes of misleading content, imposter content, and satire/parody was extremely challenging, and there was much room for improvement. As shown in Table 8, recall was higher than precision for these three classes, especially for misleading content. The relatively low $F_1$ scores could therefore be attributed to the correspondingly low precision values. The higher recall suggested that there were relatively few false negatives and that MRDCA was more permissive when classifying samples as misleading content, imposter content, or satire/parody.
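As a consistency check, the per-class $F_1$ scores quoted in the text can be recomputed from the quoted precision and recall using Eq. (14). The sketch below uses only the two classes whose three values are stated in the text; the function name `f1_from` is an assumption, and small differences in the last digit are expected because the inputs are already rounded to two decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, as in Eq. (14)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as quoted in the text.
reported = {
    "manipulated content": (94.90, 94.69, 94.80),
    "misleading content": (76.52, 83.56, 79.88),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (p, r, f1) in reported.items():
    print(f"{name}: reported {f1:.2f}, recomputed {recomputed[name]:.2f}")
```

For misleading content the gap between recall and precision is about seven percentage points, which is why the $F_1$ score sits closer to the lower precision value than a simple average would suggest.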

6 Conclusions

Considering the multiple aspects of misinformation in fake news, this research developed a multimodal model, MRDCA, for detecting fake news. Its main contributions to the body of knowledge are threefold. Firstly, based upon the 6-way classification labels in the Fakeddit, the multimodal model of MRDCA was designed for fine-grained fake news detection. Within MRDCA, RoBERTa and DenseNet161 were combined through a co-attention feature fusion mechanism, serving as the text feature extractor and the image feature extractor, respectively. Secondly, the co-attention mechanism was used to dynamically learn and capture the information interaction between text and image modal features. It fused the features of the text and image modalities in order to better accomplish the fine-grained fake news classification task. Thirdly, multiple experiments on the benchmark dataset of Fakeddit validated the advantage of MRDCA. Experimental results demonstrated that the multimodal model of MRDCA outperformed unimodal approaches and other multimodal approaches. Fine-grained fake news detection contributes to a better understanding of the degree of fakeness, which is a foundation for further investigating the characteristics, motivations, and spreading patterns of individual classes of fake news.

In spite of these contributions, some limitations of this research ought to be acknowledged and addressed in future work. Firstly, online news articles are time-sensitive, and fake news can be created in real time. The benchmark dataset of Fakeddit spans more than a decade, from 2008 to 2019 (Nakamura et al., 2019). There were likely trending topics or events of fake news in which the public was interested during this period. The samples from the Fakeddit could be separated into multiple periods for fake news detection through MRDCA, so that distinct patterns of fake news across these periods can be determined. Secondly, within the multimodal approach of MRDCA, the two modal features of text and image were extracted by using RoBERTa and DenseNet161, respectively. Features from other modalities such as audio, video, and metadata have potential contributions to the semantic enhancement of news content. Future work could therefore fuse these features with text and image features to improve the effectiveness of fake news detection. Thirdly, although MRDCA outperformed unimodal approaches and other multimodal approaches in fake news detection with 6-way classification, its performance was unbalanced across the different classes of fake news. The experimental results showed that the multimodal model of MRDCA performed better in detecting manipulated content, false connection, and true than in detecting imposter content, misleading content, and satire/parody. Future studies should pay particular attention to the detection of imposter content, misleading content, and satire/parody.

References

Anderson, K. E. (2015). Ask me anything: What is reddit? Library Hi Tech News, 32, 8–11. https://doi.org/10.1108/LHTN-03-2015-0018.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450. https://doi.org/10.48550/arXiv.1607.06450
Bondielli, A., & Marcelloni, F. (2019). A survey on fake news and rumour detection techniques. Information Sciences, 497, 38–55.
Braşoveanu, A. M., & Andonie, R. (2019). Semantic fake news detection: A machine learning perspective. International Work-Conference on Artificial Neural Networks, 11506, 656–667.
Caponnetto, A., & Yao, Y. (2010). Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8(02), 161–183.
Castillo, C., Mendoza, M., & Poblete, B. (2011). Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, pp. 675–684 https://doi.org/10.1145/1963405.1963500
Chen, T., Li, X., Yin, H., & Zhang, J. (2018). Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. In Pacific-Asia conference on knowledge discovery and data mining, pp. 40–52.
Chen, K. M., Cofer, E. M., Zhou, J., & Troyanskaya, O. G. (2019). Selene: A pytorch-based deep learning library for sequence data. Nature Methods, 16(4), 315–318.
Croce, D., Castellucci, G., & Basili, R. (2020). Gan-bert: Generative adversarial learning for robust text classification with a bunch of labeled examples. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 2114–2119. Association for Computational Linguistics, Online.
Dai, J., Yan, H., Sun, T., Liu, P., & Qiu, X. (2021). Does syntax matter? A strong baseline for aspect-based sentiment analysis with roberta. arXiv preprint arXiv:2104.04986
Davoudi, M., Moosavi, M. R., & Sadreddini, M. H. (2022). Dss: A hybrid deep model for fake news detection using propagation tree and stance network. Expert Systems with Applications, 198, 116635. https://doi.org/10.1016/j.eswa.2022.116635.
Delobelle, P., Winters, T., & Berendt, B. (2020). Robbert: A dutch roberta-based language model. arXiv preprint arXiv:2001.06286. https://doi.org/10.48550/arXiv.2001.06286
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Fallis, D. (2015). What is disinformation? Library Trends, 63(3), 401–426.
Fast, S. M., Kim, L., Cohn, E. L., Mekaru, S. R., Brownstein, J. S., & Markuzon, N. (2018). Predicting social response to infectious disease outbreaks from internet-based news streams. Annals of Operations Research, 263(1), 551–564.
Faustini, P. H. A., & Covões, T. F. (2020). Fake news detection in multiple platforms and languages. Expert Systems with Applications, 158, 113503.
Fung, Y., Thomas, C., Reddy, R. G., Polisetty, S., Ji, H., Chang, S. F., McKeown, K., Bansal, M., & Sil, A. (2021). Infosurgeon: Cross-media fine-grained information consistency checking for fake news detection. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers), pp. 1683–1698. Association for Computational Linguistics, Online.
Goldani, M. H., Momtazi, S., & Safabakhsh, R. (2021). Detecting fake news with capsule neural networks. Applied Soft Computing, 101, 106991.
Goldani, M. H., Safabakhsh, R., & Momtazi, S. (2021). Convolutional neural network with margin loss for fake news detection. Information Processing & Management, 58(1), 102418.
Grandini, M., Bagli, E., & Visani, G. (2020). Metrics for multi-class classification: An overview. arXiv preprint arXiv:2008.05756. https://doi.org/10.48550/arXiv.2008.05756
Gupta, A., Kumaraguru, P., Castillo, C., & Meier, P. (2014). Tweetcred: Real-time credibility assessment of content on twitter. International Conference on Social Informatics, 8851, 228–243.
Helmstetter, S., & Paulheim, H. (2018). Weakly supervised learning for fake news detection on twitter. In 2018 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp. 274–277.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
Huang, Y.-F., & Chen, P.-H. (2020). Fake news detection using an ensemble learning model based on self-adaptive harmony search algorithms. Expert Systems with Applications, 159, 113584.
Huang, F., Huang, J., & Shi, Y. Q. (2010). Detecting double jpeg compression with the same quantization matrix. IEEE Transactions on Information Forensics and Security, 5(4), 848–856.
Huh, M., Liu, A., Owens, A., & Efros, A. A. (2018). Fighting fake news: Image splice detection via learned self-consistency. In Proceedings of the European conference on computer vision (ECCV), pp. 101–117.
Hunt, K., Agarwal, P., & Zhuang, J. (2022). Monitoring misinformation on twitter during crisis events: A machine learning approach. Risk Analysis, 42(8), 1728–1748.
Islam, M. R., Razzak, I., Wang, X., Tilocca, P., & Xu, G. (2022). Natural language interactions enhanced by data visualization to explore insurance claims and manage risk. Annals of Operations Research. https://doi.org/10.1007/s10479-021-04465-7.
Jin, Z., Cao, J., Guo, H., Zhang, Y., & Luo, J. (2017). Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM international conference on multimedia, pp. 795–816 https://doi.org/10.1145/3123266.3123454
Khattar, D., Goud, J.S., Gupta, M., & Varma, V. (2019). Mvae: Multimodal variational autoencoder for fake news detection. In The World Wide web conference, pp. 2915–2921. https://doi.org/10.1145/3308558.3313552
Kirchknopf, A., Slijepčević, D., & Zeppelzauer, M. (2021). Multimodal detection of information disorder from social media. In 2021 International conference on content-based multimedia indexing (CBMI), pp. 1–4.
Kumar, S., & Shah, N. (2018). False information on web and social media: A survey. arXiv preprint arXiv:1804.08559
Kumar, S., Xu, C., Ghildayal, N., Chandra, C., & Yang, M. (2021). Social media effectiveness as a humanitarian response to mitigate influenza epidemic and covid-19 pandemic. Annals of Operations Research. https://doi.org/10.1007/s10479-021-03955-y.
Kürüm, E., Weber, G.-W., & Iyigun, C. (2018). Early warning on stock market bubbles via methods of optimization, clustering and inverse problems. Annals of Operations Research, 260(1), 293–320.
Lazer, D. M., Baum, M. A., Benkler, Y., Berinsky, A. J., Greenhill, K. M., Menczer, F., et al. (2018). The science of fake news. Science, 359(6380), 1094–1096.
Liao, Q., Chai, H., Han, H., Zhang, X., Wang, X., Xia, W., & Ding, Y. (2021). An integrated multi-task model for fake news detection. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2021.3054993.
Lin, S.-Y., Kung, Y.-C., & Leu, F.-Y. (2022). Predictive intelligence in harmful news identification by bert-based ensemble learning model with text sentiment analysis. Information Processing & Management, 59(2), 102872.
Liu, Q. (2011). Detection of misaligned cropping and recompression with the same quantization matrix and relevant forgery. In Proceedings of the 3rd international ACM workshop on multimedia in forensics and intelligence, pp. 25–30 https://doi.org/10.1145/2072521.2072528
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. https://doi.org/10.48550/arXiv.1907.11692
Mai, Z., Kim, H., Jeong, J., & Sanner, S. (2020). Batch-level experience replay with review for continual learning. arXiv preprint arXiv:2007.05683. https://doi.org/10.48550/arXiv.2007.05683
Mangal, D., & Sharma, D. K. (2020). Fake news detection with integration of embedded text cues and image features. In 2020 8th international conference on reliability, infocom technologies and optimization (trends and future directions)(ICRITO), pp. 68–72. IEEE, Noida, India. https://doi.org/10.1109/ICRITO48877.2020.9197817
Marra, F., Gragnaniello, D., Cozzolino, D., & Verdoliva, L. (2018). Detection of gan-generated fake images over social networks. In 2018 IEEE conference on multimedia information processing and retrieval (MIPR), pp. 384–389. IEEE, Miami, FL, USA. https://doi.org/10.1109/MIPR.2018.00084
Nakamura, K., Levy, S., & Wang, W. Y. (2019). r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection. arXiv preprint arXiv:1911.03854. https://doi.org/10.48550/arXiv.1911.03854
Pan, J. Z., Pavlova, S., Li, C., Li, N., Li, Y., & Liu, J. (2018). Content based fake news detection using knowledge graphs. In International semantic web conference, 11136.
Patacconi, A., & Vikander, N. (2015). A model of public opinion management. Journal of Public Economics, 128, 73–83.
Raschka, S. (2014). An overview of general performance metrics of binary classifier systems. arXiv preprint arXiv:1410.5330, 2–4.
Rashkin, H., Choi, E., Jang, J.Y., Volkova, S., & Choi, Y. (2017). Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 2931–2937. Association for Computational Linguistics, Copenhagen, Denmark.
Raskutti, G., Wainwright, M. J., & Yu, B. (2014). Early stopping and non-parametric regression: An optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1), 335–366.
Reddy, H., Raj, N., Gala, M., & Basava, A. (2020). Text-mining-based fake news detection using ensemble methods. International Journal of Automation and Computing, 17(2), 210–221.
Reis, J. C., Correia, A., Murai, F., Veloso, A., & Benevenuto, F. (2019). Supervised learning for fake news detection. IEEE Intelligent Systems, 34(2), 76–81.
Rubin, R. E. (2017). Foundations of library and information science.
Salloum, R., Ren, Y., & Kuo, C.-C.J. (2018). Image splicing localization using a multi-task fully convolutional network (mfcn). Journal of Visual Communication and Image Representation, 51, 201–209.
Samadi, M., Mousavian, M., & Momtazi, S. (2021). Deep contextualized text representation and learning for fake news detection. Information Processing & Management, 58(6), 102723.
Si, S., Wang, R., Wosik, J., Zhang, H., Dov, D., Wang, G., & Carin, L. (2020). Students need more attention: Bert-based attention model for small data with application to automatic patient message triage. In Machine Learning for Healthcare Conference, pp. 436–456.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://doi.org/10.48550/arXiv.1409.1556
Singhal, S., Shah, R.R., Chakraborty, T., Kumaraguru, P., & Satoh, S (2019). Spotfake: A multi-modal framework for fake news detection. In 2019 IEEE fifth international conference on multimedia big data (BigMM), pp. 39–47. IEEE, Singapore. https://doi.org/10.1109/BigMM.2019.00-44
Song, D., Ma, S., Sun, Z., Yang, S., & Liao, L. (2021). Kvl-bert: Knowledge enhanced visual-and-linguistic bert for visual commonsense reasoning. Knowledge-Based Systems, 230, 107408.
Subramanian, V. (2018). Deep learning with PyTorch: A practical approach to building neural network models using PyTorch.
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.
Van Aken, B., Winter, B., Löser, A., & Gers, F.A. (2019). How does bert answer questions? A layer-wise analysis of transformer representations. In Proceedings of the 28th ACM international conference on information and knowledge management, pp. 1823–1832 https://doi.org/10.1145/3357384.3358028
Veropoulos, K., Campbell, C., & Cristianini, N. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the International Joint Conference on AI, 55, 60.
Wang, W. Y. (2017). Liar, liar pants on fire: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 422–426). https://doi.org/10.48550/arXiv.1705.00648
Wang, Y., Ma, F., Jin, Z., Yuan, Y., Xun, G., Jha, K., Su, L., & Gao, J. (2018). Eann: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 849–857. https://doi.org/10.1145/3219819.3219903
Yuan, H., Xu, W., Li, Q., & Lau, R. (2018). Topic sentiment mining for sales performance prediction in e-commerce. Annals of Operations Research, 270(1), 553–576.
Zhang, J., Dong, B., & Yu, P. S. (2019). Deep diffusive neural network based fake news detection from heterogeneous social networks. In 2019 IEEE international conference on big data (Big Data), pp. 1259–1266.
Zhang, X., & Ghorbani, A. A. (2020). An overview of online fake news: Characterization, detection, and discussion. Information Processing & Management, 57(2), 102025.
Zhang, K., Guo, Y., Wang, X., Yuan, J., & Ding, Q. (2019). Multiple feature reweight densenet for image classification. IEEE Access, 7, 9872–9880.
Zhou, Z., Zhou, X., & Qian, L. (2021). Online public opinion analysis on infrastructure megaprojects: Toward an analytical framework. Journal of Management in Engineering, 37(1), 04020105.
Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., & Procter, R. (2018). Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2), 1–36.


Funding

The work is supported by the National Social Science Fund of China (Grant No. 21BTQ107) and the National Natural Science Foundation of China (Grant Nos. 71871116 and 72174086).

Author information

Authors and Affiliations

Department of Management Science and Engineering, College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, People’s Republic of China
Lingfei Qian, Ruipeng Xu & Zhipeng Zhou


Corresponding author

Correspondence to Zhipeng Zhou.

Ethics declarations

Conflict of interest

All authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Qian, L., Xu, R. & Zhou, Z. MRDCA: a multimodal approach for fine-grained fake news detection through integration of RoBERTa and DenseNet based upon fusion mechanism of co-attention. Ann Oper Res (2022). https://doi.org/10.1007/s10479-022-05154-9


Accepted: 15 December 2022
Published: 26 December 2022
DOI: https://doi.org/10.1007/s10479-022-05154-9
