1 Introduction

After more than two decades of existence, social media platforms have attracted a very large number of users. They enable the diffusion of information in real time, but regardless of its credibility, for two main reasons. First, there is no built-in means of verifying the veracity of the content circulating on social media. Second, users often publish messages without checking the validity and reliability of the information. Consequently, social networks, and particularly microblogging platforms, are a fertile ground for spreading rumors.

Widespread rumors can pose a threat to the credibility of social media and cause harmful consequences in real life. Thus, the automatic assessment of information credibility on microblogs, which we focus on, is crucial to provide decision support to, e.g., fact checkers. This task requires verifying the truthfulness of messages related to a particular event and returning a binary decision stating whether each message is authentic.

In the literature, most automatic rumor detection approaches address the task as a classification problem. They generally extract features from two aspects of messages: textual content (Pérez-Rosas et al. 2018) and social context (Wu and Liu 2018). However, the multimedia content of messages, particularly images, which carries a significant set of features, is little exploited.

In this paper, we support the hypothesis that image properties are important in rumor verification. Images indeed play a crucial role in the news diffusion process. For example, in the dataset collected by Jin et al. (2017), the average number of messages with an attached image is more than eleven times that of plain text messages.

Figure 1 shows two sample rumors posted on Twitter. In Fig. 1a, it is hard to assess veracity from the text, but the likely-manipulated image hints at a rumor. In Fig. 1b, it is hard to assess veracity from either the text or the image, because the image has been taken out of its original context.

Fig. 1 Two sample rumors posted on Twitter

Furthermore, most of the literature focuses on features used to train a wide range of machine learning (Volkova and Jang 2018) and deep learning (Wang et al. 2018) methods. However, although recent studies demonstrate the effectiveness of ensemble learning (Gutierrez-Espinoza et al. 2020), such models have not been applied to rumor detection.

Based on the above observations, we aim to leverage all the modalities of microblog messages for verifying rumors, that is, features extracted from the textual content and social context of messages, as well as hitherto unused visual and statistical features derived from images. Consequently, all types of features must be fused to allow a supervised machine learning classifier to evaluate the credibility of messages. Moreover, motivated by recent research applying ensemble learning to classification problems (Pang et al. 2016), we design various metalearning models to investigate the performance of ensemble learning for rumor classification.

Our contribution is threefold. First, we propose the use of a set of image features inspired by the field of Image Quality Assessment (IQA) and we show that they contribute very effectively to the verification of message veracity. These metrics estimate the rate of noise and quantify the amount of visual degradation of any type in an image. They prove to be good indicators for detecting fake images, even those generated by advanced techniques such as Generative Adversarial Networks (GANs) (Goodfellow et al. 2014). To the best of our knowledge, we are the first to systematically exploit this type of image features to check the veracity of microblog posts.

Second, we detail the Multimodal fusiON framework to assess message veracIty in social neTwORks (MONITOR) (Azri et al. 2021), which exploits all types of message features and leverages four machine learning models that provide explainability and interpretability about the decisions made.

Third, we demonstrate the benefit of ensemble learning, by developing five metalearning models (soft and weighted average voting, stacking, blending, and super learner ensemble) that exploit the above four machine learning models, and we compare their performance with MONITOR’s. To the best of our knowledge, we are the first to apply metalearning models for tackling the rumor detection task.

Eventually, we conduct extensive experiments on two real-world datasets to show the effectiveness of our rumor detection approach. MONITOR indeed outperforms all state-of-the-art machine learning baselines with an accuracy and F1-score of up to 96% and 89% on the MediaEval benchmark (Boididou et al. 2015) and the FakeNewsNet dataset (Shu et al. 2018), respectively. Furthermore, all metalearning algorithms notably increase MONITOR’s performance.

The remainder of this paper is organized as follows. In Section 2, we review all the research related to our problem. In Section 3, we detail MONITOR and especially its feature extraction and selection. In Section 4, we present and comment on the experimental results that we achieve with respect to state-of-the-art methods. In Section 5, we investigate and discuss the performance of ensemble models. Finally, in Section 6, we conclude this paper and outline future research.

2 Related Works

Related work can be divided into the following categories:

  1. non-image features and image features that are essential for checking the veracity of microblog posts;

  2. background information regarding ensemble learning models and their usage for rumor classification.

2.1 Non-image Features

Studies in the literature present a wide range of non-image features, which may be divided into two subcategories: textual features and social context features. To classify a message as fake or real, Castillo et al. (2011) capture prominent statistics in tweets, such as the counts of words, capitalized characters and punctuation marks. Beyond these statistics, lexical words expressing specific semantics or sentiments are also counted. Many sentiment-related lexical features have been proposed (Kwon et al. 2013), relying on the Linguistic Inquiry and Word Count (LIWC) tool to count words in meaningful categories.

Other works exploit syntactic features, such as the number of keywords, the sentiment score or polarity of the sentence. Features based on topic models are used to understand messages and their underlying relations within a corpus. Wu et al. (2015) train a Latent Dirichlet Allocation model (Blei et al. 2003) with a defined set of topic features to summarize semantics for detecting rumors.

The social context describes the propagating process of a rumor (Shu et al. 2018). Social network features are extracted by constructing specific networks, such as diffusion (Kwon et al. 2013) or co-occurrence networks (Ruchansky et al. 2017).

Recent approaches detect fake news based on temporal-structure features. Kwon et al. (2017) study the stability of features over time and find that, for rumor detection, linguistic and user features are suitable in the early stage, while structural and temporal features tend to perform well in the longer term.

2.2 Image Features

Although images are widely shared on social networks, their potential for verifying the veracity of messages in microblogs is not sufficiently explored. Morris et al. (2012) assume that the user’s profile image has an important impact on information credibility. Images attached to messages are only described with very basic features. Wu et al. (2015) define a feature called “has multimedia” to mark whether a tweet has any picture, video or audio attached. Gupta et al. (2013) propose a classification model to identify fake images on Twitter during Hurricane Sandy. However, their work is still based on textual content features.

To automatically predict whether a tweet that shares multimedia content is fake or real, Boididou et al. (2015) propose the Verifying Multimedia Use (VMU) task. Textual and image forensics (Li et al. 2014) features are used as baseline features for this task. They conclude that Twitter media content is not amenable to image forensics and that forensics features do not lead to consistent VMU improvement (Boididou et al. 2018).

2.3 Ensemble Learning

Ensemble learning refers to the generation and combination of multiple inducers to solve a particular machine learning task. The intuitive explanation for the ensemble methodology stems from human nature: decision making by a group of individuals often results in a more accurate, useful or correct outcome than a decision made by any single member of the group. This is generally referred to as the wisdom of the crowd (Surowiecki 2005). Using ensemble learning, the performance of poorly performing classifiers can be improved by creating, training and combining the outputs of multiple classifiers, thus resulting in a more robust classification. There are three main approaches to developing an ensemble learner (Zhang and Ma 2012):

  • boosting uses homogeneous base models trained sequentially;

  • bagging (Bootstrap AGGregatING) uses homogeneous base models trained in parallel;

  • stacking mostly uses heterogeneous base models trained in parallel and combined using a metamodel.

By averaging (or voting) the outputs produced by the pool of classifiers, ensemble methods provide better predictions and avoid overfitting. Another reason for the better performance of ensemble learning is its ability to escape local minima: using multiple models widens the search space and increases the chance of finding a better output (Sagi and Rokach 2018).

Recently, ensemble learning methods have shown good performance in various applications, including solar irradiance prediction (Lee et al. 2020), slope stability analysis (Pham et al. 2021), natural language processing (Sangamnerkar et al. 2020), malware detection (Gupta and Rani 2020), COVID-19 detection (Singh et al. 2021), movie success prediction (Lee et al. 2018) and blood donor detection (Kauten et al. 2021). Compared with these applications, rumor classification using ensemble learning techniques has been very little studied.

Kaur et al. (2020) propose a multilevel voting model for the fake news detection task. The study concludes that the proposed model outperforms both individual machine learning and ensemble learning models. To address the multiclass fake news detection problem, Kaliyar et al. (2019) use gradient boosting ensemble techniques and compare their performance with several individual machine learning models. Results demonstrate the effectiveness of the ensemble framework compared with existing benchmark performance. Finally, Al-Ash et al. (2019) find that the bagging approach performs better than Support Vector Machines (SVMs), Multinomial Naïve Bayes (MNB) and Random Forest at detecting fake news.

3 MONITOR

Microblog messages contain rich multimodal resources, such as text contents, surrounding social context and attached images. Our focus is to leverage this multimodal information to determine whether a message is true or false. Based on this idea, we propose a framework for verifying the veracity of messages. MONITOR’s detailed description is presented in this section.

3.1 Multimodal Fusion Overview

Figure 2 shows a general overview of MONITOR, which works in two main stages. First, we extract several features from the message’s text and social context. Then, we apply a feature selection algorithm to identify relevant features, which form a first set of textual features. From the attached image, we derive statistics and efficient visual features inspired by the IQA field, which form a second set of image features. Second, we train a model by concatenating and normalizing the textual and image feature sets to form a fusion vector. Several machine learning classifiers can learn from the fusion vector to predict the veracity of the message, i.e., real or fake.

Fig. 2 Overview of MONITOR

3.2 Feature Extraction and Selection

To better extract features, we reviewed the best practices followed by information professionals, e.g., journalists, when verifying content generated by social network users. We based our thinking on relevant data from journalistic studies (Martin and Comm 2014) and the Verification Handbook (Silverman 2014). We define a set of features that capture discriminating characteristics of rumors. These features are mainly derived from three principal aspects of news information: content, social context and visual content. The feature selection process is only applied to the content and social context feature sets, to remove irrelevant features that could negatively impact performance. Because our focus is the visual feature set, we retain all of its features in the learning process.

3.2.1 Message Content Features

Content features are extracted from the message’s text. We extract characteristics such as the length of a tweet and the number of words. We also include statistics such as the number of exclamation and question marks, as well as binary features indicating the existence or not of emoticons. Furthermore, other features are extracted from the linguistics of a text, including the number of positive and negative sentiment words. Additional binary features indicate whether the text contains personal pronouns.

We also calculate a readability score for each message using the Flesch Reading Ease method (Kincaid et al. 1975). The higher this score is, the easier the text is to read. Other features are extracted from the informative content provided by the specific communication style of the Twitter platform, such as the number of retweets, mentions (@), hashtags (#) and URLs.
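To make these features concrete, the following minimal sketch shows how a few of them could be computed for a single tweet. It is an illustration rather than MONITOR’s actual extraction code, and it assumes the textstat library for the Flesch Reading Ease score.

```python
import re

import textstat  # assumed dependency for the Flesch Reading Ease score


def content_features(tweet_text: str) -> dict:
    """Compute an illustrative subset of the content features of Table 1."""
    return {
        "text_length": len(tweet_text),
        "num_words": len(tweet_text.split()),
        "num_exclamation": tweet_text.count("!"),
        "num_question": tweet_text.count("?"),
        "has_emoticon": int(bool(re.search(r"[:;]-?[)(DP]", tweet_text))),
        "num_mentions": tweet_text.count("@"),
        "num_hashtags": tweet_text.count("#"),
        "num_urls": len(re.findall(r"https?://\S+", tweet_text)),
        "readability": textstat.flesch_reading_ease(tweet_text),
    }


print(content_features("Shark swimming on a flooded highway?! Unbelievable!! http://t.co/xyz"))
```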

3.2.2 Social Context Features

The social context reflects the relationships between different users. Therefore, social context features are extracted from the behavior of users and the propagation network. We capture several features from the users’ profiles, such as the number of followers and friends, the number of tweets the user has authored, the number of tweets the user has liked and whether the user is verified by the social media. We also extract features from the propagation tree that can be built from tweets and retweets, such as the depth of the retweet tree. Tables 1 and 2 describe the sets of content features and social context features extracted from each message.

Table 1 Content features
Table 2 Social context features

To improve the performance of MONITOR, we apply a feature selection algorithm on the feature sets listed in Tables 1 and 2. The details of the feature selection process are discussed in Section 4.

3.2.3 Image Features

To differentiate between false and real images in messages, we propose to exploit visual content features and visual statistical features that are extracted from the attached images.

Visual Content Features

Usually, a news consumer judges image veracity based on subjective perception, but how can the human perception of image quality be represented quantitatively? The quality of an image refers to the amount of visual degradation of all types present in the image, such as noise, blocking artifacts, blurring, fading and so on.

The IQA field aims to quantify human perception of image quality by providing an objective score of image degradations based on computational models (Maître 2017). Such degradations are introduced during different processing stages, such as image acquisition, compression, storage, transmission and decompression. Inspired by the potential relevance of IQA metrics in our context, we use these metrics in an original way, for a purpose different from what they were created for. More precisely, we hypothesize that the quantitative evaluation of the quality of an image can be useful for veracity detection.

IQA is mainly divided into two areas of research: full-reference and no-reference evaluation. Full-reference algorithms compare the input image against a pristine reference image with no distortion. In no-reference algorithms, the only input is the image whose quality is to be measured. In our case, we do not have the original version of the posted image. Therefore, the approach that fits our context is no-reference evaluation. We use three no-reference algorithms that have been demonstrated to be highly efficient: the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) by Mittal et al. (2011), the Naturalness Image Quality Evaluator (NIQE) by Mittal et al. (2012) and the Perception based Image Quality Evaluator (PIQE) by Venkatanath et al. (2015).
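As an illustration, the sketch below computes a BRISQUE score for an attached image with the piq library; this is an assumption about tooling, since the paper does not specify its implementation, and NIQE and PIQE are left as hypothetical helpers because Python ports vary (reference implementations exist in MATLAB).

```python
import torch
import piq  # assumed library providing a BRISQUE implementation
from PIL import Image
from torchvision.transforms.functional import to_tensor


def brisque_score(image_path: str) -> float:
    """No-reference BRISQUE score; lower values indicate better perceptual quality."""
    img = to_tensor(Image.open(image_path).convert("RGB")).unsqueeze(0)  # (1, C, H, W), values in [0, 1]
    return piq.brisque(img, data_range=1.0).item()


# NIQE and PIQE would be computed analogously with a suitable implementation,
# e.g., hypothetical helpers wrapping MATLAB's niqe/piqe functions:
# niqe_score = niqe(image_path)
# piqe_score = piqe(image_path)
```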

For example, Fig. 3 displays the BRISQUE score computed for a natural image and its distorted versions (compression, noise and blurring distortions). The BRISQUE score is a non-negative scalar, typically in the range [0, 100]; lower scores reflect better perceptual image quality.

Fig. 3 BRISQUE score computed for a natural image and its distorted versions

No-reference IQA metrics are also good indicators for other types of image modification, such as GAN-generated images. These techniques can modify the context and semantics of images in a very realistic way. Unlike many image analysis tasks, where both a reference and a reconstructed image are available, images generated by GANs may not have any reference image. This is the main reason for using no-reference IQA to evaluate this type of fake image. Figure 4 displays the BRISQUE score computed for real and fake images generated by GAN-based image-to-image translation (Zhu et al. 2017).

Fig. 4 BRISQUE score computed for real and fake GAN-generated images

Statistical Features

From the attached images, we define four statistical features along two aspects.

  • Number of images: A user can post one, several or no images. To capture this aspect, we count the total number of images in a rumor event and the ratio of posts containing more than one image.

  • Spreading of images: During an event, some images are heavily replied to and generate more comments than others. We compute the ratio of such images to capture this aspect.

Table 3 describes our visual and statistical features. We use all of these features in the learning process.

Table 3 Description of image features
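A minimal sketch of how such event-level statistics could be derived from a table of posts is given below; the pandas column names and the comment threshold are hypothetical.

```python
import pandas as pd


def image_statistics(posts: pd.DataFrame, comment_threshold: int = 10) -> dict:
    """posts: one row per post of an event, with hypothetical columns
    'num_images' (images attached to the post) and 'num_comments'."""
    with_images = posts[posts["num_images"] > 0]
    return {
        "total_images": int(posts["num_images"].sum()),
        "ratio_multi_image_posts": float((posts["num_images"] > 1).mean()),
        # ratio of image posts that trigger many comments (threshold is an assumption)
        "ratio_widely_commented": float((with_images["num_comments"] >= comment_threshold).mean()),
    }
```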

3.3 Model Training

So far, we have obtained a first set of relevant textual features through a feature selection process. We also have a second set of image features composed of statistical and visual features. These two sets of features are scaled, normalized and concatenated to form the multimodal representation of a given message, which is learned by a supervised classifier. Several learning algorithms can be implemented for message veracity classification. We investigate the algorithms that provide the best performance in Section 4.
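A minimal sketch of this fusion and training step with scikit-learn is given below; the feature matrices X_text and X_img and the label vector y are assumptions standing in for the extracted feature sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

# X_text: selected textual and social context features; X_img: visual + statistical features
# (hypothetical arrays of shape (n_messages, n_features)); y: labels (0 = real, 1 = fake).
# In practice, the scalers would sit inside a Pipeline to avoid leaking test statistics.
X_fused = np.hstack([MinMaxScaler().fit_transform(X_text),
                     MinMaxScaler().fit_transform(X_img)])  # early fusion vector

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # one candidate classifier
print(cross_val_score(clf, X_fused, y, cv=5, scoring="accuracy").mean())
```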

4 Regular Machine Learning Experiments

In this section, we conduct extensive experiments on two public datasets. First, we present statistics about the datasets we use. Then, we describe the experimental settings: a brief review of state-of-the-art features for news verification and a selection of the best of these textual features as baselines. Finally, we present the experimental results and analyze feature importance to gain insights into MONITOR.

4.1 Datasets

To evaluate MONITOR’s performance, we conduct experiments on two well-established public datasets for rumor detection. The detailed statistics of these two datasets are listed in Table 4.

Table 4 MediaEval and FakeNewsNet statistics

4.1.1 MediaEval

MediaEval (Boididou et al. 2015) is collected from Twitter and includes all three characteristics: text, social context and images. It is designed for message-level verification. The dataset has two parts: a development set containing about 9,000 rumor and 6,000 non-rumor tweets from 17 rumor-related events, and a test set containing about 2,000 tweets from another batch of 35 rumor-related events. We remove tweets without any text or image, thus obtaining a final dataset of 411 distinct images associated with 6,225 real and 7,558 fake tweets.

4.1.2 FakeNewsNet

FakeNewsNet (Shu et al. 2018) is one of the most comprehensive fake news detection benchmarks. Fake and real news articles are collected from the fact-checking websites PolitiFact and GossipCop. Since we are particularly interested in images in this work, we extract and exploit the image information of all tweets. To keep the dataset balanced, we randomly choose 2,566 real and 2,587 fake news events. After removing tweets without images, we obtain 56,369 tweets and 59,838 images.

4.2 Experimental Settings

4.2.1 Baseline Features

We compare the effectiveness of our feature set with the best textual features from the literature. First, we adopt the 15 best features extracted by Castillo et al. (2011) to analyze the information credibility of news propagated through Twitter. We also collect a total of 40 additional textual features from the literature (Gupta et al. 2013, 2012; Kwon et al. 2013; Wu et al. 2015), which are extracted from text content, user information and propagation properties (Table 5).

Table 5 Features from the literature

4.2.2 Feature Sets

The features labeled Textual are the best features selected among message content and social context features (Tables 1 and 2). We select them with the information gain ratio method (Karegowda et al. 2010), which helps select a subset of 15 relevant textual features with an information gain larger than zero (Table 6).

Table 6 Best textual features selected
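A comparable selection step can be sketched with scikit-learn’s mutual information estimator; note that this is a stand-in for the information gain ratio computation actually used, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# X: DataFrame of content and social context features (Tables 1 and 2); y: veracity labels
scores = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
selected = scores[scores > 0].sort_values(ascending=False).head(15).index.tolist()
X_textual = X[selected]  # the "Textual" feature set
```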

The features labeled Image are all the image features listed in Table 3. The features labeled MONITOR are the feature set that we propose, consisting of the fusion of the textual and image feature sets. The features labeled Castillo are the above-mentioned 15 best textual features. Finally, the features labeled Wu are the 40 textual features identified in the literature.

4.2.3 Model Construction

We cannot know beforehand which model will be good for our problem or which configuration to use. By analyzing both datasets, we found that classes are partially linearly separable in some dimensions. Thus, we evaluate a mix of simple linear and non-linear algorithms. The best results are achieved by four supervised classification algorithms: Classification and Regression Trees (CART), k-Nearest Neighbors (KNN), Support Vector Machines (SVMs) and Random Forest (RF). Then, we optimize the hyper-parameters of each model (Table 7) by testing multiple settings with the GridSearchCV function from the Python Scikit-Learn library (Pedregosa et al. 2011). Subsequently, we perform training and validation for each model through 5-fold cross-validation to obtain stable out-of-sample results. To implement the models, we again use scikit-learn. Note that, for MediaEval, we retain the same data split scheme. For FakeNewsNet, we randomly divide data into training and testing subsets with a 0.8:0.2 ratio. Table 8 presents the results of our experiments.
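A condensed sketch of this construction step is shown below; the hyper-parameter grids are placeholders, since the actual configuration space is the one defined in Table 7.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {  # placeholder grids; the real search space is given in Table 7
    "CART": (DecisionTreeClassifier(), {"max_depth": [5, 10, None]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "SVM": (SVC(probability=True), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "RF": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
}
best = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X_train, y_train)  # X_train, y_train: fused training features and labels
    best[name] = search.best_estimator_
```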

Table 7 Hyper-parameters configuration space
Table 8 Performance of individual machine learning models. Bold entries indicate the best performance achieved for each evaluation metric

4.3 Classification Results

From the classification results recorded in Table 8, we can make the following observations.

4.3.1 Performance Comparison

With MONITOR, using both image and textual features allows all classification algorithms to achieve better performance than the baselines. Among the four classification models, RF generates the best accuracy: 96.2% on MediaEval and 88.9% on FakeNewsNet, performing 26% and 18% better than Castillo, and 24% and 15% better than Wu, on MediaEval and FakeNewsNet, respectively.

Compared to the 15 “best” textual feature set, RF improves the accuracy by more than 22% and 10% with image features only. Similarly, the other three algorithms achieve comparable improvements when image features are added.

While image features play a crucial role in rumor verification, we must not ignore the effectiveness of textual features. The role of image and textual features is complementary. When the two sets of features are combined, performance is significantly boosted.

4.3.2 Illustration by Example

To show the complementarity between text and images more clearly, we compare the results achieved by MONITOR with those of single-modality approaches (text only or image only). The fake rumor messages from Fig. 1 (Section 1) are correctly detected as false by MONITOR, whereas using a single modality (text only for Fig. 1a, image only for Fig. 1b) wrongly labels them as real.

In the tweet from Fig. 1a, the text content solely describes the attached image without giving any sign of the veracity of the tweet. This is why the textual modality identifies this tweet as real. It is the attached image that looks quite suspicious. By combining textual and image contents, MONITOR identifies the veracity of the tweet with a high score, exploiting clues from the image to reach the right classification.

The tweet from Fig. 1b is an example of a rumor correctly classified by MONITOR, but incorrectly classified when only the visual modality is used. The image seems normal and its complex semantics are very difficult for the image modality to capture. However, the words with strong emotions in the text indicate that it might be a suspicious message. By combining the textual and image modalities, MONITOR can classify the tweet with a high confidence score.

4.4 Feature Analysis

The advantage of our approach is that we can achieve some degree of interpretability. To this aim, we conduct an analysis to illustrate the importance of each feature set. We depict the 15 most important features according to RF in Fig. 5, which shows that, for both datasets, visual characteristics are among the top-five features. The remaining features are a mix of text content and social context features. These results validate the effectiveness of the IQA image features, as well as the importance of fusing several modalities in the rumor verification process.

Fig. 5 Random Forest feature importance

Finally, to illustrate the discriminating capacity of these features, we draw box plots for each of the top-15 features on both datasets. Figure 6 shows that several features exhibit a significant difference between the fake and real classes, which explains our good results.

Fig. 6 Distribution of true and false classes for the top-15 important features

4.5 Early and Late Fusion

In our previous experiments, we fuse visual and textual modalities into a single multimodal vector before the learning and classification steps, in the so-called early fusion manner. Another way to merge features is late fusion.

This class of fusion scheme works at the decision level, by combining the prediction scores available for each modality. Late fusion starts with the extraction of unimodal features. In contrast to early fusion, where features are combined into a multimodal representation, late fusion approaches learn directly from unimodal features. The predicted probability scores are combined afterwards to yield a final detection score. Several methods help combine scores, such as averaging, voting or using another machine learning method to learn how to best combine predictions.

To apply late fusion, we train two Random Forest (RF) classifiers by learning separately the visual and textual features (Fig. 7).

Fig. 7 Late fusion scheme

To obtain the final classification results, the predicted probabilities of both classifiers are combined (1) with equal weights, assuming that the two models are equally skillful and contribute equally to the final prediction; and (2) with optimized weights, learned by feeding the classifiers’ outputs to a logistic regression model.
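The two combination strategies can be sketched as follows; the split variable names are assumptions, and in a real setup the logistic regression weights would be learned on held-out predictions rather than on the same data the base classifiers were fit on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf_text = RandomForestClassifier(random_state=42).fit(X_text_train, y_train)
rf_img = RandomForestClassifier(random_state=42).fit(X_img_train, y_train)

p_text = rf_text.predict_proba(X_text_test)[:, 1]
p_img = rf_img.predict_proba(X_img_test)[:, 1]

# (1) equal weights: simple average of the two probability scores
y_pred_equal = ((p_text + p_img) / 2 >= 0.5).astype(int)

# (2) optimized weights: a logistic regression learns how to combine the two scores
stacker = LogisticRegression().fit(
    np.column_stack([rf_text.predict_proba(X_text_train)[:, 1],
                     rf_img.predict_proba(X_img_train)[:, 1]]),
    y_train)
y_pred_weighted = stacker.predict(np.column_stack([p_text, p_img]))
```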

Figure 8 shows that, for both datasets, the early fusion method and the two late fusion strategies, i.e., equal weights and optimized weights, boost the prediction, at different rates, compared with using the two feature sets separately. Early fusion achieves the highest performance score, while among the late fusion techniques, equal weights are slightly more efficient than optimized weights.

Fig. 8 Performance of early and late fusion

Late fusion’s performance is lower than that of early fusion because, when we train two models separately on visual and textual features, some dependencies between features are lost. Practically, there are some correlations between features, e.g., between BRISQUE and Num_Mention or between PIQE and Text_Length. The potential loss of correlation in the mixed feature space is a drawback of late fusion. Another disadvantage of late fusion is its cost in terms of learning effort, as every modality requires a separate supervised learning stage. Moreover, the combined representation requires an additional learning stage.

5 Ensemble Learning Performance

Applied machine learning often involves fitting and evaluating models on a dataset. Given that we cannot know beforehand which model will perform best on the dataset, this may involve a lot of trial and error until we find a model that performs well enough. This is akin to making a decision by consulting the single best expert we can find. A complementary approach is to prepare multiple, different models and then combine their predictions using an ensemble machine learning model.

Because ensemble learning strategies such as bagging and boosting typically involve a single machine learning algorithm (generally a decision tree), we instead use the stacking strategy (also called metalearning), which seeks a diverse group of members by varying the model types. Figure 9 summarizes the key elements of a stacking ensemble:

  • an unchanged training dataset;

  • various machine learning algorithms (base models) for each ensemble member;

  • a machine learning model (metamodel) to learn how to best combine predictions.

Fig. 9 Stacking ensemble

To measure the performance of ensemble learning models for rumor detection, we develop five metamodels as variants of the stacking strategy.

5.1 Metamodels

5.1.1 Voting Ensemble

We construct two voting models. The first one is a soft voting model called MONITOR\(_{sv}\) that sums the class probabilities predicted by the classification models listed in Table 8 and predicts the class label with the largest summed probability. The second one is a weighted average voting model called MONITOR\(_{wav}\), where model votes are proportional to model performance. The performance of each base model on the training dataset is used as its relative weight when making predictions. Performance is measured as classification accuracy, i.e., the ratio of correct predictions, ranging between 0 and 1, with larger values meaning a better model and, in turn, a greater contribution to the prediction.
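Both schemes map directly onto scikit-learn’s VotingClassifier; in the sketch below, the base models and the training data are assumed to be configured as in Section 4, and the MONITOR\(_{wav}\) weights are derived from cross-validated training accuracy.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# base models assumed already configured; SVC needs probability=True for soft voting
estimators = [("cart", cart), ("knn", knn), ("svm", svm), ("rf", rf)]

# MONITOR_sv: soft voting, every base model contributes equally
monitor_sv = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)

# MONITOR_wav: votes proportional to each base model's accuracy on the training dataset
weights = [cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
           for _, model in estimators]
monitor_wav = VotingClassifier(estimators, voting="soft", weights=weights).fit(X_train, y_train)
```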

5.1.2 Canonical Stacking Ensemble

Following Wolpert (1992)’s canonical stacking strategy (Fig. 9), we construct a model called MONITOR\(_{st}\). Concretely, we use three repeats of stratified 10-fold cross-validation on the four classification models to prepare the training dataset (out-of-fold predictions) for the logistic regression metamodel. Furthermore, we train the metamodel on the prepared dataset as well as the original training dataset, using 5-fold cross-validation. This provides additional context to the metamodel so that it can better combine predictions.
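A sketch of this configuration with scikit-learn’s StackingClassifier follows; passthrough=True forwards the original training features to the metamodel alongside the base-model predictions, mirroring the additional context described above (an illustration, not the exact experimental code).

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

monitor_st = StackingClassifier(
    estimators=[("cart", cart), ("knn", knn), ("svm", svm), ("rf", rf)],
    final_estimator=LogisticRegression(),
    cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42),
    passthrough=True,  # also feed the original features to the metamodel
)
monitor_st.fit(X_train, y_train)
```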

5.1.3 Blending Ensemble

Blending was the term commonly used for stacking ensembles during the Netflix Prize competition, won in 2009. The competition involved teams seeking movie recommendation algorithms that performed better than Netflix’s native algorithm. A one million US dollar prize was awarded to the team achieving a 10% performance improvement.

In this stacking-type ensemble, base models are fit on the training dataset and the metamodel is trained on the predictions made by each base model on a validation dataset. At the time of writing, scikit-learn does not natively support blending. Thus, we implement a blending model called MONITOR\(_{bld}\) on top of scikit-learn models.

To implement our model, we first split the dataset into training and test sets. Then, the training set is split again into two subsets used to train the base models and the metamodel, respectively. We use a 50/50 split between the training and test sets and a 67/33 split between the base-training and validation sets (Fig. 10). Furthermore, we choose logistic regression as the metamodel (the blender), for the same reasons mentioned for canonical stacking. We summarize the key implementation steps of our model in Algorithm 1.

Fig. 10 Dataset splitting

Algorithm 1 Key implementation steps of the blending model
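The following sketch illustrates the blending procedure following the splits described above; it is an illustration rather than the exact implementation, with base models assumed to be configured as in Section 4.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 50/50 split into training and test sets, then 67/33 into base-training and validation sets
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
X_base, X_val, y_base, y_val = train_test_split(X_tr, y_tr, test_size=0.33, random_state=42)

base_models = [cart, knn, svm, rf]  # SVC assumed created with probability=True
for model in base_models:
    model.fit(X_base, y_base)

# the metamodel (blender) is trained on the base models' validation-set probabilities
meta_X_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
blender = LogisticRegression().fit(meta_X_val, y_val)

# prediction: stack base-model probabilities on the test set, then let the blender decide
meta_X_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
y_pred = blender.predict(meta_X_test)
```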

5.1.4 Super Learner Ensemble

A super learner ensemble (Van der Laan et al. 2007) is a specific stacking configuration where all base models use the same k-fold splits of the data, and a metamodel is fit on the out-of-fold predictions from each model. We summarize this procedure in Algorithm 2. Moreover, Fig. 11, which is reproduced from the original paper by Van der Laan et al. (2007), depicts its data flow. We use the MLENS Python library (Flennerhag 2017) to implement the super learner model, called MONITOR\(_{sl}\), where we split the training data into \(k = 10\) folds. The number of base models is set to \(m = 4\) (i.e., KNN, CART, SVM and RF).

Algorithm 2 Super learner ensemble procedure

Fig. 11 Super learner ensemble data flow, reproduced from Van der Laan et al. (2007)
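A minimal sketch with MLENS is given below; the logistic regression metamodel is an assumption, chosen for consistency with the other stacking variants, and the base models are assumed to be configured as in Section 4.

```python
from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# MONITOR_sl: k = 10 folds and m = 4 base models (KNN, CART, SVM, RF)
monitor_sl = SuperLearner(folds=10, scorer=accuracy_score, random_state=42)
monitor_sl.add([knn, cart, svm, rf])       # base models, trained on out-of-fold splits
monitor_sl.add_meta(LogisticRegression())  # metamodel fit on the out-of-fold predictions
monitor_sl.fit(X_train, y_train)
y_pred = monitor_sl.predict(X_test)
```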

Table 9 summarizes the results achieved by the best individual machine learning model (RF) and the five stacking algorithms.

Table 9 Performance of MONITOR and stacking ensemble models. Bold entries indicate the best performance achieved for each evaluation metric

5.2 Result Analysis

Our comparative analysis of the experimental results shows that all metalearning models are more efficient than the best individual machine learning model (RF), because, by combining multiple models, the errors of a single base model are likely to be compensated by the other models. As a result, the overall prediction performance of the ensemble is better than that of any single base model.

Moreover, for both datasets, the canonical stacking algorithm outperforms all models, with 98.4% and 93.6% accuracy on the MediaEval and FakeNewsNet datasets, respectively. The stacking model indeed takes advantage of the diversity of predictions made by the contributing models: all algorithms are skillful on the classification problem, but in different ways. Figures 12 and 13 depict the accuracy score box plots and Receiver Operating Characteristic (ROC) curves of the canonical stacking ensemble model compared with the standalone machine learning algorithms (MONITOR-RF, CART, KNN and SVM) on MediaEval and FakeNewsNet, respectively.

Fig. 12 Stacking ensemble model vs. standalone models on MediaEval

Fig. 13 Stacking ensemble model vs. standalone models on FakeNewsNet

Among the five ensemble models, the soft voting algorithm achieves the worst results, because it treats all models the same, i.e., all models contribute equally to the prediction. Although the canonical stacking algorithm performs best, the blending and super learner algorithms achieve scores that are very close to those of stacking and therefore also prove useful for rumor classification.

6 Conclusion and Perspectives

To assess the veracity of messages posted on social networks, most existing techniques ignore visual contents and use traditional machine learning models for classification, although ensemble approaches are considered state-of-the-art solutions for many machine learning challenges. Hence, in this paper, to improve the performance of message verification, we propose a multimodal fusion framework called MONITOR that uses features extracted from the textual content of messages and the social context, as well as image features that had not been considered until now. We compare the performance of MONITOR with that of five metalearning ensemble models that combine four base predictors (KNN, CART, SVM and RF). Extensive experiments conducted on the MediaEval benchmark and the FakeNewsNet dataset show that:

  • the image features that we introduce play a key role in message veracity assessment;

  • no single homogeneous feature set can generate the best results alone;

  • all ensemble algorithms outperform the best single base model (RF), and canonical stacking achieves the best performance on both datasets.

Our future research includes two directions. First, in the short term, we plan to experiment with other, larger datasets and to vary the type, combination and number of base models in the ensemble. Second, we plan to compare MONITOR’s performance with that of a deep learning-based approach to rumor classification, deepMONITOR (Azri et al. 2021), with the aim of studying the tradeoff between classification accuracy, computational complexity and explainability.