1 Introduction

1.1 Motivation

Social media platforms, such as TwitterFootnote 1 and Chinese Sina Weibo,Footnote 2 have become important channels through which people acquire the latest news and express their opinions freely.Footnote 3, Footnote 4 However, the convenience and openness of social media have also promoted the proliferation of fake news, i.e., news with intentionally false information, which not only disturbs the cyberspace order but also causes many detrimental effects on real-world events. For example, in the political field, during the month before the 2016 U.S. presidential election, Americans encountered between one and three fake stories on average from known publishers [1], which inevitably misled voters and influenced the election results. In the economic field, a piece of fake news claiming that Barack Obama was injured in an explosion wiped out $130 billion in stock value.Footnote 5 In the social field, dozens of innocent people were beaten to death by locals in India because of a piece of fake news about child trafficking that was widely spread on social media.Footnote 6 Hence, the automatic detection of fake news has become an urgent problem of great concern in recent years [18, 33, 49].

The development of multimedia technology has promoted the evolution of self-media news from text-based posts to multimedia posts with images or videos, which attract more attention from consumers and provide more credible storytelling. On the one hand, as a vivid form of description, visual content including images and videos is more attractive and salient than plain text and consequently boosts news propagation. For instance, tweets with images get 18% more clicks, 89% more likes, and 150% more retweets than those without images.Footnote 7 On the other hand, visual content is commonly regarded as evidence of a story, which can increase the credibility of the news.Footnote 8 Unfortunately, this advantage is also exploited by fake news. For rapid dissemination, fake news usually contains misrepresented or even tampered images or videos to attract and mislead consumers. As a result, visual content has become an important part of fake news that cannot be neglected, making multimedia fake news detection a new challenge.

Multimedia fake news detection aims at effectively utilizing information from several modalities, such as the textual, visual and social modalities, to detect fake news. The visual modality provides abundant visual information, which has preliminarily been proven effective in fake news detection [15]. However, although the importance of exploiting visual content has been revealed, our understanding of the role of visual content in fake news detection remains limited. To further facilitate research on this problem, we present a comprehensive review of the visual content in fake news in this chapter, including the problem definition, available visual characteristics, representative detection approaches and challenging problems.

1.2 Problem Definition

In this subsection, we introduce the concept of fake news and analyze the different types of visual content in fake news.

Fake news is widely defined as news articles that are intentionally and verifiably false and could mislead consumers [1, 20, 33]. In the context of social multimedia, news articles refer to news posts with multimedia content that are published by users, so the general definition of fake news has been further refined [3, 5, 6, 46]. Formally, we state the refined definition as follows.

Definition 1.1

A piece of fake news is a news post that shares multimedia content that does not faithfully represent the event that it refers to.

In real-world scenarios, the visual content in fake news can be broadly classified into three categories: (1) visual content that is deliberately manipulated (also known as tampering, doctoring or photoshopping) or automatically generated by deep generative networks, which corresponds to fake images/videos in the common sense (see Fig. 1a); (2) visual content from an irrelevant event, such as a past event, a staged work or an artwork, that is reposted as being captured in the context of an emerging event (see Fig. 1b); or (3) visual content that is real (not edited) but is published together with a false claim about the depicted event (see Fig. 1c). All examples in Fig. 1 fall under our definition of fake news, because the images and associated texts jointly convey misleading information regardless of the veracity of the textual or the visual content itself. For this reason, fake news is also referred to as misleading content [6] or fauxtography [46] in the context of social multimedia.

Fig. 1

Examples of the visual content in fake news: (a) A tampered image where Putin is spliced on the middle seat at G-20 to show that he is in the center position of an intense discussion among other world leaders; (b) A real image captured in 2009 New York air crash, but it is claimed to be the wrecked Malaysia Airlines MH370 in 2014; (c) A real image taken at the moment when Hillary Clinton accidentally stumbled, but it was maliciously interpreted as evidence of Clinton’s failing health

1.3 Organization

The remainder of this chapter is organized as follows. In Sect. 2, we introduce available visual features for fake news detection. We then present existing approaches that utilize visual content to detect fake news in Sect. 3. In Sect. 4, we discuss several challenging problems for multimedia fake news detection. Finally, we summarize available data repositories, tools (or software systems) and relevant competitions related to multimedia fake news detection research in the appendix.

2 What Visual Content Tells?

Visual content has been shown to be an important promoter of fake news propaganda.Footnote 9 At the same time, visual content also provides abundant cues for detecting fake news. To capture the distinctive characteristics of fake news, existing works extract visual features from visual content (generally, images and videos), which can be categorized into four types: forensics features, semantic features, statistical features and context features.

2.1 Forensics Features

Since the addressed problem is the verification of multimedia posts, one reasonable approach is to directly verify the truth of the visual content, i.e., whether the image or video was captured in the event. Intuitively, if the visual content has undergone manipulation or severe re-compression, or is generated by deep learning techniques, the news post it belongs to is likely to be fake. To assess the authenticity, (blind) forensics features, which can highlight the digital editing traces of the visual content, are exploited in fake news detection from different perspectives, including manipulation detection, generation detection and re-compression detection.

2.1.1 Manipulation Detection

Manipulation detection aims at looking for patterns or discontinuities left by operations such as splicing, copy-move and removal. Splicing refers to copying a part of one image and inserting it into another, while copy-move and removal both happen within the same image. Because very few works [3] have directly used these features in fake news detection yet, we also investigated the features mentioned in related works and summarize them as follows:

  • Camera-related features are particular patterns caused by the imaging pipeline, such as the sensor pattern noise and color filter array interpolation patterns, which can be destroyed by manipulation. In previous works, Photo-Response Non-Uniformity [11], noise inconsistencies [26] and local interpolation artifacts [10] were used to capture the change of those patterns.

  • Discontinuities in spatial features are often left by forgery operations. To highlight these cues, gray-level run length features [47] and local binary patterns over the steerable-pyramid-transformed image [28] were exploited.

Note that some of these features are only applicable to specific types of manipulation, which are unknown in practice. Also, some widely-spread manipulated images may have undergone multiple types of processing, increasing the difficulty of capturing the traces of manipulation.
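To make the spatial forensic features above more concrete, the following minimal sketch computes a uniform local binary pattern (LBP) histogram of an image with scikit-image. Note that the cited work [28] applies LBP to steerable-pyramid subbands rather than the raw image, so this is only an illustrative simplification; the helper name and parameter choices are assumptions.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
from skimage.io import imread

def lbp_histogram(image_path: str, radius: int = 1, n_points: int = 8) -> np.ndarray:
    """Uniform LBP histogram of a grayscale image, a simple spatial texture
    descriptor in the spirit of the forensic features listed above."""
    gray = (rgb2gray(imread(image_path)) * 255).astype(np.uint8)
    lbp = local_binary_pattern(gray, n_points, radius, method="uniform")
    # Uniform LBP with P points takes values 0 .. P+1, hence P+2 bins.
    hist, _ = np.histogram(lbp, bins=n_points + 2, range=(0, n_points + 2), density=True)
    return hist
```

Such a histogram would then be fed, possibly together with other forensic features, into a standard classifier.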

2.1.2 Generation Detection

With the rapid improvement of deep generative networks (especially the generative adversarial network, GAN [12]), people can easily generate photorealistic images and videos that are hard to distinguish from natural ones. These misleading generated images and videos are often obtained by modifying semantically focused elements, for instance, faces (mostly of celebrities), raising new threats to the trustworthiness of visual content.

For generated fake images, existing works mostly focus on detection with signal-level features. In the pixel domain, co-occurrence matrices computed on the three color channels were used to capture spatial correlation characteristics and then fed into a convolutional neural network (CNN) for detection [29]. In contrast, McCloskey et al. started from the observation in the frequency domain that GAN images have more overlapping spectral responses among the RGB channels and negative weights than natural ones [27]. To represent these differences, this work introduced intensity noise histograms and over-/under-exposure rates.
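As an illustration of the pixel-domain idea in [29], the sketch below builds per-channel horizontal co-occurrence matrices with NumPy; the pairing direction, the normalization and the downstream CNN classifier are simplifying assumptions rather than the authors' exact pipeline.

```python
import numpy as np

def channel_cooccurrence(channel: np.ndarray, levels: int = 256) -> np.ndarray:
    """Horizontal co-occurrence matrix of one 8-bit color channel."""
    left, right = channel[:, :-1].ravel(), channel[:, 1:].ravel()
    mat = np.zeros((levels, levels), dtype=np.float32)
    np.add.at(mat, (left, right), 1.0)   # count horizontally adjacent value pairs
    return mat / mat.sum()

def cooccurrence_tensor(rgb_image: np.ndarray) -> np.ndarray:
    """Stack per-channel co-occurrence matrices into a 3 x 256 x 256 tensor
    that a small CNN classifier could take as input."""
    return np.stack([channel_cooccurrence(rgb_image[..., c]) for c in range(3)])
```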

For generated fake videos, most works are devoted to the detection of DeepFakes, a series of popular implementations for superimposing existing faces onto source videos. Works on DeepFakes detection mostly focus on local artifacts caused by the transformations in face-swapping, such as the lack of realistic eye blinking [22], the errors in 3D head poses introduced by face splicing [44], and the artifacts left by warping to match the original faces [23].

2.1.3 Re-compression Detection

A fake image or video typically undergoes multiple compressions in two situations: either the visual content is manipulated and then re-saved, or it is repeatedly downloaded from and uploaded to social media platforms. These two situations probably indicate deliberate manipulation of the visual content or misuse of outdated content, so we can detect fake news by predicting whether the attached visual content has been re-compressed.

For images, the MediaEval VMU Task [3] (see the Appendix) extracted features directly related to compression according to [2, 21], including the probability map of aligned/non-aligned double JPEG compression, the potential primary quantization steps for the first 6 Discrete Cosine Transform (DCT) coefficients under aligned/non-aligned double JPEG compression, and the block artifact grid. By thresholding the aligned/non-aligned double JPEG compression maps, Boididou et al. created two binary maps, considered as object and background respectively, and extracted descriptive statistics (maximum, minimum, mean, median, most frequent value, standard deviation and variance) for classification [4]. Qi et al. calculated block DCT coefficients and then performed a Fourier transform on them to highlight the periodicity in the frequency domain caused by re-compression [30]. Furthermore, because repeated spreading may cause a dramatic decrease in clarity, no-reference quality measurement [41] can also indicate re-compression.
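The following sketch illustrates the spirit of the block-DCT approach of Qi et al. [30]: histogram one block-DCT coefficient over all 8x8 blocks and inspect the Fourier magnitude of that histogram, where periodic peaks hint at double quantization. The choice of a single coefficient, the histogram range and the absence of the subsequent CNN are assumptions made for brevity.

```python
import numpy as np
from scipy.fft import dct

def block_dct_fft_spectrum(gray: np.ndarray, block: int = 8) -> np.ndarray:
    """Fourier magnitude of the histogram of the (0, 1) block-DCT coefficient;
    periodic peaks in this spectrum are a re-compression cue."""
    h = gray.shape[0] // block * block
    w = gray.shape[1] // block * block
    coeffs = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            b = gray[i:i + block, j:j + block].astype(np.float64)
            d = dct(dct(b, axis=0, norm="ortho"), axis=1, norm="ortho")  # 2-D DCT-II
            coeffs.append(d[0, 1])
    hist, _ = np.histogram(coeffs, bins=256, range=(-128, 128))
    return np.abs(np.fft.fft(hist))   # inspect for peaks away from the DC component
```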

For videos, existing methods exploit the presence of spikes in the Fourier transform of the energy of the displaced frame difference over time [37], blocking artifacts [24] and the DCT coefficients of macroblocks [38] to detect double compression (mostly in MPEG videos).

2.2 Semantic Features

Fake news exploits the individual vulnerabilities of people and thus often relies on sensational or even fake images to provoke anger or other emotional responses from consumers and promote its spread. Thus, images in fake news often show distinct characteristics compared with those in real news at the semantic level, such as visual impact [16] and emotional provocation [33, 36], as Fig. 2 shows. Next, we introduce how to effectively extract semantic features of the visual content for fake news detection.

Fig. 2

Comparison of images in fake and real news at the semantic level. We can find that fake news images are more visually striking and emotionally provocative than real news images, even though they describe the same type of events, such as fire (a), earthquake (b) and road collapse (c)

CNNs have exhibited great power in understanding image semantics and obtaining corresponding feature representations, which can be used for various visual tasks. VGG [34] is one of the most popular CNN models, comprising three basic types of layers: convolutional layers for extracting and transforming image features, pooling layers for reducing the number of parameters, and fully connected layers for classification (see Fig. 3). Most existing works based on multimedia content adopt the VGG model to extract visual semantic features for fake news detection [9, 15, 40].

Fig. 3

Detailed architecture of the VGG16 framework
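As a reference point, the snippet below shows one common way to obtain 4096-dimensional VGG16 features with torchvision; the `weights` argument assumes a recent torchvision release, the helper `extract_visual_feature` is hypothetical, and the exact layer used as the visual semantic feature varies across the cited works.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing; 224 x 224 inputs match VGG16.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg16.eval()
# Drop the final 1000-way classification layer and keep the 4096-d activations.
feature_head = torch.nn.Sequential(*list(vgg16.classifier.children())[:-1])

def extract_visual_feature(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv = vgg16.features(img)
        pooled = vgg16.avgpool(conv).flatten(1)
        return feature_head(pooled)   # shape (1, 4096)
```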

In addition to the basic CNN, some recent works proposed novel CNN-based models to better capture the visual semantic characteristics of fake news. For example, Qi et al. proposed a multi-domain visual neural network (MVNN) to fuse visual information from the frequency and pixel domains for detecting fake news, in which the pixel sub-network is used to extract visual semantic features (see Fig. 4) [30]. Specifically, two motivations drove the model design. First, a CNN learns high-level semantic representations through layer-by-layer abstraction from the local to the global view, while low-level features inevitably suffer some losses in the process of abstraction. Considering that semantic cues such as emotional provocation are related to many visual factors from low level to high level [19], a multi-branch CNN is adopted to extract features of different semantic levels in the pixel sub-network. Second, there are strong bidirectional dependencies between different levels of features. For example, middle-level features such as textures consist of low-level features such as lines, and meanwhile compose high-level features such as objects. Therefore, the sub-network also utilizes a bidirectional GRU to model these relations from two different views.

Fig. 4

Detailed architecture of the pixel domain sub-network in MVNN. For an input image, a multi-branch CNN-RNN network is utilized to extract and fuse its pixel-domain features of different semantic levels
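The following is a rough sketch of such a multi-level pixel-domain network: several convolutional blocks yield features at different semantic levels, which a bidirectional GRU then relates across levels. All layer sizes are illustrative assumptions and do not reproduce the original MVNN configuration.

```python
import torch
import torch.nn as nn

class PixelSubNetSketch(nn.Module):
    """Sketch of an MVNN-style pixel-domain branch: per-level CNN features
    fused by a bidirectional GRU across semantic levels."""
    def __init__(self, level_dim: int = 32, num_levels: int = 3):
        super().__init__()
        chans = [3, 16, 32, 64]
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for i in range(num_levels)
        ])
        # One projection per level so every block's output maps to the same dimension.
        self.proj = nn.ModuleList([nn.Linear(chans[i + 1], level_dim)
                                   for i in range(num_levels)])
        self.gru = nn.GRU(level_dim, level_dim, batch_first=True, bidirectional=True)

    def forward(self, img: torch.Tensor) -> torch.Tensor:   # img: (B, 3, H, W)
        feats, x = [], img
        for block, proj in zip(self.blocks, self.proj):
            x = block(x)
            feats.append(proj(x.mean(dim=(2, 3))))   # global-average-pool each level
        seq = torch.stack(feats, dim=1)               # (B, num_levels, level_dim)
        out, _ = self.gru(seq)                        # bidirectional relations across levels
        return out.mean(dim=1)                        # fused pixel-domain representation
```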

2.3 Statistical Features

Visual content also exhibits different distribution patterns between fake and real news on social media [17]. Intuitively, people tend to report news with images taken by themselves at the event scene. If the event is real, various images taken by different witnesses will be posted; if it is fake, many repeatedly posted images with almost the same content appear, as Fig. 5 shows. Thus, we introduce visual statistical features to reflect this distributional difference between real and fake news.

Fig. 5

Examples of images in a real and a fake news event. Obviously, images in the real news event (a) are much more diverse than those in the fake one (b)

Some works [17, 42, 43] used basic statistical features of the attached images to assist in fake news detection, usually from three aspects:

  • Count: The number of attached images. For example, Wu et al. used the number of illustrations to assist in detecting fake news posts [42, 43], while Jin et al. used the ratio of news posts containing at least one (or more than one) image to the total posts in a news event to detect fake news events [17].

  • Popularity: The number of shares on social media, such as re-tweets and comments. Jin et al. defined an image with high popularity as a hot image and regarded the ratio of hot images to all distinct images in a news event as a statistical feature [17].

  • Type: Some images have a particular type of resolution or style; for example, long images have a very large length-to-width ratio. The ratio of such images was also counted as a statistical feature [17].

In addition to these basic statistical features, Jin et al. also proposed five advanced statistical features as follows [17] (a minimal computational sketch of several of these scores is given after the list):

  • Visual Clarity Score (VCS): The visual clarity score measures the distribution difference between two image sets: the image set of a certain news event (event set) and the image set containing images from all events (collection set). This feature is defined as the Kullback-Leibler divergence between the two language models representing the collection set and the event set, respectively. A bag-of-words image representation, such as one built on SIFT descriptors, is used to define the language models for images. Specifically, the visual clarity score is

    $$\displaystyle \begin{aligned} VCS=D_{KL}(p(w | c) \| p(w | k)), \end{aligned} $$
    (1)

    where p(w|c) and p(w|k) denote the term frequency of visual word w in the collection set and the event set, respectively.

  • Visual Coherence Score (VCoS): The visual coherence score measures how coherent the images in a certain news event are. This feature is computed based on the visual similarity between every pair of images in the target event image set, and is denoted as

    $$\displaystyle \begin{aligned} VCoS=\frac{1}{N(N-1)} \sum_{i, j=1, \cdots, N; i \neq j} \textit{sim}\left(x_{i}, x_{j}\right) \end{aligned} $$
    (2)

    where N is the number of images in the event set, and \( \textit{sim}\left(x_{i}, x_{j}\right) \) is the visual similarity between image \( x_{i} \) and image \( x_{j} \). In implementation, the similarity between images is computed based on their GIST features.

  • Visual Similarity Distribution Histogram (VSDH): The visual similarity distribution histogram describes the image similarity distribution at a fine-grained level, and is computed based on the whole similarity matrix of all images in a target news event. The similarity matrix S is quantized into an H-bin histogram by mapping each element of the matrix into its corresponding bin, which results in an H-dimensional feature vector representing the similarity relations among images,

    $$\displaystyle \begin{aligned} VSDH(h)=\frac{1}{N^{2}}\left|\left\{(i, j) \mid i, j \leq N,\ S_{i, j} \in h\text{-th bin}\right\}\right|, \quad h=1, \ldots, H \end{aligned} $$
    (3)
  • Visual Diversity Score (VDS): The visual diversity score measures the visual difference among the images of a target news event. Assuming a ranking of images \( x_{1}, x_{2}, \ldots, x_{N} \) in the event image set R, the diversity score of all images in R is

    $$\displaystyle \begin{aligned} \mathrm{VDS}=\sum_{i=1}^{N} \frac{1}{i} \sum_{j=1}^{i} (1-\textit{sim}\left(x_{i}, x_{j}\right)) \end{aligned} $$
    (4)

    In implementation, images are ranked according to their popularity on social media, based on the assumption that popular images may better represent the news event.

  • Visual Clustering Score (VCS): The visual clustering score evaluates the distribution of all images in the news event from a clustering perspective. It is defined as the number of clusters formed by all images in a target news event, where the hierarchical agglomerative clustering (HAC) algorithm is employed to cluster the images.
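As a minimal sketch (assuming an N x N image-similarity matrix, e.g., built from GIST features, with rows ordered by popularity), the coherence, similarity-distribution and diversity scores above can be computed as follows:

```python
import numpy as np

def visual_coherence_score(sim: np.ndarray) -> float:
    """VCoS: mean pairwise similarity over an N x N similarity matrix (Eq. 2)."""
    n = sim.shape[0]
    off_diag = sim.sum() - np.trace(sim)   # exclude i == j terms
    return float(off_diag / (n * (n - 1)))

def visual_similarity_histogram(sim: np.ndarray, num_bins: int = 10) -> np.ndarray:
    """VSDH: H-bin histogram of all entries of the similarity matrix (Eq. 3)."""
    hist, _ = np.histogram(sim, bins=num_bins, range=(0.0, 1.0))
    return hist / sim.size

def visual_diversity_score(sim: np.ndarray) -> float:
    """VDS over popularity-ranked images; sim must follow that ranking (Eq. 4)."""
    n = sim.shape[0]
    return float(sum((1.0 / i) * sum(1.0 - sim[i - 1, j - 1] for j in range(1, i + 1))
                     for i in range(1, n + 1)))
```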

2.4 Context Features

According to our previous analysis, rumormongers usually use visual content from an irrelevant event to fabricate fake news. To make the fake news appear plausible, the selected visual content needs to be semantically coherent with the claim; therefore, existing works on text-image semantic similarity are not applicable to these manipulations. Instead, one of the most effective methods is to utilize the context information of the visual content to check whether the current event is the same as the original event it belongs to. Specifically, we introduce the following context features, which are mainly extracted from two sources: the metadata of the visual content and external knowledge such as relevant web pages.

2.4.1 Metadata

Metadata is textual information pertaining to an image/video file that is usually embedded in the file. Metadata includes not only details about the image/video itself, such as file size, but also information about its production, such as position and time, which are often used in manual fact-checking [7, 45]. However, these features are not very helpful in practice because they usually become unavailable after the default processing performed by social media platforms.

2.4.2 External Knowledge

In addition to metadata, some works extract context features from external knowledge obtained through reverse image search. In contrast to classical image search, reverse image search takes an image as input and returns relevant web pages that include the corresponding image, title, description and time. This process can be easily automated and applied to a large number of images via search engine APIs such as Google reverse image search.Footnote 10 Next, we introduce three context features (a minimal sketch computing two of them follows the list):

  • Timespan: The timespan is defined as the delay between the publication time of the news and the earliest publication time of the visual content. This feature is proposed to verify the originality of the visual content [35]. If the timespan is larger than a specific threshold, the visual content is probably from an irrelevant event.

  • Inter-claim similarity: The inter-claim similarity is defined as the similarity between the claim and the textual content of the crawled web pages. Considering that the text of these crawled pages is helpful for understanding the original event of the image, this feature is used to verify the event consistency between the textual claim and the corresponding visual content [48].

  • Platform credibility: Platform credibility refers to the credibility of the source platform where the visual content was published [48]. By using the dataset of Media Bias/Fact Check (MBFC),Footnote 11 a website that provides factuality information about 2700+ media sources, each web page returned by the reverse image search is classified into one of the following categories: high factuality, low factuality and mixed factuality. The percentage of web pages from each category returned by the reverse image search is defined as the platform credibility feature.
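A minimal sketch of the timespan and platform-credibility features is given below; the factuality labels and the retrieved-page metadata are hypothetical placeholders for whatever a reverse image search and an MBFC lookup would actually return.

```python
from datetime import datetime

def timespan_days(post_time: datetime, page_times: list[datetime]) -> float:
    """Timespan: delay between the news post and the earliest appearance of the
    image on the web, in days; a large value suggests a reused image."""
    return (post_time - min(page_times)).total_seconds() / 86400.0

def platform_credibility(page_factuality: list[str]) -> dict[str, float]:
    """Fraction of retrieved pages per factuality label; the label strings are
    illustrative placeholders for the MBFC categories."""
    total = len(page_factuality)
    return {label: page_factuality.count(label) / total
            for label in ("high", "low", "mixed")}
```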

3 How Visual Content Helps?

In the previous section, we introduced four types of visual features from different perspectives, i.e., forensics features, semantic features, statistical features and context features, for multimedia fake news detection. These features reflect the characteristics of visual content and are usually combined in practice to cover more situations. In this section, we discuss the details of several existing approaches that utilize visual content to detect fake news, which can be broadly classified into content-based and knowledge-based approaches. Content-based approaches focus on capturing and combining cues from the content of different modalities for fake news detection, without using any reference datasets. Knowledge-based approaches aim to use external sources to fact-check input claims. They assume the existence of a relatively large reference dataset and assess the integrity of a news post by comparing it to one or more posts retrieved from the reference dataset.

3.1 Content-Based Approaches

A complete news story consists of textual and visual content simultaneously, both of which provide distinctive cues for detecting fake news. Therefore, recent works on this problem focus on utilizing and effectively fusing information from multiple modalities. Mostly, these works use a common recurrent neural network (RNN) and a pre-trained CNN to obtain textual and visual semantic features. Next, we introduce three state-of-the-art approaches that fuse multi-modal information for fake news detection.

Jin et al. [15] first incorporated multi-modal content via deep neural networks to solve the fake news detection problem. They proposed an innovative RNN with an attention mechanism (attRNN, see Fig. 6a) for effectively fusing textual, visual and social context features. For a given tweet, its text and social context are first fused with an LSTM to form a joint representation. This representation is then fused with visual features extracted by a pre-trained deep CNN. The output of the LSTM at each time step is employed as neuron-level attention to coordinate the visual features during the fusion.

Fig. 6

Architectures of three state-of-the-art multi-modal models for fake news detection. (a) attRNN (b) EANN (c) MVAE
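To make the neuron-level attention fusion of attRNN more concrete, here is a rough sketch in PyTorch; the pooling, gating form and layer sizes are illustrative assumptions rather than the original architecture.

```python
import torch
import torch.nn as nn

class AttRNNFusionSketch(nn.Module):
    """Sketch of attRNN-style fusion: an LSTM encodes the text (plus social
    context), and its pooled output gates a projected visual feature vector
    element-wise before the joint representation is classified."""
    def __init__(self, vocab_size=20000, emb_dim=100, hid_dim=128, vis_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.classifier = nn.Linear(2 * hid_dim, 2)

    def forward(self, token_ids: torch.Tensor, vgg_feat: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(token_ids))           # (B, T, hid)
        text_repr = h.mean(dim=1)                         # pooled text/social representation
        attn = torch.sigmoid(text_repr)                   # neuron-level attention weights
        vis = attn * torch.tanh(self.vis_proj(vgg_feat))  # attention-coordinated visual features
        return self.classifier(torch.cat([text_repr, vis], dim=1))
```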

Wang et al. [40] proposed an end-to-end event adversarial neural network (EANN, see Fig. 6b) to detect fake news on newly emerged events based on event-invariant multi-modal features. It consists of three main components: a multi-modal feature extractor, a fake news detector and an event discriminator. The multi-modal feature extractor is responsible for extracting textual and visual features from posts and cooperates with the fake news detector to learn discriminative representations for fake news detection. The role of the event discriminator is to remove event-specific features and keep the features shared among events.
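The event invariance in EANN comes from adversarial training through a gradient reversal layer between the shared features and the event discriminator; the sketch below shows that mechanism in isolation (the fused feature extractor, the event count and the loss weighting are assumptions).

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated (scaled)
    gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class EANNHeadsSketch(nn.Module):
    """Fake news detector and event discriminator on top of a shared fused feature."""
    def __init__(self, feat_dim: int = 96, num_events: int = 10, lamb: float = 1.0):
        super().__init__()
        self.lamb = lamb
        self.fake_head = nn.Linear(feat_dim, 2)            # fake news detector
        self.event_head = nn.Linear(feat_dim, num_events)  # event discriminator

    def forward(self, fused_feat: torch.Tensor):
        fake_logits = self.fake_head(fused_feat)
        # Reversed gradients push the shared features to become event-invariant.
        event_logits = self.event_head(GradReverse.apply(fused_feat, self.lamb))
        return fake_logits, event_logits
```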

Khattar et al. [9] utilized a multi-modal variational autoencoder (MVAE, see Fig. 6c) trained jointly with a fake news detector to learn a shared representation of textual and visual information. The model consists of three main components: an encoder, a decoder and a fake news detector module. The variational autoencoder learns a probabilistic latent variable model by optimizing a bound on the marginal likelihood of the observed data. The fake news detector then utilizes the multi-modal representation obtained from the bimodal variational autoencoder to classify posts as fake or not.
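A compact sketch of the MVAE idea, i.e., a bimodal variational autoencoder whose latent code also feeds a fake news classifier, is shown below; all dimensions and the equal weighting of the loss terms are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVAESketch(nn.Module):
    """Bimodal VAE with a fake-news classifier on the shared latent code."""
    def __init__(self, text_dim=300, img_dim=4096, latent_dim=64):
        super().__init__()
        self.text_enc = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)
        self.text_dec = nn.Linear(latent_dim, text_dim)
        self.img_dec = nn.Linear(latent_dim, img_dim)
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, text_feat, img_feat):
        h = torch.cat([self.text_enc(text_feat), self.img_enc(img_feat)], dim=1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.text_dec(z), self.img_dec(z), self.classifier(z), mu, logvar

def mvae_loss(model, text_feat, img_feat, label):
    """Reconstruction + KL + detection loss (equal weights assumed for brevity)."""
    t_rec, i_rec, logit, mu, logvar = model(text_feat, img_feat)
    rec = F.mse_loss(t_rec, text_feat) + F.mse_loss(i_rec, img_feat)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    bce = F.binary_cross_entropy_with_logits(logit.squeeze(1), label.float())
    return rec + kl + bce
```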

3.2 Knowledge-Based Approaches

Real-world multimedia news is often composed of multiple modalities, such as an image or a video with associated text and metadata, where the information about an event is only incompletely captured by each modality separately. Such multimedia data packages, i.e., tuples of the multi-modal information of a post, are prone to manipulation, where a subset of these modalities can be modified to misrepresent or repurpose the package. However, the manipulated details are subtle and often interleaved with the truth, so content-based approaches can hardly detect these manipulations. Faced with this problem, knowledge-based approaches utilize external sources, namely a reference dataset of unmanipulated packages serving as world knowledge, to help verify the semantic integrity of multimedia news. In the following, we introduce some representative knowledge-based methods.

Jaiswal et al. [13] first formally defined the multimedia semantic integrity assessment problem and combined deep multi-modal representation learning with outlier detection methods to assess whether a caption is consistent with the image in its package (see Fig. 7). Data packages in the reference dataset are used to train a deep multi-modal representation learning model, which is then used to assess the integrity of query packages by calculating image-caption consistency scores and employing outlier detection models to estimate their inlierness with respect to the reference dataset.

Fig. 7

The package integrity assessment system of [13]
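A minimal sketch of this two-step recipe, i.e., a cosine image-caption consistency score followed by an outlier model fitted on reference-set scores, is given below; the embeddings and reference scores are placeholders, and [13] uses its own learned representations and outlier models.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def consistency_score(img_emb: np.ndarray, cap_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a caption embedding,
    both assumed to come from a jointly trained multi-modal model."""
    denom = np.linalg.norm(img_emb) * np.linalg.norm(cap_emb) + 1e-8
    return float(img_emb @ cap_emb / denom)

# Fit the outlier model on consistency scores of trusted reference packages,
# then flag query packages whose scores look anomalous.
reference_scores = np.random.rand(1000, 1)   # placeholder for real reference-set scores
detector = IsolationForest(random_state=0).fit(reference_scores)

def is_manipulated(query_score: float) -> bool:
    return detector.predict(np.array([[query_score]]))[0] == -1   # -1 marks an outlier
```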

Similarly, Sabir et al. [31] proposed a novel deep multi-modal model (see Fig. 8) to verify the integrity of multimedia packages. The proposed model consists of four modules: (1) feature extraction, (2) feature balancing, (3) package evaluation and (4) integrity assessment. For each query package, the model first uses similarity scoring to retrieve a package from the reference dataset, taking the query package and the top-1 related package as its input. After passing through the feature extraction and balancing modules, the query and retrieved packages are transformed into a single feature vector. The package evaluation module, the core of the proposed model, consists of a related-package sub-module and a single-package sub-module. The related-package sub-module consists of two siamese networks: the first is a relationship classifier that verifies whether the query package and the top-1 package are indeed related, while the second is a manipulation detector that determines whether the query package is a manipulated version of the top-1 retrieved package. Since manipulation detection depends on the relatedness of the two packages, the relationship classifier controls a forget gate that scales the feature vector of the manipulation detector according to the relatedness between the two packages. Meanwhile, the single-package sub-module verifies the coherency (i.e., integrity) of the query package alone. The integrity assessment module concatenates the feature vectors from both sub-modules for manipulation classification.

Fig. 8

The package integrity assessment model of [31]

One of the main challenges in developing multimedia semantic integrity assessment methods is the lack of training and evaluation data. In light of this, Jaiswal et al. [14] proposed a novel framework, Adversarial Image Repurposing Detection (AIRD, see Fig. 9), for image repurposing detection, which can be trained in the absence of training data containing manipulated metadata. AIRD simulates the real-world adversarial interplay between a bad actor who repurposes images with counterfeit metadata and a watchdog who verifies the semantic consistency between images and their accompanying metadata. More specifically, AIRD consists of two models, a counterfeiter and a detector, which are trained in an adversarial manner. While the detector gathers evidence from the reference set, the counterfeiter exploits it to conjure convincingly deceptive fake metadata for a given query package.

Fig. 9

Architecture of adversarial image repurposing detection (AIRD)

4 Challenging Problems

In the previous sections, we introduced several visual features and existing approaches based on visual content for effective fake news detection. Despite the research progress on the multimedia fake news detection problem, there are still some specific challenges that need to be considered.

One major challenge is the lack of labeled data. Although multimedia content is growing rapidly nowadays, datasets for multimedia fake news are scarce, which hinders the development of this research field. To tackle this challenge, on the one hand, we encourage researchers to pay more attention to constructing and releasing high-quality labeled datasets. On the other hand, it is important to study multimedia fake news detection in a weakly supervised setting, i.e., with limited or no labeled data for training. For example, Jin et al. [16] constructed a large-scale weakly-labeled auxiliary dataset to overcome the data scarcity issue and proposed a domain-transferred deep CNN to detect fake news images.

Another critical challenge is the explainability of fake news detection, i.e., why a model determines a particular piece of news to be fake. Although computational detection of fake news has produced some promising results, the explainability of such detection remains largely unsolved, making the judgments unconvincing. In recent years, fact-checking approaches have attracted the attention of researchers and offer a new way to tackle this challenge. Different from traditional style-based fake news detection, these approaches utilize external resources (also known as knowledge) as evidence to fact-check whether a given piece of news is fake or real. For multimedia content, the relationship among the textual content, the visual content and the metadata is a powerful clue, which can be combined with external knowledge to make inferences. Such approaches help to better understand and explain the decisions made by algorithms through the involved evidence and a visible inference process.