1 Introduction

The explosive use of contemporary social media for communication has witnessed the wide spread of rumors, which can pose a threat to cyber security and social stability. For instance, on April 23rd 2013, a fake news story claiming that two explosions had happened in the White House and that Barack Obama had been injured was posted by a hacked Twitter account belonging to the Associated Press. Although the White House and the Associated Press assured the public minutes later that the report was false, the fast diffusion to millions of users had already caused severe social panic, resulting in a loss of $136.5 billion in the stock market. This incident of a false rumor showcases the vulnerability of social media to rumors, and highlights the practical value of automatically predicting the veracity of information.

Fig. 1. For social media posts regarding a specific event, e.g., “Trump being disqualified from U.S. election”, tokens like “Donald Trump”, “Obama” and “disqualified” appear extremely frequently in disputed postings.

Debunking rumors at their formative stage is particularly crucial to minimizing their catastrophic effects. Most existing rumor detection models employ learning algorithms that incorporate a wide variety of features and formulate rumor detection as a binary classification task. They commonly craft features manually from the content, sentiment [1], user profiles [2], and diffusion patterns of the posts [3,4,5]. Embedding social graphs into a classification model also helps distinguish malicious user comments from normal ones [6, 7]. However, feature engineering is extremely time-consuming, biased, and labor-intensive. Moreover, hand-crafted features are data-dependent, making them incapable of resolving contextual variations across different posts.

Recent examinations of rumors reveal that social posts related to an event under discussion arrive in the form of a time series, wherein users forward or comment on the event continuously over time. Meanwhile, as shown in Fig. 1, during the discussion of arbitrary topics, users’ posts exhibit high duplication in their textual phrases due to repeated forwarding, reviewing, and/or inquiry behavior [8]. This poses the challenge of efficiently distilling distinct information from duplicated content while timely capturing textual variations across posts.

The propagation of information on social media has temporal characteristics, whilst most existing rumor detection methodologies ignore this crucial property or are unable to capture the temporal dimension of the data. One exception is [9], where Ma et al. use an RNN to capture the dynamic temporal signals of rumor diffusion and learn textual representations under supervision. However, as rumor diffusion evolves over time, users tend to comment differently in different stages, for example moving from expressing surprise to questioning, or from believing to debunking. As a consequence, textual features may change their patterns over time, and we need to determine which of them are more important to the detection task. On the other hand, the duplication in textual phrases impedes the efficiency of training a deep network. In this sense, two aspects, the long-term temporal characteristic and dynamic duplication, should be addressed simultaneously in an early rumor detection model.

1.1 Challenges and Our Approach

In summary, there are three challenges in early rumor detection to be addressed: (1) automatically learning representations for rumors instead of using labor-intensive hand-crafted features; (2) the difficulty of maintaining long-range dependencies among variable-length post series to build their internal representations; (3) the issue of high duplication compounded with varied contextual focus. To combat these challenges, we propose a novel deep attention based recurrent neural network (RNN) for the early detection of rumors, namely CallAtRumors (Call Attention to Rumors). The overview of our framework is illustrated in Fig. 2. For each event (i.e., topic), our model converts the posts related to that event into feature matrices. Then, the RNN with a soft attention mechanism automatically learns latent representations by feed-forwarding each input weighted by attention weights. Finally, an additional hidden layer with a sigmoid activation function uses the learned latent representations to classify whether the event is a rumor or not.

Fig. 2. Schematic overview of our framework.

1.2 Contributions

The main contributions of our work are summarized in three aspects:

  • We propose a deep attention neural network that learns to perform early rumor detection automatically. The model is capable of learning continuous hidden representations by capturing long-range dependencies and contextual variations of posting series.

  • The deterministic soft-attention mechanism is embedded into the recurrence to enable distinct feature extraction from highly duplicated content and an importance focus that shifts over time.

  • We quantitatively validate the effectiveness of attention in terms of detection accuracy and earliness by comparing with state-of-the-art methods on two real social media datasets: Twitter and Weibo.

2 Related Work

Our work is closely connected with early rumor detection and the attention mechanism. We briefly introduce these two aspects in this section.

2.1 Early Rumor Detection

The problem of rumor detection [10] can be viewed as a binary classification task, where the extraction and selection of discriminative features significantly affects the performance of the classifier. Hu et al. first conducted a study analyzing the sentiment differences between spammers and normal users and then presented an optimization formulation that incorporates sentiment information into a novel social spammer detection framework [11]. The propagation patterns of rumors were exploited by Wu et al., who utilized a message propagation tree, where each node represents a text message, to classify whether the root of the tree is a rumor or not [3]. In [4], a dynamic time series structure was proposed to capture temporal features based on the time series context information generated during every rumor’s life-cycle. However, these approaches require daunting manual effort in feature engineering and are restricted by their data structures.

Early rumor detection aims to detect viral rumors in their formative stages in order to take early action [12]. In [8], some very rare but informative enquiry phrases play an important role in feature engineering; combined with clustering and a classifier on the clusters, they shorten the time needed to spot rumors. Manually defined features have also shown their importance in the research on real-time rumor debunking by Liu et al. [5]. By contrast, Wu et al. proposed a sparse learning method that automatically selects discriminative features and trains the classifier for emerging rumors [13]. As those methods neglect the temporal traits of social media data, a time-series based feature structure [4] was introduced to seize context variation over time. Recently, recurrent neural networks were first introduced to rumor detection by Ma et al. [9], utilizing sequential data to spontaneously capture the temporal textual characteristics of rumor diffusion, which helps detect rumors earlier and more accurately. However, without abundant data with distinguishable contents in the early stage of a rumor, the performance of these methods drops significantly because they fail to distinguish important patterns.

2.2 Attention Mechanism

As a rising technique in natural language processing [14, 15] and computer vision [16,17,18], the attention mechanism has shown considerable discriminative power for neural networks. For instance, Bahdanau et al. extended the basic encoder-decoder architecture of neural machine translation with an attention mechanism that allows the model to automatically search for the parts of a source sentence relevant to predicting a target word [19], achieving comparable performance in the English-to-French translation task. Vinyals et al. improved the attention model in [19]: their model computes an attention vector reflecting how much attention should be put on each input word, boosting performance on large-scale translation [20]. In addition, Sharma et al. applied a location softmax function [21] to the hidden states of the LSTM (Long Short-Term Memory) layer, thus recognizing the more valuable elements in sequential inputs for action recognition. Motivated by these successful applications of the attention mechanism, we find that attention-based techniques can help better detect rumors with regard to both effectiveness and earliness, because they are sensitive to distinctive textual features.

3 CallAtRumors: Early Rumor Detection with Deep Attention Based RNN

In this section, we present the details of our framework with deep attention for classifying social textual events into rumors and non-rumors.

3.1 Problem Statement

Individual posts contain very limited content due to their short length. On the other hand, an event is generally associated with a number of posts making similar claims, and these related posts can easily be collected to describe the event more faithfully. Hence, we are interested in detecting rumors on an aggregate (event) level instead of identifying each single post [9]: sequential posts related to the same topic are batched together to constitute an event, and our model determines whether the event is a rumor or not.

Let \(\varvec{E} = \{E_i\}\) denote a set of given events, where each event \(E_i=\{(p_{i,j},t_{i,j})\}_{j=1}^{n_i}\) consists of all relevant posts \(p_{i,j}\) at time stamp \(t_{i,j}\), and the task is to classify each event as a rumor or not.
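For concreteness, the event-level input can be organized as in the following minimal Python sketch; the class and field names are our own illustration and do not come from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Post:
    text: str         # content of one relevant post p_{i,j}
    timestamp: float  # posting time t_{i,j}, e.g., Unix seconds

@dataclass
class Event:
    event_id: str
    posts: List[Post]  # all relevant posts, ordered by timestamp
    label: int         # ground truth: 1 = rumor, 0 = non-rumor
```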

3.2 Constructing Variable-Length Post Series

Algorithm 1 describes the construction of variable-length post series. To ensure a similar word density for each time step within one event, we group posts into batches according to a fixed post amount N rather than slicing the event time span evenly. Specifically, for every event \(E_i=\{(p_{i,j},t_{i,j})\}_{j=1}^{n_i}\), post series are constructed with variable lengths due to the different numbers of posts relevant to different events. We set a minimum series length Min to maintain the sequential property for all events.

Algorithm 1. Constructing variable-length post series.

To model the different words in the post series, we calculate the tf-idf for the most frequent K vocabulary terms within all posts. Finally, every post is encoded by its tf-idf vector, and a matrix of \({K}{\times }{N}\) for each time step can be constructed as the input of our model. If there are fewer than N posts within an interval, we expand it to the same scale by padding with 0s. Hence, each post series consists of at least Min feature matrices of the same size K (vocabulary feature dimension) \({\times }\) N (number of posts per time step).
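A rough sketch of this construction is given below, using scikit-learn's TfidfVectorizer as one possible tf-idf encoder; the function name and its defaults (taken from the settings in Sect. 4.2) are our own assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_post_series(posts, N=5, Min=2, K=10000):
    """Sketch of Algorithm 1: group an event's chronologically ordered
    posts into batches of N and encode each batch as a K x N tf-idf
    matrix, zero-padding the last batch if it holds fewer than N posts.
    `posts` is a list of raw post strings sorted by timestamp."""
    vectorizer = TfidfVectorizer(max_features=K)
    vectorizer.fit(posts)  # in practice, fit once on the whole corpus
    dim = len(vectorizer.vocabulary_)  # actual K (<= max_features)
    series = []
    for start in range(0, len(posts), N):
        batch = posts[start:start + N]
        tfidf = vectorizer.transform(batch).toarray()  # (len(batch), dim)
        mat = np.zeros((dim, N))                       # pad with 0s
        mat[:, :tfidf.shape[0]] = tfidf.T
        series.append(mat)
    # events shorter than the minimum series length Min are discarded
    return series if len(series) >= Min else None
```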

3.3 Long Short-Term Memory (LSTM) with Deterministic Soft Attention Mechanism

To capture the long-distance temporal dependencies among continuous-time post series, we employ the following Long Short-Term Memory (LSTM) unit, which plays an important role in language sequence modelling and time series processing [22,23,24,25,26], to learn high-level discriminative representations for rumors:

$$\begin{aligned} \begin{aligned}&i_t = \sigma ({U_i}{h_{t-1}} + {W_i}{x_t} + {V_i}{c_{t-1}} + b_i), \\&f_t = \sigma ({U_f}{h_{t-1}} + {W_f}{x_t} + {V_f}{c_{t-1}} + b_f), \\&c_t = f_tc_{t-1} + i_t\tanh ({U_c}{h_{t-1}} + {W_c}{x_t} + b_c), \\&o_t = \sigma ({U_o}{h_{t-1}} + {W_o}{x_t} + {V_o}{c_t} + b_o), \\&h_t = o_t\tanh (c_t), \\ \end{aligned} \end{aligned}$$
(1)

where \(\sigma (\cdot )\) is the logistic sigmoid function, and \(i_t\), \(f_t\), \(o_t\), \(c_t\) are the input gate, forget gate, output gate and cell input activation vector, respectively. In each of them, there are corresponding input-to-hidden, hidden-to-output, and hidden-to-hidden matrices: \(U_{\bullet }\), \(V_{\bullet }\), \(W_{\bullet }\) and the bias vector \(b_{\bullet }\).
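For concreteness, one recurrence step of Eq. (1) could look like the following minimal NumPy sketch; the parameter-dictionary layout is our own convention, and the peephole terms \(V_{\bullet }\) are applied as full matrices, exactly as the equations are written.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eq. (1); `p` maps names such as 'Ui',
    'Wi', 'Vi', 'bi' to NumPy arrays of compatible shapes."""
    i_t = sigmoid(p['Ui'] @ h_prev + p['Wi'] @ x_t + p['Vi'] @ c_prev + p['bi'])
    f_t = sigmoid(p['Uf'] @ h_prev + p['Wf'] @ x_t + p['Vf'] @ c_prev + p['bf'])
    c_t = f_t * c_prev + i_t * np.tanh(p['Uc'] @ h_prev + p['Wc'] @ x_t + p['bc'])
    o_t = sigmoid(p['Uo'] @ h_prev + p['Wo'] @ x_t + p['Vo'] @ c_t + p['bo'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```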

In Eq. (1), the context vector \(x_t\) is a dynamic representation of the relevant part of the social post input at time t. To calculate \(x_t\), we introduce attention weights \(a_t[i],i=1,\ldots ,K\), corresponding to the features extracted at different element positions in a tf-idf matrix \(d_t\). Specifically, at each time stamp t, our model predicts \(a_{t+1}\), a softmax over the K positions, and \(y_t\), a softmax over the binary classes of rumors and non-rumors computed with an additional hidden layer with \(sigmoid(\cdot )\) activations (see Fig. 3). The location softmax [21] is thus applied over the hidden states of the last LSTM layer to calculate \(a_{t+1}\), the attention weight for the next input matrix \(d_{t+1}\):

$$\begin{aligned} a_{t+1}[i] = P(L_{t+1}=i|h_t) = \frac{e^{{W_i}^\top {h_t}}}{\sum _{j=1}^{K}e^{W_j^{\top }h_t}} \qquad i \in 1,\ldots ,K, \end{aligned}$$
(2)

where \(a_{t+1}[i]\) is the attention weight for the i-th element (word index) at time step \(t+1\), \(W_i\) is the weight allocated to the i-th element in the feature space, and \(L_{t+1}\) represents the word index and takes 1-of-K values.

Fig. 3. (a) The attention module computes the current input \(x_t\) as an average of the tf-idf features weighted according to the attention softmax \(a_t\). (b) At each time stamp, the proposed model takes the feature slice \(x_t\) as input, propagates it through stacked layers of LSTM, and predicts the next location weight \(a_{t+1}\). The class label \(y_t\) is calculated at the last time step t.

The attention vector \(a_{t+1}\) consists of K scalar weights, one per feature dimension, representing the importance attached to each word in the input matrix \(d_{t+1}\). Our model is optimized to assign higher focus to words that are believed to be distinctive for learning rumor/non-rumor representations. After calculating these weights, the soft deterministic attention mechanism [19] computes the expected value of the input at the next time step, \(x_{t+1}\), by taking weighted sums over the word matrix at different positions:

$$\begin{aligned} x_{t+1} = \mathbb {E}_{P(L_{t+1}|h_t)}[d_{t+1}] = \sum _{i=1}^{K}a_{t+1}[i] d_{t+1}[i], \end{aligned}$$
(3)

where \(d_{t+1}\) is the input matrix at time step \(t+1\) and \(d_{t+1}[i]\) is the feature vector at the i-th position in \(d_{t+1}\). Thus, Eq. (3) formulates a deterministic attention model by computing a soft attention weighted word vector \(\sum _i a_{t+1}[i] d_{t+1}[i]\). This corresponds to feeding a softly weighted context into the system; the whole model is smooth and differentiable under the deterministic attention, so it can be learned end-to-end by standard back-propagation.
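Taken together, Eqs. (2) and (3) reduce to a softmax over the K word positions followed by a weighted sum of the rows of \(d_{t+1}\); a minimal NumPy sketch (with a max-shift added by us for numerical stability) is:

```python
import numpy as np

def location_softmax(h_t, W):
    """Eq. (2): attention over the K word positions; W stacks the
    per-position weight vectors W_i as rows (shape K x hidden_dim)."""
    scores = W @ h_t
    e = np.exp(scores - scores.max())  # max-shift for numerical stability
    return e / e.sum()                 # a_{t+1}, shape (K,)

def attended_input(a_next, d_next):
    """Eq. (3): x_{t+1} = sum_i a_{t+1}[i] * d_{t+1}[i], where d_{t+1}
    is the K x N tf-idf matrix and d_{t+1}[i] is its i-th row."""
    return a_next @ d_next             # shape (N,)
```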

3.4 Loss Function and Model Training

In model training, we employ the cross-entropy loss coupled with \(l_2\) regularization. The loss function is defined as follows:

$$\begin{aligned} \mathcal {L}=-\sum _{c=1}^C y_{t,c} \log \hat{y}_{t,c} + \gamma \phi ^2, \end{aligned}$$
(4)

where \(y_t\) is the one-hot ground-truth label, \(\hat{y}_t\) contains the predicted binary class probabilities at the last time step t, \(C=2\) is the number of output classes (rumor or non-rumor), \(\gamma \) is the weight decay coefficient, and \(\phi \) represents all the model parameters.
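A minimal sketch of Eq. (4) in NumPy follows; the default \(\gamma = 0.005\) is taken from the settings in Sect. 4.2, and the epsilon guard against \(\log 0\) is our addition.

```python
import numpy as np

def rumor_loss(y_true, y_pred, params, gamma=0.005):
    """Eq. (4): cross-entropy over C = 2 classes plus l2 weight decay.
    y_true is the one-hot label, y_pred the predicted probabilities at
    the last time step, params an iterable of weight arrays."""
    eps = 1e-12  # numerical safety, not part of the original equation
    cross_entropy = -np.sum(y_true * np.log(y_pred + eps))
    weight_decay = gamma * sum(np.sum(w ** 2) for w in params)
    return cross_entropy + weight_decay
```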

The cell state and the hidden state of the LSTM are initialized using the input tf-idf matrices for faster convergence:

$$\begin{aligned} \begin{aligned} c_0=f_c \left( \frac{1}{\tau }\sum _{t=1}^{\tau } \left( \frac{1}{K} \sum _{i=1}^K d_t[i]\right) \right) , \\ h_0=f_h \left( \frac{1}{\tau }\sum _{t=1}^{\tau } \left( \frac{1}{K} \sum _{i=1}^K d_t[i]\right) \right) , \end{aligned} \end{aligned}$$
(5)

where \(f_c\) and \(f_h\) are two multi-layer perceptrons, and \(\tau \) is the number of time steps for each event sequence. These values are used to compute the first location softmax \(a_1\) which determines the initial input \(x_1\).
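The initialization of Eq. (5) can be sketched as follows, assuming \(f_c\) and \(f_h\) are supplied as callables (small MLPs mapping an N-vector to the LSTM state size):

```python
import numpy as np

def init_states(d_series, f_c, f_h):
    """Eq. (5): each K x N matrix d_t is averaged over its K positions
    (rows), the results are averaged over the tau time steps, and the
    mean feature is passed through the MLPs f_c and f_h."""
    mean_feature = np.mean([d.mean(axis=0) for d in d_series], axis=0)
    return f_c(mean_feature), f_h(mean_feature)
```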

4 Experiments

In this section, we evaluate the performance of our proposed methodology in early rumor detection using real-world data collected from two different social media platforms.

4.1 Datasets

We use two public datasets published by [9], collected from Twitter and Sina Weibo respectively. Both datasets are organised at the event level, with the ground truth verified via Snopes and the Sina Community Management Center. In addition, we follow the criteria from [9] to manually gather 4 non-rumors from Twitter and 38 rumors from Weibo for class balancing. Note that for the Twitter dataset, some posts were no longer available when we crawled them, shrinking the data by about 10% compared with the original Twitter dataset; this is the main cause of the slight performance fluctuations relative to the results reported in other papers.

Table 1 gives the statistical details of the two datasets. We observe that more than 76% of users tend to repost the original news with very short comments reflecting their attitudes towards the news. As a consequence, the contents of the posts related to one event are largely duplicated, which is rather challenging for early rumor detection.

Table 1. Statistical details of datasets. PPE stands for posts per event.

4.2 Settings and Baselines

The model is implemented using Tensorflow. All parameters are set using cross-validation. To generate the input variable-length post series, we set the number of posts N for each time step to 5 and the minimum post series length Min to 2. We selected the K = 10,000 top words for constructing the tf-idf matrices. We randomly split our datasets with ratios of 70%, 10% and 20% for training, validation and test respectively. We apply a three-layer LSTM model with descending numbers of hidden units (1,024, 512 and 128). The learning rate is set to 0.001 and \(\gamma \) is set to 0.005. Our model is trained with back-propagation [27] using the Adam optimizer [28]. We iterate the whole training process until the loss converges.
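For reference, these hyperparameters can be gathered in a single configuration sketch; the key names are our own and not from the released implementation.

```python
# Hyperparameters from Sect. 4.2, collected in one place.
CONFIG = {
    "posts_per_step_N": 5,
    "min_series_length_Min": 2,
    "vocab_size_K": 10_000,
    "split_train_val_test": (0.7, 0.1, 0.2),
    "lstm_hidden_units": (1024, 512, 128),  # three stacked LSTM layers
    "learning_rate": 1e-3,
    "weight_decay_gamma": 5e-3,
    "optimizer": "Adam",
}
```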

We evaluate the effectiveness and efficiency of CallAtRumors by comparing with the following state-of-the-art approaches in terms of precision and recall:

  • DT-Rank [8]: This is a decision-tree based ranking model using enquiry phrases, which identifies trending rumors by recasting the problem as finding entire clusters of posts whose topic is a disputed factual claim.

  • SVM-TS [4]: SVM-TS captures temporal characteristics from contents, users and propagation patterns based on the time series of a rumor’s life-cycle, with a time series modelling technique applied to incorporate various social context information.

  • LK-RBF [12]: We choose this link-based approach and combine it with the RBF (Radial Basis Function) kernel as a supervised classifier because it achieved the best performance in their experiments.

  • ML-GRU [9]: This method utilizes basic recurrent neural networks for early rumor detection. Following the settings in their work, we choose the multi-layer GRU (gated recurrent unit) as it performs the best in the experiment.

  • CERT [13]: This is a cross-topic emerging rumor detection model which can jointly cluster data, select features and train classifiers by using the abundant labeled data from prior rumors to facilitate the detection of an emerging rumor.

4.3 Effectiveness and Earliness Analysis

In this experiment, we take different ratios of posts, starting from the first post of each event, for model training, ranging from 10% to 80%, in order to test how early CallAtRumors can successfully detect rumors when only a limited amount of posts is available. By incrementally adding training data in chronological order, we are able to estimate the time at which our method can detect emerging rumors. The results on earliness are shown in Fig. 4. At the early stage, with 10% to 60% of the training data, CallAtRumors outperforms four comparative methods by a noticeable margin. In particular, compared with the most relevant method, ML-GRU, with the data proportion ranging from 10% to 20%, CallAtRumors outperforms ML-GRU by 5% in precision and 4% in recall on both the Twitter and Weibo datasets. This result shows that the attention mechanism is more effective in early-stage detection because it focuses on the most distinctive features in advance. With more data available for testing, all methods approach their best performance. On the Twitter and Weibo datasets, with highly noticeable duplicate content in each event, our method starts at 74.02% and 71.73% precision and 68.75% and 70.34% recall respectively, which corresponds to an average time lag of 20.47 h after the emergence of an event. This result is promising, because the average report time over the rumors given by Snopes and the Sina Community Management Center is 54 h and 72 h respectively [9], and our deep attention based early rumor detection technique can thus save much manual effort.

Apart from the numerical results, Fig. 4(e) visualises the varied attention effects on a detected rumor, where different color intensities reflect the degree of attention paid to each word in a post. In the rumor “School Principal Eujin Jaela Kim banned the Pledge of Allegiance, Santa and Thanksgiving”, most of the vocabulary closely connected with the event itself is given less attention weight than words expressing the users’ doubt, enquiry and anger caused by the rumor. Despite the massive duplication in users’ comments, the textual attention mechanism enables CallAtRumors to lay more emphasis on discriminative words, thus maintaining high performance in such cases.

Fig. 4. The charts in (a)–(d) show the performance of all methods as the training data size accumulates. The effect of the attention mechanism is visualized in (e). (Color figure online)

5 Conclusion

Rumor detection on social media is time-sensitive: because rumors can spread quickly and broadly, it is hard to eliminate their vicious impact late in the diffusion. In this paper, we introduced CallAtRumors, a novel recurrent neural network model based on a soft attention mechanism that automatically carries out early rumor detection by learning latent representations from sequential social posts. Experiments against five state-of-the-art rumor detection methods illustrate that CallAtRumors is sensitive to distinguishing words, and thus outperforms its competitors even when textual features are sparse at the beginning stage of a rumor. In future work, it would be appealing to investigate more complex features from opinion clustering results [29] and user behavior patterns [30] with our deep attention model to further improve early detection performance.