1 Introduction

With the rapid development of the Internet, social media has become an extremely popular source for individuals to obtain news and express their opinions. According to the Digital 2022 report (footnote 1), the number of social media users has risen to 4.62 billion, accounting for 58.4% of the global population, with an annual growth rate exceeding 10%. For instance, Facebook has emerged as the most influential social media platform, with 2.96 billion registered users worldwide (footnote 2). At the same time, the rapid growth of social media platforms has facilitated the dissemination of rumors, which can cause serious social panic [1].

Cantonese originated in Guangdong Province of China and is presently one of the most widely spoken languages among native and overseas Chinese. With over 82.4 million speakers (footnote 3), it has spread throughout overseas Chinese communities. The rapid dissemination of Cantonese rumors through social media has caused serious harm to society. Therefore, it is essential to develop an intelligent method to identify such rumors automatically.

Most existing rumor detection work focuses on Chinese and English [1, 13, 16, 19, 21], and only a few studies consider Cantonese rumors [3, 7, 10]. These studies use the textual information of tweets and feature engineering for Cantonese rumor detection [3, 7]. Cantonese rumor detection still faces two critical challenges. Firstly, existing studies mainly rely on text and user statistical features [3, 7]. These approaches do not take into account the dissemination structures of retweets and comments, and there is no available structural Cantonese rumor dataset that includes source tweets, retweets, and comments. The absence of such a benchmark dataset makes the detection of Cantonese rumors a daunting task. Secondly, despite the success of current rumor detection methods, external knowledge graphs are rarely integrated into them. Existing knowledge graph-based approaches mainly extract structural triplets (head, relation, tail) from the tweet text and compare them with trusted triplets from a knowledge graph to predict the truthfulness of tweets [5, 14]. However, they do not consider the information in retweets and comments, nor do they embed external knowledge into source tweets to assist in judging rumors. Meanwhile, existing research has applied Graph Convolutional Networks to rumor detection, but none has combined external knowledge with Graph Convolutional Networks for this task.

To address these two challenges, we first build and label a structural Cantonese rumor benchmark dataset based on Facebook, including source tweets, retweets, and comments. Secondly, we propose a novel feature extraction method based on Graph Convolutional Networks for external knowledge integration. This approach constructs a heterogeneous graph from official statements and Wikipedia entity descriptions and uses BERT to capture the text features of the source tweet. To generate correlation vectors, a Comparison Network compares the external knowledge with the source text. A Bidirectional Graph Convolutional Network is then used to extract the top-down and bottom-up propagation features of tweets during the rumor propagation process. Finally, these three types of features are fused to train a Cantonese rumor detection model. The main contributions of this paper are summarized as follows:

  • To the best of our knowledge, we are the first to construct a structural Cantonese rumor dataset containing source tweets, retweets, and comments in social networks, which is publicly available on GitHub (footnote 4). Specifically, we crawled 2,721 source tweets and 91,246 comments and retweets. After data cleaning and labeling, the resulting Cantonese rumor dataset includes 1,925 source tweets and 64,221 comments and retweets.

  • We construct a novel method that extracts external knowledge features with Graph Convolutional Networks and obtains correlation features through a Comparison Network. First, we construct a heterogeneous knowledge graph based on official statements and entity descriptions from Wikipedia. Then, we use a Heterogeneous Graph Convolutional Network to extract the embedded features of external knowledge. Finally, the Comparison Network produces correlation vectors from the external knowledge embeddings and the tweet text features.

  • We propose a novel Cantonese rumor detection model, BGEK, which integrates the text features, comparison features, and structural features of tweets. To the best of our knowledge, we are the first to apply external knowledge to Cantonese rumor detection. Our experimental results show that BGEK outperforms the state-of-the-art baseline models we compare against.

2 Related Work

Traditional rumor detection methods are based on machine learning and they mainly focus on text and statistical features, aiming to train a classifier for rumor detection through supervised learning. A series of methods have been proposed [2, 9, 20], such as Random Forest [9], Decision Tree [2] and Support Vector Machine (SVM) [20]. The above methods rely heavily on feature engineering, which takes plenty of time, and the detection performance is not ideal.

To extract rumor features automatically, a series of deep learning methods have been proposed [12, 13, 15, 17, 21]. They mostly use text features for rumor detection, for example, Recurrent Neural Network (RNN) [12], Convolutional Neural Network (CNN) [21], and tree-structured Recursive Neural Network (RvNN) [13]. Song et al. [15] proposed an adversary-aware rumor detection framework. Sun et al. [17] applied contrastive learning to rumor detection. However, these methods learn the features of the propagation structure inefficiently and do not consider the global structural characteristics of rumor dispersion.

In contrast to the deep learning-based methods above, Graph Convolutional Networks [8] have been applied to rumor detection in recent years [1, 11, 19]. Bian et al. [1] were the first to use Graph Convolutional Networks for rumor detection in social networks. Wei et al. [19] proposed Edge-enhanced Bayesian Graph Convolutional Networks to obtain robust node feature representations. Lu et al. [11] proposed a Graph-aware Co-Attention Network (GCAN) for interpretable disinformation detection.

Recent studies have recognized the significance of external knowledge for rumor detection, leading to rumor detection models enhanced with knowledge graphs [5, 6, 14, 16]. The methods in [5, 14] mainly extract structural triplets (head, relation, tail) from the tweet text and compare them with trusted triplets from a knowledge graph to predict the truthfulness of tweets.

Existing research on rumor detection in Cantonese [3, 7, 10] utilizes text and user information. Our previous work [3, 7] combined the semantic and statistical features of Cantonese tweets to detect Cantonese rumors on the Twitter platform. Lin et al. [10] proposed an annotation system that facilitates manual fact-checking. Existing Cantonese rumor detection methods mainly focus on text information and statistical features and do not use external knowledge graphs, which can provide corresponding evidence for rumor detection. They also do not utilize the structural characteristics of rumor propagation and dispersion, which limits further improvement of detection performance.

3 Methodology

In this section, we will describe the Cantonese rumor detection model based on the Bidirectional Graph Convolutional Network embedding external knowledge.

3.1 Dataset Construction

Existing Cantonese rumor datasets [3, 7] mainly focus on source tweets and user information, and deep learning models can be built on them. However, there is no structural Cantonese rumor dataset that includes retweets and comments, which is essential for research on Graph Neural Network-based Cantonese rumor detection. Facebook has become one of the essential social media websites for users to obtain news, and a large number of Cantonese rumors spread on it. We therefore choose Facebook as the research object and construct a fully structural Cantonese rumor dataset. Because Facebook restricts data crawling, we use the Selenium framework to crawl the structural information. The data is then manually labeled according to the method of [3]. After data cleaning and screening, we obtain a Cantonese rumor dataset, named Facebook-C-Dataset, which contains 1,925 source tweets and 64,221 comments and retweets. The tweets are further classified into three major domains, namely society, health, and information technology, and 49 specific topics, such as cancer, chronic diseases, radiofrequency radiation, and the COVID-19 vaccine. Table 1 shows the detailed composition of the dataset.

Table 1. Statistics of the Facebook-C-Dataset

3.2 BGEK Rumor Detection Model

The Cantonese rumor detection model BGEK is shown in Fig. 1. To make better use of external facts, we use a Comparison Network to embed external knowledge into the text representation of tweets. At the same time, we use a Bidirectional Graph Convolutional Network so that the retweets and comments generated while a tweet is disseminated also contribute to rumor detection. The specific implementation details are as follows:

Tweet Propagation Graph Construction. Given the source tweet, retweet, and comment information of each event, we represent the whole rumor dataset as \(C = \{ {c_1},{c_2},...,{c_m}\} \), where \({c_i}\) denotes the \(i\)-th event and m is the number of events. \({c_i}\) can be denoted as \({c_i} = \{ {r_i},t_1^i,t_2^i,...,t_{{n_i} - 1}^i,{G_i}\} \), where \({r_i}\) is the source tweet, \(t_j^i\) denotes the \(j\)-th responsive tweet, \({n_i}\) denotes the total number of posts in the event \({c_i}\) (the source tweet plus its \({n_i} - 1\) comments and retweets), and \({G_i}\) represents the rumor propagation graph of event \({c_i}\). \({G_i}\) is defined as \({G_i} = {<} {V_i},{E_i} {>} \), where the node set is \({V_i} = \{ {r_i},t_1^i,t_2^i,...,t_{{n_i} - 1}^i\} \), \({r_i}\) is the root node in the propagation graph, and \({E_i} = \{ e_{st}^i|s,t = 0,1,...,{n_i} - 1\} \) is the edge set. Each \(e_{st}^i\) represents a directed relationship between two of the tweets, retweets, and comments. The entries of the adjacency matrix \({A_i} \in \mathbb {R} {^{{n_i} \times {n_i}}}\) are initialized as:

$$\begin{aligned} a_{st}^i = \begin{cases} 1, & \text{if } e_{st}^i \in {E_i}\\ 0, & \text{otherwise} \end{cases} \end{aligned}$$
(1)

For each event \({c_i}\), there is a corresponding label \({y_i} \in Y\), where Y represents the set of event categories. Our goal is to train a classifier \(f:C \rightarrow Y\).
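The construction of Eq. (1) can be sketched in a few lines of dependency-free Python. The event below is hypothetical, and two conventions are assumptions rather than details stated in the text: node 0 is the source tweet, and an edge (s, t) points from a post to the post that responds to it.

```python
def build_adjacency(num_nodes, edges):
    """Return the n x n adjacency matrix of Eq. (1) as nested lists.

    `edges` holds (s, t) index pairs; entry [s][t] is 1 when the edge exists
    and 0 otherwise.
    """
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for s, t in edges:
        adjacency[s][t] = 1
    return adjacency


def transpose(matrix):
    """Reverse the direction of every edge."""
    return [list(row) for row in zip(*matrix)]


# Hypothetical event: source tweet 0, comments 1 and 2, and a reply 3 to comment 1.
edges = [(0, 1), (0, 2), (1, 3)]
A_top_down = build_adjacency(4, edges)
A_bottom_up = transpose(A_top_down)
```

Keeping the matrix directed matters here, because the transposed matrix describes the same event with every edge pointing back toward the source tweet.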

Structural Feature Extraction. Based on the relationships among source tweets, retweets, and comments, we construct a propagation graph \({G_i} = {<} {V_i},{E_i} {>} \) for each event \({c_i}\) and then build the adjacency matrix \({A_i} \in \mathbb {R} {^{{n_i} \times {n_i}}}\). We construct text features \({x_i}\) for each node in the graph, and the feature matrix can be represented as \(X = \{ {x_1},{x_2},{x_3},...,{x_{{n_i}}}\} \), where \({n_i}\) is the number of nodes in the event \({c_i}\), that is, the source tweet together with its comments and retweets. We use a Bidirectional Graph Convolutional Network (Bi-GCN) to calculate the node representations in the graph, which includes a top-down Graph Convolutional Network (TD-GCN) and a bottom-up Graph Convolutional Network (BU-GCN). The adjacency matrices for TD-GCN and BU-GCN can be represented as \({A^{TD}} = A\) and \({A^{BU}} = {A^T}\) respectively. The top-down and bottom-up propagation features can be obtained by two layers of GCN as follows:

Fig. 1. BGEK rumor detection model

$$\begin{aligned} {H_1^{BU} = \sigma ({\tilde{A}^{BU}}XW_0^{BU})} \end{aligned}$$
(2)
$$\begin{aligned} {H_2^{BU} = \sigma ({\tilde{A}^{BU}}H_1^{BU}W_1^{BU})} \end{aligned}$$
(3)

where \({\tilde{A}^{BU}}\) is the normalized adjacency matrix of \({A^{BU}}\), \(H_1^{BU}\), \(H_2^{BU}\) are the hidden features, \(W_0^{BU}\), \(W_1^{BU}\) are the weight matrices, and \(\sigma \) is the activation function. The top-down hidden features \(H_1^{TD}\), \(H_2^{TD}\) are obtained in the same way. Meanwhile, to make full use of the features of the source tweets, we concatenate the root node features of layer \(k - 1\) with the hidden features of layer k as follows:

$$\begin{aligned} {\tilde{H}_k^{BU} = concat(H_k^{BU},{(H_{k - 1}^{BU})^{root}})} \end{aligned}$$
(4)

The propagation features \(\tilde{H}_2^{TD}\) and the dispersion features \(\tilde{H}_2^{BU}\) obtained above are then concatenated to obtain the structural features:

$$\begin{aligned} {\ T = concat(\tilde{H}_2^{TD},\tilde{H}_2^{BU})} \end{aligned}$$
(5)
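Eqs. (2) to (5) can be traced end to end with a small dependency-free sketch. Several choices below are assumptions made only for illustration: the normalization \(D^{-1/2}(A + I)D^{-1/2}\) with row degrees of \(A + I\), ReLU as the activation \(\sigma \), random weights, node 0 as the root, and the toy dimensions. Any pooling of node features into a single event vector is not described above and is therefore omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


def normalize(adj):
    """Add self-loops and scale entry (i, j) by 1 / sqrt(d_i * d_j) (assumed form)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_branch(adj, x, w0, w1):
    """Two GCN layers (Eqs. 2-3) followed by root enhancement (Eq. 4, k = 2)."""
    a_norm = normalize(adj)
    h1 = relu(matmul(matmul(a_norm, x), w0))
    h2 = relu(matmul(matmul(a_norm, h1), w1))
    root_h1 = h1[0]  # features of the root node (index 0) at layer k - 1
    return [row + root_h1 for row in h2]  # list "+" concatenates features


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
A = [[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # top-down edges
A_T = [list(row) for row in zip(*A)]                          # bottom-up edges
X = random_matrix(4, 6, rng)                                  # 4 nodes, 6 features

td = gcn_branch(A, X, random_matrix(6, 5, rng), random_matrix(5, 3, rng))
bu = gcn_branch(A_T, X, random_matrix(6, 5, rng), random_matrix(5, 3, rng))
T = [a + b for a, b in zip(td, bu)]  # Eq. (5): per-node concatenation
print(len(T), len(T[0]))             # 4 16
```

Each branch yields 3 output features plus the 5 root features of the first layer per node, so the concatenated structural features have 16 columns.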

External Knowledge Extraction

External Knowledge Graph Construction. We construct a heterogeneous graph \(\omega = <V,E>\) of external knowledge for the source tweets, consisting of official statements and entity descriptions. The graph contains two different types of nodes: official statements \(R = \{ V_1^r,V_2^r,V_3^r,...,V_x^r\} \) and entity descriptions \(D = \{ V_1^d,V_2^d,V_3^d,...,V_y^d\} \), where x represents the number of official statements and y represents the number of entity descriptions. The set of edges E includes bidirectional connections and unidirectional connections. The external knowledge graph is constructed as follows:

The source tweets cover M specific aspects, which can be expressed as \(S = \{ {s_1},{s_2},{s_3},...,{s_M}\} \). A source tweet may belong to multiple aspects, and the tweets under the same aspect are similar in content. First, we bidirectionally connect the official statements constructed under each aspect. Then, for the entities contained in the official statement and the source tweet, we link each entity to its entry on Wikipedia and select the first paragraph as the entity description. Because the entity description and the official statement are related, we link the official statement and the entity description under the same aspect bidirectionally. Considering that the official statements corresponding to the same type of aspect have a certain similarity, we link the same type of official statement bidirectionally. Because an original tweet may belong to multiple aspects, we create an undirected connection edge between entity descriptions of the same type and other entity descriptions.
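One possible reading of these linking rules is sketched below; it is an interpretation, not the authors' implementation. The node identifiers and aspect names are hypothetical, every link is stored as a pair of directed edges, and the last rule is read as connecting entity descriptions that share at least one aspect.

```python
from itertools import combinations


def build_knowledge_graph(statements, descriptions):
    """Return a set of directed edges (u, v).

    `statements` and `descriptions` map a node id to the set of aspects that
    the official statement or entity description belongs to.
    """
    edges = set()

    def link(u, v):
        # A bidirectional link is stored as two directed edges.
        edges.add((u, v))
        edges.add((v, u))

    # Official statements under the same aspect.
    for (u, aspects_u), (v, aspects_v) in combinations(statements.items(), 2):
        if aspects_u & aspects_v:
            link(u, v)

    # Official statement and entity description under the same aspect.
    for s, aspects_s in statements.items():
        for d, aspects_d in descriptions.items():
            if aspects_s & aspects_d:
                link(s, d)

    # Entity descriptions sharing an aspect (assumed reading of the last rule).
    for (u, aspects_u), (v, aspects_v) in combinations(descriptions.items(), 2):
        if aspects_u & aspects_v:
            link(u, v)

    return edges


# Hypothetical nodes; the aspect names echo topics mentioned for the dataset.
statements = {"r1": {"vaccine"}, "r2": {"vaccine"}, "r3": {"radiation"}}
descriptions = {"d1": {"vaccine"}, "d2": {"radiation", "vaccine"}}
graph_edges = build_knowledge_graph(statements, descriptions)
```

In this toy graph the description `d2` belongs to two aspects, so it is the only path between the vaccine statements and the radiation statement.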

Heterogeneous Graph Convolutional Network Construction. Given the directed heterogeneous graph \(\omega = {<}V,E{>}\) constructed above, we use a Heterogeneous Graph Convolutional Network to learn representations of official statements and entity descriptions. We fine-tune a Chinese pre-trained BERT model on the Cantonese corpus we constructed. We then use BERT to obtain the node embedding feature matrix \(X \in \mathbb {R} {^{|V| \times D}}\), where \(X = \{ {x_1},{x_2},{x_3},...,{x_{|V|}}\} \) includes the features of all nodes on the heterogeneous graph and \({x_i}\) represents the feature of the \(i\)-th node. We define A as the adjacency matrix and D as the degree matrix. The heterogeneous graph convolutional layer then computes the features of the \((i + 1)\)-th layer by aggregating, for every node, the \(i\)-th layer features of its neighbors according to the adjacency matrix:

$$\begin{aligned} {\ A' = {D^{ - \frac{1}{2}}}(A + I){D^{ - \frac{1}{2}}}} \end{aligned}$$
(6)
$$\begin{aligned} {\ {H^{(i + 1)}} = \sigma (A'{H^i}{W^i})} \end{aligned}$$
(7)

where I is the identity matrix of dimension |V|, \(A'\) is the adjacency matrix after adding self-connections and normalization, \({W^i}\) is the weight matrix of the \(i\)-th layer, \({H^i}\) is the feature matrix of the \(i\)-th layer, and \(\sigma \) is the activation function.
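As a small worked instance of Eqs. (6) and (7), consider a hypothetical graph in which one official statement \(v_1\) is linked bidirectionally to two entity descriptions \(v_2\) and \(v_3\). The example assumes, as in the standard GCN formulation, that D holds the degrees of \(A + I\); the definition above does not state whether self-connections are counted.

$$\begin{aligned} A = \begin{pmatrix} 0 & 1 & 1\\ 1 & 0 & 0\\ 1 & 0 & 0 \end{pmatrix},\quad A + I = \begin{pmatrix} 1 & 1 & 1\\ 1 & 1 & 0\\ 1 & 0 & 1 \end{pmatrix},\quad D = \mathrm{diag}(3,2,2) \end{aligned}$$

$$\begin{aligned} A' = {D^{ - \frac{1}{2}}}(A + I){D^{ - \frac{1}{2}}} = \begin{pmatrix} \frac{1}{3} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}}\\ \frac{1}{\sqrt{6}} & \frac{1}{2} & 0\\ \frac{1}{\sqrt{6}} & 0 & \frac{1}{2} \end{pmatrix} \end{aligned}$$

Writing \(h_j\) for the \(j\)-th row of \({H^i}\), the updated feature of \(v_2\) under this assumption is \(\sigma ((\frac{1}{\sqrt{6}}{h_1} + \frac{1}{2}{h_2}){W^i})\): a degree-weighted mix of the node's own feature and that of its only neighbor, the official statement.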

Comparison Feature Extraction. We get the embedded representation of external knowledge \(K = \{ {k_1},{k_2},{k_3},...,{k_{|V|}}\} \) through the above-mentioned Heterogeneous Graph Convolutional Networks. The text of the source tweet can be expressed as

$$\begin{aligned} {\ T = \{ {t_1},{t_2},{t_3},...,{t_{|C|}}\}} \end{aligned}$$
(8)

where |C| represents the number of source tweets in the dataset. We fine-tune a Chinese pre-trained BERT model on the constructed Cantonese corpus, and the text features are then obtained as follows:

$$\begin{aligned} {\ B = BERT(T)} \end{aligned}$$
(9)

where \(B = \{ {b_1},{b_2},{b_3},...,{b_{|C|}}\} \) are the text features of the source tweets. We then obtain the comparison vector by comparing the text feature \({b_n}\) of the source tweet with the knowledge embedding feature \({k_n}\):

$$\begin{aligned} {\ {C_n} = {f_{cmp}}({b_n},{k_n})} \end{aligned}$$
(10)

where \({f_{cmp}}()\) is the comparison function. Based on [6], we design the comparison function as \({f_{cmp}}(x,y) = W[x - y,x \odot y]\), where W is the dimension transformation matrix, x and y are the text feature vector of the source tweet and the knowledge embedding feature vector respectively, and \( \odot \) represents the Hadamard product.
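The comparison function can be written out directly. The sketch below assumes that the two vectors have the same length and uses a hand-picked, purely illustrative matrix W; in the model W would be a learned parameter.

```python
def compare(text_vec, knowledge_vec, weight):
    """f_cmp(x, y) = W [x - y, x ⊙ y] for two equal-length vectors."""
    if len(text_vec) != len(knowledge_vec):
        raise ValueError("both vectors must have the same dimension")
    difference = [x - y for x, y in zip(text_vec, knowledge_vec)]
    hadamard = [x * y for x, y in zip(text_vec, knowledge_vec)]
    joined = difference + hadamard  # the concatenation [x - y, x ⊙ y]
    return [sum(w * v for w, v in zip(row, joined)) for row in weight]


b_n = [1.0, 2.0, 3.0]   # hypothetical text feature of a source tweet
k_n = [0.5, 2.0, -1.0]  # hypothetical knowledge embedding
# Illustrative W (2 x 6): keeps the first difference and the last product.
W = [[1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1]]
c_n = compare(b_n, k_n, W)  # [0.5, -3.0]
```

The difference term captures where the tweet and the knowledge disagree, while the Hadamard product captures where they align, so W can weigh both kinds of evidence.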

Feature Concatenation. First, we concatenate the source tweet text features \({B_n}\), the comparison features \({C_n}\), and the structural features \({T_n}\) derived from retweets and comments to obtain the vector

$$\begin{aligned} {\ {F_n} = concat({B_n},{C_n},{T_n}) \in \mathbb {R} {^{|{B_n}| + |{C_n}| + |{T_n}|}} } \end{aligned}$$
(11)

\({F_n}\) is then passed through a fully connected layer followed by Softmax, which can be represented as

$$\begin{aligned} {\ Z = Softmax (W{F_n} + b) } \end{aligned}$$
(12)

Here, W is the parameter matrix of the fully connected layer, and b is the bias matrix of the fully connected layer.
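Eqs. (11) and (12) amount to a concatenation followed by a linear layer and Softmax. The following sketch uses hypothetical feature values, a two-class output, and hand-set parameters in place of trained ones.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_feat, cmp_feat, struct_feat, weight, bias):
    """Eq. (11): concatenate the three feature vectors; Eq. (12): linear + softmax."""
    fused = text_feat + cmp_feat + struct_feat  # list "+" is concatenation
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weight, bias)]
    return fused, softmax(logits)


# Hypothetical features B_n, C_n, T_n of sizes 2, 1, and 2.
B_n, C_n, T_n = [0.2, -0.1], [0.7], [0.0, 0.3]
W = [[1.0, 0.0, 1.0, 0.0, 0.0],   # illustrative weights, one row per class
     [0.0, 1.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
F_n, Z = classify(B_n, C_n, T_n, W, b)
```

The length of `F_n` equals the sum of the three input lengths, matching the dimension stated in Eq. (11).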

4 Experiments

4.1 Experiment Settings

All experiments are conducted on a workstation with an NVIDIA A100-SXM4 GPU with 80 GB of memory. The dataset used in the experiments is the Facebook-C-Dataset. To extract structural features, each source tweet, retweet, and comment is represented by the TF-IDF scores of the top 5,000 words. Model performance is evaluated with accuracy, precision, recall, and F1 score, and each reported value is the average over ten-fold cross-validation.
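The evaluation protocol can be sketched without external libraries. The code below covers only the ten-fold split and the four metrics; it assumes a binary task with rumor as the positive class (label 1), and the shuffling seed and the way folds are formed are illustrative choices, since the text does not specify them.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Ten folds over the 1,925 source tweets of the dataset.
splits = list(k_fold_indices(1925, k=10))

# Metrics on a tiny hypothetical prediction vector.
acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a full run, a model would be trained on each `train` list, scored on the matching test fold, and the ten metric tuples averaged.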

4.2 Performance Comparison with Baselines

To assess the efficacy of our proposed BGEK model, we compare it with various state-of-the-art baselines. The results of the different models for Cantonese rumor detection are shown in Table 2. Text-based features are denoted by T, external knowledge-based features by K, and structural features by S. Among the baseline models, BERT outperforms the other models while using only text information, which suggests that the text features of source tweets are important for identifying rumors. Our proposed BGEK model integrates the text features, comparison features, and structural features and achieves the best performance on all metrics.

Table 2. Results of comparison among different models in Facebook-C-Dataset

4.3 Ablation Experiment

Our proposed BGEK model integrates external knowledge features, text features, and propagation structure features. To assess the individual contribution of each feature to the BGEK model, we conducted an ablation experiment using the variants shown in Table 3. T represents the text features extracted from the source tweets, K represents the external knowledge features, and S represents the structural features originating from the source tweets, retweets, and comments.

Table 3. The description of different variants

The evaluation metrics obtained for the different variants are shown in Fig. 2. The results show that all variants perform worse than the complete BGEK model, which integrates the text features, comparison features, and structural features. The external knowledge feature plays the most important role in identifying Cantonese rumors, which illustrates the necessity of external knowledge for better performance.

Fig. 2. Results of ablation experiment among different variants

Fig. 3. Visualizing embeddings among different models

4.4 Embedding Visualization

To visualize the feature embeddings, Fig. 3 shows the embeddings produced by the different models on the Facebook-C-Dataset. From the plots, we find that our proposed BGEK model separates rumor and non-rumor samples more clearly than the other methods.

4.5 Robustness Experiment

In this experiment, a randomly selected portion of the training labels is intentionally mislabeled, at ratios ranging from 5% to 45%. The model is then re-trained on the modified training set and tested for its ability to withstand noise at different levels. The experimental results in Fig. 4 show that the F1 score of all models decreases as the noise rate increases. Notably, our proposed model is the most robust, with the smallest decline among the compared models.
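The label corruption step can be sketched as follows. The binary label set, the fixed seed, and the 5-point step between noise rates are assumptions made for illustration; the text only gives the range from 5% to 45%.

```python
import random


def inject_label_noise(labels, noise_rate, classes=(0, 1), seed=0):
    """Return a copy of `labels` with a `noise_rate` fraction assigned a wrong class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    for idx in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in classes if c != noisy[idx]]
        noisy[idx] = rng.choice(alternatives)
    return noisy


# Hypothetical balanced training labels (1 = rumor, 0 = non-rumor).
train_labels = [0, 1] * 50
noise_rates = [r / 100 for r in range(5, 50, 5)]  # 0.05, 0.10, ..., 0.45
noisy_sets = {rate: inject_label_noise(train_labels, rate) for rate in noise_rates}
```

Only the training labels are altered; the test labels stay clean so that the measured F1 score reflects how much the noise harmed the learned model.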

Fig. 4. Results of robustness experiment

Table 4. Results of comparison among different models in Twitter15 and Twitter16

4.6 Transferability Experiments on Twitter15 and Twitter16

To test whether our proposed BGEK model transfers to other rumor detection settings, we conducted comparative experiments on the Twitter15 and Twitter16 datasets using the aforementioned baselines.

In these datasets, each event is labeled as Non-rumor (NR), False Rumor (F), True Rumor (T), or Unverified Rumor (U). Because no external knowledge graph is available for these datasets, our proposed model does not use comparison features here. The experimental results are presented in Table 4. Among all baseline models, BERT achieves the best accuracy on Twitter15 and EBGCN achieves the best accuracy on Twitter16. By fusing text features and structural features, our model achieves the best detection results, which indicates that it generalizes beyond the Cantonese dataset.

5 Conclusion

In this paper, we construct a structural Cantonese rumor dataset based on the Facebook platform. Additionally, we propose a novel approach for extracting embedding features that incorporate official statements and entity descriptions of external knowledge by utilizing a Heterogeneous Graph Convolutional Networks. A Comparison Network is proposed to generate comparison features by comparing external knowledge with source tweets. Subsequently, we propose a novel Cantonese rumor detection model named BGEK (Bidirectional Graph Convolutional Networks Embedded with External Knowledge) that integrates text features, comparison features, and structural features. Five main experiments are conducted to evaluate the performance of our proposed BGEK model and experimental results demonstrate that our proposed model outperforms other state-of-the-art models.