1 Introduction

Social media has become a major channel of news consumption, mainly because it is free, easy to access, and able to spread posts rapidly. It is therefore an excellent way for individuals to obtain and publish various kinds of information [13,14,15]. However, the quality of news on social media is often lower than that of traditional news sources because content on social media cannot be effectively supervised [2, 10, 15]. In other words, social media also allows fake news to spread widely. Recently in particular, false information about the coronavirus disease 2019 (COVID-19) has spread like a virus around the world. The state of the Internet is forcing us to take unprecedented actions to protect the "information health" of the public [15, 16].

Identifying false information on social media is important but challenging, partly because human readers cannot reliably distinguish true news from fake news [11]. To facilitate the study of fake news, researchers have released many fake news datasets such as BuzzFeedNews, LIAR [18], CREDBANK [7], BuzzFace [12], FacebookHoax [17], and FakeNewsNet [13,14,15], which contain linguistic and social context features of social media content. Despite the existence of multiple fake news datasets, the lack of a comprehensive and effective computational solution for detecting fake news remains one of the major obstacles.

Existing research on fake news detection can be roughly divided into two categories: supervised methods based on machine learning [1, 3, 5, 8, 9] and supervised methods based on deep learning [4, 6, 19,20,21]. These models have achieved some results on various fake news detection datasets. Shu et al. [15] applied standard machine learning models, including support vector machines (SVM), logistic regression (LR), Naive Bayes (NB), and a convolutional neural network (CNN), and obtained similar accuracies of around 58% to 63%. Jwa [6] applied the Bidirectional Encoder Representations from Transformers (BERT) model to detect fake news with an accuracy of around 75%. An ensemble learning model combining four different models is proposed for fake news detection in [4], achieving a higher accuracy of 72.3%. The accuracy obtained by FakeDetector with the Deep Diffusive Network Model (DDNM) in [21] is 0.63. A model named TI-CNN (Text and Image information based Convolutional Neural Network) is proposed in [20]; by projecting explicit and latent features into a unified feature space, TI-CNN is trained with text and image information simultaneously. However, most fake news detection methods treat the task as a supervised learning problem: given an existing dataset of fake news, train a classifier so that it can accurately predict the authenticity of news. In practice, annotated datasets are rare and hard to obtain as fake news circulates across websites. In addition, supervised learning models cannot achieve self-learning because they ignore the correlation between real and fake data.

Therefore, building on some state-of-the-art methods, our work studies a self-learning semi-supervised deep learning network that trains supervised and unsupervised tasks simultaneously to detect fake news on social media, and compares the results with existing supervised learning methods. Specifically, the contributions of this paper are as follows: 1. we design a semi-supervised deep learning network that simultaneously trains supervised and unsupervised tasks using a modified deep learning machine; 2. we make it possible to automatically add unlabeled data with highly confident pseudo-labels to the training set and continuously expand the training set over multiple training iterations to achieve self-learning; 3. compared with existing machine learning methods and deep learning networks, especially on incompletely annotated or relatively small training datasets, our method performs better. In particular, its performance is improved by about 10% compared with neural networks and even more compared with machine learning methods.

2 Self-learning semi-supervised deep learning model

Figure 1 shows the workflow of our paper: a) the data collection process, and b) the semi-supervised self-learning deep learning model, which simultaneously trains supervised and unsupervised tasks using a modified deep learning machine L. The supervised task trains a learning machine that requires only a small portion of labeled data, while the unsupervised task predicts the remaining unlabeled part and returns highly confident pseudo-labels of unlabeled data to enrich the labeled dataset.

Fig. 1 The workflow of our paper

2.1 Model training process

Let \(D_l\) denote the labeled examples in the training dataset, with size \(|L|\) and \(D_l^0 = \{(X_1, y_1), (X_2, y_2), \dots, (X_l, y_l)\}\), and let \(D_u\) denote the unlabeled examples in the test dataset, with size \(|U|\) and \(D_u = \{X_{l+1}, X_{l+2}, \dots, X_{l+u}\}\). As shown in Fig. 1b), the workflow of the self-learning semi-supervised deep learning machine can be described as follows:

Initialize:

In the supervised deep learning module, \(D_l^0\) is used as the training set to train the deep learning machine L. Then, in the unsupervised deep learning module, the pseudo-labels of \(D_u^{\prime} = \{(X_{l+1}, \hat{y}_{l+1}), (X_{l+2}, \hat{y}_{l+2}), \dots, (X_{l+u}, \hat{y}_{l+u})\}\) are generated by the trained deep learning machine L together with their confidence values \(\sigma\). If \(\sigma_0\) is the threshold used to filter out unconfident pseudo-labels in \(D_u^{\prime}\), then the confident pseudo-label set of \(D_u^{\prime}\) can be expressed as \(D_{pseu}^0 = \{(X_{l+i}, \hat{y}_{l+i}), (X_{l+i+1}, \hat{y}_{l+i+1}), \dots, (X_{l+i+p}, \hat{y}_{l+i+p})\}\) with size \(|P_0|\).

Repeat:

Then, the new training set \(D_l^1 = D_l^0 \cup D_{pseu}^0 = \{(X_1, y_1), (X_2, y_2), \dots, (X_l, y_l), \dots, (X_{l+p}, y_{l+p})\}\) is used to retrain the deep learning machine L, which generates a new confident pseudo-label set \(D_{pseu}^1\) with size \(|P_1|\) and a new training set \(D_l^2 = D_l^1 \cup D_{pseu}^1\). This step is repeated until \(D_{pseu}^t = D_{pseu}^{t+1}\). Our experiments show that this algorithm converges to a good solution quickly. A minimal sketch of the loop is given below.
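The following Python sketch illustrates this self-training loop. It uses a scikit-learn logistic regression classifier with TF-IDF features as a hypothetical stand-in for the deep learning machine L, and an assumed confidence threshold of 0.9; these choices are illustrative, not the exact configuration used in the paper.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(texts_l, y_l, texts_u, sigma_0=0.9, max_iter=10):
    """Self-learning loop: train on labeled data, pseudo-label confident
    unlabeled examples, add them to the training set, and repeat."""
    vec = TfidfVectorizer(max_features=5000)
    X_all = vec.fit_transform(texts_l + texts_u)
    X_l, X_u = X_all[:len(texts_l)], X_all[len(texts_l):]
    y_l = np.asarray(y_l)

    train_X, train_y = X_l, y_l
    prev_n_pseudo = -1
    for _ in range(max_iter):
        clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
        proba = clf.predict_proba(X_u)        # class probabilities for D_u
        sigma = proba.max(axis=1)             # confidence value per example
        confident = sigma > sigma_0           # filter with threshold sigma_0
        if confident.sum() == prev_n_pseudo:  # confident set stopped growing
            break                             # (approximates D_pseu^t = D_pseu^{t+1})
        prev_n_pseudo = int(confident.sum())
        pseudo_y = proba.argmax(axis=1)[confident]
        # New training set: original labels plus confident pseudo-labels
        train_X = vstack([X_l, X_u[confident]])
        train_y = np.concatenate([y_l, pseudo_y])
    return clf
```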

2.2 The basic architecture of deep learning machine L

The deep learning machine L is constructed by adding a confidence-function layer to an existing neural network, such as a recurrent neural network (RNN), CNN, long short-term memory (LSTM) network, or BI-LSTM. Here, we take BI-LSTM as an example to introduce the architecture of the deep learning machine L. Its major components are described below:

2.2.1 Token embedding layer

The token embedding layer maps each token in the input sequence to a token embedding. Word vectors can be trained in various ways, such as one-hot embedding, distributed representations, neural network language models, word2vec, and BERT. Word2vec was selected as our token embedding layer in this work. Given a text sequence S = ω0, ω1, …, ωs extracted from a collection, the forward computation of the skip-gram model can be written as:

$$ p\left(\omega_0 \mid \omega_i\right)=\frac{e^{U_0\cdot V_i}}{\sum_j e^{U_j\cdot V_i}}, $$

where \(V_i\) is a column vector of the embedding-layer matrix, also called the input vector of \(\omega_i\), and \(U_j\) is a row vector of the softmax-layer matrix, also known as the output vector of \(\omega_j\).

The loss function of the skip-gram model with negative sampling is obtained by summing the log-probabilities of the positive example and k sampled negative examples from the target corpus, each scored with binary logistic regression (μ below denotes the logistic sigmoid function):

$$ J\left(\theta\right)=\log \mu\left(U_0\cdot V_i\right)+\sum_{j=1}^{k} E_{\omega_j\sim p_n\left(\omega\right)}\left[\log \mu\left(-U_j\cdot V_i\right)\right]. $$
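As a concrete illustration, skip-gram embeddings of this kind can be trained with the gensim library; the toy corpus, vector size, window, and negative-sampling count below are placeholder values, not the settings reported in this paper.

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (placeholder data).
sentences = [
    ["breaking", "news", "vaccine", "cures", "all", "diseases"],
    ["officials", "confirm", "the", "report", "was", "fabricated"],
]

# sg=1 selects the skip-gram architecture; negative=5 enables negative
# sampling, matching the loss J(theta) above.
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1,
               negative=5, min_count=1, epochs=10)

vector = w2v.wv["news"]                         # input vector V_i for "news"
similar = w2v.wv.most_similar("news", topn=3)   # nearest tokens in the space
```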

2.2.2 Dropout layer

The dropout layer is a simple way to prevent neural networks from overfitting. Large networks are also slow to use, which makes it difficult to address overfitting by combining the predictions of many different large networks at test time. The key idea of dropout is to randomly drop units (together with their connections) from the neural network during training, which prevents units from co-adapting too much. The random sampling probability was set to 0.5 in this paper; it can also be tuned on a validation set.

2.2.3 BI-LSTM layer

In this subsection, we briefly review the fundamentals of BI-LSTM networks, which are a type of RNN composed of a forward LSTM and a backward LSTM. The LSTM selectively forgets part of the historical information through three gates (input gate, forget gate, and output gate), adds part of the current input information, and finally integrates them into the current state to generate the output state. It takes a sequence \(\{x_s\}_{s=1}^S\) of length S as its input and outputs an S-long sequence \(\{h_s\}_{s=1}^S\) of hidden state vectors using the following equations:

$$ \begin{array}{l}
f_t=\sigma\left(W_f\cdot \left[h_{t-1},x_t\right]+b_f\right),\\
i_t=\sigma\left(W_i\cdot \left[h_{t-1},x_t\right]+b_i\right),\\
C_t=f_t\ast C_{t-1}+i_t\ast \tanh\left(W_C\cdot \left[h_{t-1},x_t\right]+b_C\right),\\
o_t=\sigma\left(W_o\cdot \left[h_{t-1},x_t\right]+b_o\right),\\
h_t=o_t\ast \tanh\left(C_t\right),
\end{array} $$

where \(h_0 = 0\), the sigmoid \(\sigma\) and tanh functions are applied element-wise, and the W matrices and b vectors are the trainable parameters of the LSTM.
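Putting the embedding, dropout, BI-LSTM, and softmax layers together, a minimal Keras sketch of the base network (before the confidence-function layer is added) could look as follows; the vocabulary size, sequence length, embedding dimension, and LSTM width are illustrative choices, not the paper's reported hyperparameters.

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20000, 100, 200, 2

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    # Token embedding layer (could be initialized with word2vec vectors).
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Dropout layer with the sampling probability 0.5 used in this paper.
    tf.keras.layers.Dropout(0.5),
    # BI-LSTM layer: forward and backward LSTMs over the token sequence.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    # Softmax layer producing class probabilities p(y = j | X_i).
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```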

2.2.4 Softmax layer

Using a fully-connected neural network, the label prediction layer maps the output of the BI-LSTM layer to a vector containing the probability of each label. The softmax layer, one of the most popular output units for neural networks, is used for multi-class classification in this paper. It maps the outputs of many neurons into the interval (0, 1), which can be interpreted as the probabilities of the classes. For \(D_l^0 = \{(X_1, y_1), (X_2, y_2), \dots, (X_l, y_l)\}\) with k classes, \(y^{(i)} \in \{1, 2, 3, \dots, k\}\), and for every input \(X_i\),

$$ p\left(y=j \mid X_i\right)=\left[\begin{array}{c}
p\left(y^{(i)}=1 \mid X_i;\theta\right)\\
p\left(y^{(i)}=2 \mid X_i;\theta\right)\\
\vdots\\
p\left(y^{(i)}=k \mid X_i;\theta\right)
\end{array}\right]
=\frac{1}{\sum_{j=1}^k e^{\theta_j^T\cdot X_i}}
\left[\begin{array}{c}
e^{\theta_1^T\cdot X_i}\\
e^{\theta_2^T\cdot X_i}\\
\vdots\\
e^{\theta_k^T\cdot X_i}
\end{array}\right]. $$
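A minimal numerical check of this mapping, assuming the logits \(\theta_j^T \cdot X_i\) have already been computed:

```python
import numpy as np

def softmax(logits):
    """Map raw scores theta_j^T . X_i to class probabilities in (0, 1)."""
    z = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 0.5, -1.0]))   # e.g. k = 3 classes
print(probs, probs.sum())                     # probabilities sum to 1
```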

2.2.5 Confidence-function layer

As described in Section 2.1, the confidence-function layer calculates the confidence value \(\sigma\) of each element in \(D_u\) and generates the pseudo-labels in \(D_u^{\prime}\). For every input \(X_i\),

$$ \sigma_{X_i}=\max_{j}\, p\left(y=j \mid X_i\right). $$

Suppose \(\sigma_0\) is the threshold used to filter out unconfident pseudo-labels in \(D_u^{\prime}\). Then, for an element \(X_i\) in \(D_u^{\prime}\), if \(\sigma_{X_i} > \sigma_0\), its confident pseudo-label is

$$ \hat{y}_i=\left\{\begin{array}{ll}1, & \text{if } j=\operatorname{argmax}_j\, p\left(y=j \mid X_i\right),\\ 0, & \text{otherwise}.\end{array}\right. $$
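In practice, this layer can be implemented as a thin post-processing step over the softmax output. A minimal sketch, assuming `model` is the trained Keras network from Section 2.2.3, `X_u` the padded unlabeled sequences, and 0.9 an assumed threshold:

```python
import numpy as np

sigma_0 = 0.9                          # assumed confidence threshold

proba = model.predict(X_u)             # softmax output p(y = j | X_i)
sigma = proba.max(axis=1)              # confidence value sigma_{X_i}
confident = sigma > sigma_0            # keep only confident pseudo-labels

pseudo_labels = proba.argmax(axis=1)[confident]   # hat{y}_i as class indices
X_pseudo = X_u[confident]              # confident examples, to be merged into D_l^1
```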

We then obtain the whole confident pseudo-label set \(D_{pseu}^0\) of \(D_u^{\prime}\), with size \(|P_0|\). As described in Section 2.1, the new training set \(D_l^1 = D_l^0 \cup D_{pseu}^0\) is used to retrain the deep learning machine L, and this process is repeated until \(D_{pseu}^t = D_{pseu}^{t+1}\).

3 Experiments and results

3.1 Materials and datasets

The fake news data repository FakeNewsNet consists of two comprehensive datasets, each featuring news content, social context, and spatiotemporal information; the repository was released in 2019 and is constantly updated. The latest versions of the PolitiFact and GossipCop datasets from the FakeNewsNet repository were used to detect fake news in this paper.

3.2 Evaluation metrics

To evaluate the performance of the self-learning semi-supervised deep learning model and compare it with other existing machine learning and deep learning methods, we use precision, recall, and F1-measure as evaluation metrics:

$$ \begin{array}{c}
precision=\frac{TP}{TP+FP},\qquad recall=\frac{TP}{TP+FN},\\[4pt]
F_1=\frac{2\cdot precision\cdot recall}{precision+recall}.
\end{array} $$

Here, true positives (TP) are fake news items correctly identified as fake, false positives (FP) are real news items mistakenly identified as fake, and false negatives (FN) are fake news items that are not identified as fake.
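These metrics can be computed directly with scikit-learn; the label vectors below are placeholders, with 1 denoting the fake (positive) class.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]      # ground-truth labels (1 = fake)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]      # model predictions

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```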

3.3 Experimental results

We evaluated the performance of the self-learning semi-supervised deep learning model by comparing it with machine learning methods such as SVM and NB, and with deep learning methods such as the BI-LSTM network and CNN; our method used these networks as the deep learning machine L, respectively. For the machine learning baselines, we used the default settings provided in scikit-learn, without tuning parameters.

As shown in Table 1(a), when we used 80% of labeled data for training and 20% of unlabeled data for testing, the self-learning semi-supervised deep learning model based on L achieved a precision of 0.90, a recall of 0.86, and an F1-score of 0.88, demonstrating the best performance. The results of the deep learning methods did not differ significantly from one another, but were about 30% better than those of the machine learning methods. As shown in Table 1(b), when we used 50% of labeled data for training and 50% of unlabeled data for testing, the precision of our method was 0.88, about 30% higher than that of the machine learning methods and 10% higher than that of the deep learning methods.

Table 1 (a) Experimental results when 80% of labeled data were used for training and 20% of unlabeled data for testing; (b) Experimental results when 50% of labeled data were used for training and 50% of unlabeled data for testing; (c) Experimental results when 20% of labeled data were used for training and 80% of unlabeled data for testing

As shown in Table 1(c), when we used 20% of labeled data for training and 80% of unlabeled data for testing, the precision of our method was 0.88, about 40% higher than that of the machine learning methods and 15% higher than that of the deep learning methods. This shows that our method performs better and more consistently than supervised learning methods, such as deep learning models and machine learning methods, on incompletely annotated or relatively small training datasets.

4 Conclusion

The fast spread of fake news has recently raised concerns around the world. Fake political news can have severe consequences, so identifying it grows in importance. In this paper, we designed a self-learning semi-supervised deep learning network to detect fake news on social media. A confidence-function layer automatically returns confident results and adds them to the training set, helping the neural network accumulate positive sample cases. We used the FakeNewsNet dataset to demonstrate the superior accuracy of our method over other state-of-the-art supervised learning models. When we used 80% of labeled data for training and 20% of unlabeled data for testing, the self-learning semi-supervised deep learning model based on L achieved a precision of 0.90, a recall of 0.86, and an F1-score of 0.88, demonstrating the best performance, about 30% better than that of the machine learning methods. When we used 50% of labeled data for training and 50% of unlabeled data for testing, the precision of our method was 0.88, about 30% higher than that of the machine learning methods and 10% higher than that of the deep learning methods. When we used 20% of labeled data for training and 80% of unlabeled data for testing, the precision of our method was 0.88, about 40% higher than that of the machine learning methods and 15% higher than that of the deep learning methods. These results show that our method performs better and more consistently on incompletely annotated or relatively small training datasets.

In future work, since the self-learning semi-supervised deep learning network proposed in this paper can automatically return and add correct results with only a small amount of labeled data to accumulate positive sample cases, we will collect and establish fake news datasets on social media related to COVID-19 and use semi-supervised learning methods to detect fake news about COVID-19. We will also try to apply self-learning semi-supervised deep learning networks to the detection of multi-source and multi-class fake news.