1 Introduction

Nowadays, Online Social Networks (OSNs) have rapidly developed and become an effective platform for communication. According to recent surveys, OSNs have about 3 billion users, and many users consider OSNs the source of their daily information [46]. Unfortunately, OSNs are also exploited to spread misinformation, rumors and fake news, which have caused significant economic and political consequences, see [1, 13, 30]. Therefore, it is of great practical importance to detect the propagation of misinformation on social media before it causes serious consequences. This task has motivated several strategies to prevent misinformation, such as blocking users or links [21, 38, 41, 42] and disseminating good information to correct misinformation [4, 39]. Recently, some works have shown that misinformation and fake news can be automatically detected by machine learning techniques from temporal, structural and linguistic features of users [23], content of posts and microblog-specific memes [32,33,34, 44]. We refer to all techniques that detect misinformation by exploiting users' behaviours as “monitor/sensor” placements.

Based on those studies, some authors have proposed resource-management approaches that detect misinformation or outbreaks in a social network by placing monitors/sensors at some critical nodes, such as detecting epidemic cascades under a cost constraint [25], misinformation detection under a monitor-set size constraint [52, 53], and timely misinformation detection by heuristic approaches [54]. However, previous works fail to handle many real scenarios. Suppose that we need to monitor all the users in a group in an OSN; monitor placement strategies with cost or size constraints are not suitable because a fixed budget may not suffice to cover all users in the group. In this scenario, a monitor placement strategy of minimal size that ensures all nodes in the group can be monitored is obviously more appropriate than previous strategies.

Motivated by the above phenomenon, in this paper we propose the Minimum Budget for Misinformation Detection (\({\mathsf {MBD}}\)) problem, which aims to find the smallest set of nodes at which to place monitors in a social network so that the detection function \({{\mathbb {D}}}(\cdot )\), which evaluates information spread from a given set S of suspected misinformation nodes, is at least a given threshold \(\gamma \). The threshold \(\gamma \) controls the scale of the misinformation monitoring strategy: the greater the value of \(\gamma \), the more users are monitored. \({\mathsf {MBD}}\) is more relevant in practice, as we often have to monitor misinformation throughout a network. The main challenge of this problem comes from its inapproximability and the complexity of calculating the detection function. We show that calculating the objective function is #P-hard and that it is NP-hard to approximate the problem within a ratio of \((1-\epsilon )\ln n\). To overcome this challenge, we propose two randomized algorithms with provable guarantees. Our contributions are summarized as follows:

  • We formulate Minimum Budget for Misinformation Detection (\({\mathsf {MBD}}\)) under the well-known Independent Cascade (\({\mathsf {IC}}\)) information diffusion model. We show that calculating the objective function is #P-hard and that it is NP-hard to approximate the problem within a ratio of \((1-\epsilon )\ln n\) for any \(\epsilon >0\), unless NP \(\subseteq DTIME(n^{O(\log \log n)})\).

  • We develop novel techniques to estimate the function \({{\mathbb {D}}}(\cdot )\) by proposing the detection sample (\({\mathsf {DS}}\)) and importance detection sample (\({\mathsf {IDS}}\)) concepts with correctness proofs. Based on these, we show that \({{\mathbb {D}}}(\cdot )\) is a monotone and submodular function and propose a Greedy algorithm providing an approximation ratio of \(1+ \ln (\gamma /\epsilon )\) for any \(\epsilon >0\). In order to find solutions on large-scale networks, we further propose two efficient randomized algorithms, named Sampling-based Misinformation Detection (\({\mathsf {SMD}}\)) and Importance Sampling-based Misinformation Detection (\({\mathsf {ISMD}}\)), which utilize the estimations of the detection function from the \({\mathsf {DS}}\) and \({\mathsf {IDS}}\) concepts and devise an algorithmic framework to select a near-optimal solution. We prove that these algorithms return a solution A satisfying \(|A| \le 1+ |A^*| \cdot \ln \frac{\gamma -\gamma \epsilon }{\epsilon }\) and \( {{\mathbb {D}}}(A) \ge \gamma \cdot \frac{1-\epsilon }{1+\epsilon }-\epsilon \) with high probability, where \(\epsilon \in (0,1)\) is an input and \(A^*\) is an optimal solution.

  • We conduct extensive experiments on real social networks to demonstrate the effectiveness and scalability of our algorithms. \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) not only give an approximation guarantee, but also scale to very large networks (the Email-Eu-All network contains 265K nodes and 420K edges), and they outperform state-of-the-art algorithms in terms of solution quality and running time. In addition, the results show that \({\mathsf {ISMD}}\) requires fewer samples and less memory than the other algorithms.

Organization The rest of the paper is organized as follows. We summarize the related literature in Sect. 2. Next, we introduce the information diffusion model, problem definition and its inapproximability in Sect. 3. In Sect. 4 we present our proposed algorithms. The experiments on several datasets are in Sect. 5. Finally, we conclude the paper in Sect. 6.

2 Related works

In this section, we review previous works related to misinformation detection in two areas: information diffusion models and influence maximization, and misinformation detection.

Information diffusion model and influence maximization Information diffusion models are the foundation for studying information propagation and viral marketing. Kempe et al. [20] first propose two classical information diffusion models, named Independent Cascade (\({\mathsf {IC}}\)) and Linear Threshold (\({\mathsf {LT}}\)). Working on these models, they formulate the Influence Maximization (\(\mathsf {IM}\)) problem, which seeks k nodes that can influence the largest number of nodes in an OSN, and they devise a \((1-1/e)\)-approximation algorithm for this problem. Due to its great commercial value, a large number of works have focused on the \(\mathsf {IM}\) problem, proposing scalable and efficient algorithms [3, 7, 8, 48, 49] and studying \(\mathsf {IM}\) variations [2, 19, 28, 37, 43, 50]. Some works extend the \({\mathsf {IC}}\) model by incorporating time and topic to reflect real contexts in viral marketing. Chen et al. [9] introduce the Independent Cascade with Meeting events model by adding a time-delay aspect of influence diffusion to each link. A Continuous-time Independent Cascade model is proposed for influence estimation and maximization problems in time-sensitive contexts [14, 17]. Several works [2, 15, 29] consider topic-aware influence maximization with the purpose of maximizing the number of influenced users for a given topic query. In this problem, each edge has multiple information transmission probabilities that reflect the influence on different topics.

Misinformation detection Misinformation, fake news and rumors can be automatically detected by using text mining techniques on sequential microblog streams [5, 23, 31, 44]. For example, Qazvinian et al. [44] study rumor identification on Twitter by developing three categories of features to identify false tweets (unigrams, bigrams, and part-of-speech). Kwon et al. [23] introduce a time-series-fitting model based on the volume of tweets over time. Ma et al. [31] capture the temporal characteristics of the time series of a rumor's life-cycle to identify rumors in online social media. More recently, several deep neural models have been developed for automatic rumor detection [6, 32,33,34,35, 45].

Outbreaks occur in many kinds of networks, and a common strategy to detect them is to place monitors or sensors at some important nodes, for example to detect water contamination [22] and contagious outbreaks [10]. Besides, motivated by the fact that misinformation or rumors can be automatically detected through data mining and machine learning methods, some authors investigate the problem of placing monitors/sensors at some nodes to detect misinformation in an OSN [12, 53, 54]. Leskovec et al. investigate the problem of detecting outbreaks in a blog network under a budget constraint, and devise a \((1-1/e)/2\)-approximation algorithm for this problem [25]. Cui et al. [12] focus on selecting important nodes as sensors to predict outbreaks with a data-driven approach. Zhang et al. [53] investigate the Misinformation Detection (MD) problem within a limited budget under the \({\mathsf {IC}}\) model. They show that the MD problem can be viewed as the \(\mathsf {IM}\) problem when all nodes have the same probability of being a source of misinformation, but they still fail to handle the case in which nodes have different probabilities of being a source. The authors in [54] focus on the TCMD problem, which finds a minimum-size monitor set so that misinformation can be detected from all nodes in the network within a time constraint, and propose a heuristic algorithm for the general case. One drawback of these two studies is that the proposed algorithms do not provide any approximation guarantee.

Taking a new view of the above studies, in this work we aim to find a minimal-size set of nodes at which to place monitors so that the expected detection probability is at least a threshold \(\gamma \). Different from the Misinformation Detection problem [53], in our setting each node u is a source of misinformation with an arbitrary probability. Since no existing algorithm applies to our problem, we propose approximation algorithms that return near-optimal solutions with high probability.

3 Model and problem definition

In this section, we introduce the network model and the well-known Independent Cascade (\({\mathsf {IC}}\)) diffusion model [20]. We then formally define the Minimum Budget for Misinformation Detection (\({\mathsf {MBD}}\)) problem and present its inapproximability. The frequently used notations are summarized in Table 1.

Table 1 Table of symbols

3.1 Independent cascade model

Let \(G=(V, E)\) be a directed graph representing a social network with a node set V and a directed edge set E, \(|V|=n\) and \(|E|=m\). Let \(N_{in}(v)\) and \(N_{out}(v)\) be the sets of in-neighbors and out-neighbors of a node v, respectively.

In this model, each edge \(e=(u, v) \in E\) has a probability \(p(u, v) \in (0, 1)\) that represents the misinformation transmission from u to v. The diffusion process from a source set S happens in discrete steps \(t=0,1,2, \ldots \) as follows:

  • At step \(t=0\), all nodes in S are activated by the misinformation and the others are inactive.

  • At step \(t\ge 1\), each node u activated in a previous step has a single chance to activate each of its inactive out-neighbours v, succeeding with probability p(u, v). An activated node remains active until the end of the diffusion process.

  • The propagation process ends at step t if no new node is activated in that step.
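To make the process concrete, the following is a minimal C++ sketch of one \({\mathsf {IC}}\) diffusion simulation; the graph representation and all names are our own illustrative choices, not taken from the paper's implementation.

```cpp
// A minimal sketch of one IC diffusion simulation from a source set S.
// The Graph representation and names are illustrative only.
#include <queue>
#include <random>
#include <vector>

struct Arc { int to; double p; };                 // edge (u, v) with p(u, v)
using Graph = std::vector<std::vector<Arc>>;      // out-adjacency lists

// Returns the set of nodes active at the end of the diffusion process.
std::vector<int> simulateIC(const Graph& adj, const std::vector<int>& S,
                            std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<char> active(adj.size(), 0);
    std::queue<int> q;
    for (int u : S) { active[u] = 1; q.push(u); }  // step t = 0
    while (!q.empty()) {                           // steps t >= 1
        int u = q.front(); q.pop();
        for (const Arc& a : adj[u])                // a single chance per edge
            if (!active[a.to] && coin(rng) < a.p) {
                active[a.to] = 1;
                q.push(a.to);
            }
    }
    std::vector<int> result;
    for (int v = 0; v < (int)adj.size(); ++v)
        if (active[v]) result.push_back(v);
    return result;
}
```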

3.2 Problem definition

We adopt the Independent Cascade (\({\mathsf {IC}}\)) model to abstract misinformation diffusion in a social network. We denote by \(S\subseteq V\) the suspected set, i.e., the set of nodes that are likely to be sources of misinformation. Each node \(u \in S\) is a source of misinformation with probability \(\rho (u)\ge 0\).

In the \({\mathsf {IC}}\) model, the activations along edges are mutually independent. From the perspective of graph theory, a successful transmission from a user to its neighbor can be represented as the existence of the edge between them. Kempe et al. [20] show that the \({\mathsf {IC}}\) model is equivalent to reachability in a random graph g, called a live-edge graph or sample graph. Accordingly, we can generate a sample graph g from the original graph G, denoted by \(g \sim G\), by selecting each edge \(e=(u, v) \in E\) independently with probability p(u, v), and not selecting it with probability \(1- p(u, v)\). The probability that a realization g is generated from G is:

$$\begin{aligned} \Pr [g\sim G]=\prod _{(u, v) \in E(g)} p(u, v) \prod _{(u, v) \in E\setminus E(g)} (1-p(u, v)) \end{aligned}$$
(1)

In the above equation, E(g) is the edge set of g. The number of sample graphs is \(2^{|E|}\). If we place a monitor at node v, it will detect misinformation from any node that can reach it; it takes \(d_g(u, v)\) hops, the distance from u to v in g, to detect misinformation originating from u. For a node \(u \in S\) and a monitor set \(A \subseteq V\), the probability that A detects misinformation propagated from u is:

$$\begin{aligned} {{\mathbb {D}}}(A, u)=\sum _{g \sim G}\Pr [g \sim G] R(A,g, u) \end{aligned}$$
(2)

where

$$\begin{aligned} R(A, g, u)= {\left\{ \begin{array}{ll} 1, &{}\text{ if }\ d_g(u, A)<\infty , \\ 0, &{}\text{ Otherwise } \end{array}\right. } \end{aligned}$$
(3)

and \(d_g(u, A)=\min _{v\in A}d_g(u, v)\). Since the probability that \(u \in S\) is a source of misinformation is \(\rho (u)\), we define the detection function of A as follows:

$$\begin{aligned} {{\mathbb {D}}}(A)=\sum _{u \in S} \rho (u) \sum _{g \sim G}\Pr [g \sim G] R(A, g, u) \end{aligned}$$
(4)

In this work, we study Minimum Budget for Misinformation Detection problem (\({\mathsf {MBD}}\)) defined as follows:

Definition 1

(\({\mathsf {MBD}}\) problem) Given a graph \(G=(V, E)\) under the \({\mathsf {IC}}\) model, a suspected set \(S\subseteq V\) in which each node \(u\in S\) is a source of misinformation with probability \(\rho (u)\ge 0\), and a misinformation detection threshold \( \gamma > 0\), find a minimum-size set of nodes \(A \subseteq V\) at which to place monitors so that \({{\mathbb {D}}}(A) \ge \gamma \).

When all nodes have the same probability of being the misinformation source, Zhang et al. [53] show that the detection function of a set of nodes A equals the influence spread of A on the reverse graph. Since calculating influence spread is #P-hard [7], calculating the detection function is also #P-hard. Besides, we show the inapproximability of the problem in the following theorem.

Theorem 1

(Inapproximability) \({\mathsf {MBD}}\) cannot be approximated within a factor \((1-\epsilon )\ln n\) unless NP \(\subseteq DTIME(n^{O(\log \log n)})\).

Proof

We consider the decision version of \({\mathsf {MBD}}\), defined as follows: Given a graph \(G=(V, E)\), a suspected set \(S \subseteq V\), a threshold \(\gamma \), and a positive integer \(k>0\), is there a monitor set A of size k such that \({{\mathbb {D}}}(A)\ge \gamma \)?

We reduce from the Set Cover problem, defined as follows:

Definition 2

(Set Cover (SC) problem) Given a positive integer t, a universal set \({{\mathcal {U}}}= \{e_1, e_2,\ldots , e_M\}\) and a collection of subsets \({{\mathcal {S}}}= \{S_1, S_2,\ldots , S_N \}\), are there t subsets in \({{\mathcal {S}}}\) whose union is \({{\mathcal {U}}}\)?

Reduction Given an instance \({{\mathcal {I}}}=({{\mathcal {U}}}, {{\mathcal {S}}}, t)\) of the SC problem, we construct an instance \({{\mathcal {I}}}'=(G, S, \gamma , k)\) as follows. For each element \(e_i \in {{\mathcal {U}}}\), we create a node \(u_i\) and add it to S with \(\rho (u_i)=1\). For each subset \(S_j \in {{\mathcal {S}}}\), we add a node \(v_j\) to S with \(\rho (v_j)=1\), and add an edge \((u_i, v_j)\) with probability \(p(u_i, v_j)=1\) whenever \(e_i \in S_j\). For convenience, we denote the sets \(X=\{u_i, i=1, \ldots , M\}\) and \(Y=\{v_j, j=1, \ldots , N\}\). Finally, we set \(\gamma =M+k\) and \(k=t\). The reduction can clearly be done in time polynomial in the sizes of \({{\mathcal {I}}}\) and \({{\mathcal {I}}}'\).

Suppose that \({{\mathcal {I}}}\) has a solution \(S'\subseteq {{\mathcal {S}}}\). By our reduction, in \({{\mathcal {I}}}'\) we choose the monitor set \(A=\{v_j| S_j \in S'\}\). We have \({{\mathbb {D}}}(A)=M+t=\gamma \), which implies that A is a solution of \({{\mathcal {I}}}'\). Conversely, suppose that \(A \subseteq V\) is a solution of \({{\mathcal {I}}}'\), i.e., \({{\mathbb {D}}}(A) \ge \gamma =M+k\) and \(|A|= k\). Since a monitor at a node \(u_i \in X\) can only detect \(u_i\) itself, we may assume that A contains only nodes in Y. From \(|A|= k\), the monitors in A must detect all M nodes in X. Now, if we choose \(S'=\{S_j| v_j \in A \}\), then \(S'\) is a solution of \({{\mathcal {I}}}\).

Suppose that there is an algorithm which can approximate \({\mathsf {MBD}}\) within a ratio of \((1-\epsilon ) \ln n\) in polynomial time. By applying this algorithm to our reduction, we could approximate SC within a ratio of \((1-\epsilon ) \ln n\) in polynomial time. This contradicts the fact that SC admits no polynomial-time \((1-\epsilon ) \ln n\)-approximation for any \(\epsilon >0\) unless NP \(\subseteq DTIME(n^{O(\log \log n)})\) [16]. \(\square \)

4 Proposed algorithms

In this section, we propose three algorithms for the \({\mathsf {MBD}}\) problem: \({\mathsf {Greedy}}\), Sampling-based Misinformation Detection (\({\mathsf {SMD}}\)) and Importance Sampling-based Misinformation Detection (\({\mathsf {ISMD}}\)). \({\mathsf {Greedy}}\) provides an approximation ratio of \(1+ \ln (\gamma /\epsilon )\), but it cannot be applied even to medium-sized networks when using the Monte Carlo method to estimate the detection function, because of its high complexity. \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) are scalable algorithms with theoretical guarantees, obtained by developing a framework that selects a final solution from multiple candidate solutions. The main difference between these two algorithms is that \({\mathsf {SMD}}\) estimates \({{\mathbb {D}}}(\cdot )\) using the concept of detection samples, while \({\mathsf {ISMD}}\) uses importance detection samples instead. We also show that \({\mathsf {ISMD}}\) has lower complexity and requires fewer samples than \({\mathsf {SMD}}\).

4.1 Estimator of detection function

We introduce the concept of Detection Sampling (\({\mathsf {DS}}\)) set and use it to estimate \({{\mathbb {D}}}(\cdot )\).

Definition 3

(\({\mathsf {DS}}\) set) Given a graph \(G=(V, E)\) under the \({\mathsf {IC}}\) model, let \(\rho (S)=\sum _{u\in S} \rho (u)\). A \({\mathsf {DS}}\) set \(R_j\) is generated from G by:

  1. Picking a source node \(u \in V\) with probability \(\frac{\rho (u)}{\rho (S)}\).

  2. Generating a sample graph g from G, and returning \(R_j\) as the set of nodes reachable from u in g.

The meaning of the above definition is that each node in a \({\mathsf {DS}}\) set can detect misinformation spreading from u. The node u in the above definition is called the source of \(R_j\), denoted by \(src(R_j)=u\). We denote by \(\varOmega \) the probability space of all \({\mathsf {DS}}\) sets, in which the probability of generating a \({\mathsf {DS}}\) set \(R_j\) with source node u (denoted by \(R_j(u)\)) can be computed as follows:

$$\begin{aligned} \Pr [R_j(u) \sim \varOmega ]=\frac{\rho (u)}{\rho (S)} \cdot \sum _{g \sim G: R(R_j, g, u)=1} \Pr [g \sim G] \end{aligned}$$
(5)

If we generate multiple \({\mathsf {DS}}\) sets, the nodes that can detect misinformation from many other nodes will likely appear frequently among them. The role of a \({\mathsf {DS}}\) set is similar to that of the Reverse Reachable (RR) set in estimating the influence spread function [3, 36, 47,48,49]. We define a random variable \(X_j(A)\) as follows:

$$\begin{aligned} X_j(A) ={\left\{ \begin{array}{ll} 1, &{}\text{ If } R_j \cap A \ne \emptyset \\ 0, &{} \text{ Otherwise } \end{array}\right. } \end{aligned}$$
(6)

Similar to Lemma 2 in [3], Lemma 1 shows that we can use \(X_j(A)\) to estimate the function \({{\mathbb {D}}}(A)\).

Lemma 1

For any set of nodes \(A \subseteq V\), letting \(\rho (S)=\sum _{u \in S}\rho (u)\), we have:

$$\begin{aligned} {{\mathbb {D}}}(A)= \rho (S) \cdot {{\mathbb {E}}}[X_j(A)] \end{aligned}$$
(7)

Proof

Since the source node u is chosen with probability \(\frac{\rho (u)}{\rho (S)}\), we have:

$$\begin{aligned} {{\mathbb {D}}}(A)&=\sum _{u \in V } \rho (u) \sum _{g \sim G}\Pr [g\sim G] R(A, g, u) \end{aligned}$$
(8)
$$\begin{aligned}&= \sum _{g \sim G}\sum _{u \in V} \rho (u)R(A, g, u) \Pr [g\sim G] \end{aligned}$$
(9)
$$\begin{aligned}&=\sum _{g \sim G} \rho (S) \sum _{u \in V} R(A, g, u) \Pr [g\sim G] \frac{\rho (u)}{\rho (S)} \end{aligned}$$
(10)
$$\begin{aligned}&=\rho (S) \sum _{g \sim G} \sum _{u \in V} {\mathsf {Cov}}(R_j^g(u), A) \Pr [g\sim G] \Pr [u \text{ is } \text{ the } \text{ source } \text{ node}] \end{aligned}$$
(11)
$$\begin{aligned}&= \rho (S) \cdot {{\mathbb {E}}}[X_j(A)] \end{aligned}$$
(12)

where \(R_j^g(u)\) is the \({\mathsf {DS}}\) set corresponding to the sample graph g and the source node u, and \({\mathsf {Cov}}(R_j^g(u), A)=\min \{1, |A\cap R_j^g(u)|\}\) indicates whether A intersects \(R_j^g(u)\). \(\square \)

4.2 Greedy algorithm

We introduce the \({\mathsf {Greedy}}\) algorithm, which provides an approximation ratio of \(1+ \ln (\gamma /\epsilon )\) based on the monotone and submodular properties of the \({{\mathbb {D}}}(\cdot )\) function, i.e., for \(A \subseteq T \subseteq V\) and \(v \notin T\), \( {{\mathbb {D}}}(A\cup \{v\})- {{\mathbb {D}}}(A) \ge {{\mathbb {D}}}(T\cup \{v\})- {{\mathbb {D}}}(T) \).

Lemma 2

The function \({{\mathbb {D}}}(A)\) is monotone and submodular.

Proof

Rewriting Eq. (11), we have:

$$\begin{aligned} {{\mathbb {D}}}(A)&=\sum _{g \sim G} \sum _{u \in V} \Pr [g\sim G] \rho (u) {\mathsf {Cov}}(R_j^g(u), A) \end{aligned}$$
(13)

The above equation shows that the detection function \({{\mathbb {D}}}(A)\) is a weighted coverage function of a set cover system in which every \(R_j\) is an element of the ground set of all \({\mathsf {DS}}\) sets, each node in V corresponds to a subset (the \({\mathsf {DS}}\) sets containing it), and V is the collection of subsets. A node v covers the set \(R_j\) if v belongs to \(R_j\), and \(\Pr [g\sim G] \rho (u)\) is the weight of the element \(R_j^g(u)\). Since a weighted coverage function is monotone and submodular, \({{\mathbb {D}}}(A)\) has the same properties. \(\square \)

Lemma 2 helps us design a \((1+ \ln \frac{\gamma }{\epsilon })\)-approximation algorithm by applying the Greedy algorithm of [18] (Algorithm 1), where \(\epsilon \in (0, \gamma )\) is an input.

At the beginning, this algorithm initiates the solution A as an empty set. The main phase of the algorithm operates in multiple iterations (lines 2-5). In each iteration, it simply chooses a node u that provides the largest incremental detection value, defined as follows:

$$\begin{aligned} \delta (A, u)= \min ({{\mathbb {D}}}(A\cup \{u\}), \gamma )-{{\mathbb {D}}}(A) \end{aligned}$$
(14)

until \({{\mathbb {D}}}(A) \ge \gamma -\epsilon \). However, we cannot implement \({\mathsf {Greedy}}\) exactly even for small networks because calculating the detection function is #P-hard. To address this challenge, we can use Monte Carlo simulation to estimate the detection function. Let R be the maximum number of simulations needed to estimate \({{\mathbb {D}}}(\cdot )\); \({\mathsf {Greedy}}\) then takes O(Rnk) time, where k is the number of iterations of the algorithm.

[Algorithm 1: the \({\mathsf {Greedy}}\) algorithm]
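As a complement to the pseudocode, the following is a minimal C++ sketch of the greedy loop; estimateD is a stand-in for any estimator of \({{\mathbb {D}}}(\cdot )\) (e.g., Monte Carlo simulation, or the sampling-based estimators of Sects. 4.3-4.4) and is an assumption of this sketch, not part of the paper's code.

```cpp
// A minimal sketch of the Greedy loop (Algorithm 1). estimateD is an
// assumed oracle for the detection function.
#include <algorithm>
#include <functional>
#include <vector>

std::vector<int> greedyMBD(int n, double gamma, double eps,
        const std::function<double(const std::vector<int>&)>& estimateD) {
    std::vector<int> A;                  // current solution, initially empty
    std::vector<char> inA(n, 0);
    double cur = 0.0;                    // current (estimated) detection value
    while (cur < gamma - eps) {
        int best = -1; double bestVal = cur;
        for (int u = 0; u < n; ++u) {    // largest marginal gain, Eq. (14)
            if (inA[u]) continue;
            A.push_back(u);
            double val = std::min(estimateD(A), gamma);
            A.pop_back();
            if (val > bestVal) { bestVal = val; best = u; }
        }
        if (best < 0) break;             // no node improves the estimate
        A.push_back(best); inA[best] = 1; cur = bestVal;
    }
    return A;
}
```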

4.3 Sampling-based for misinformation detection algorithm

This algorithm combines two novel techniques: (1) generating a collection of \({\mathsf {DS}}\) sets large enough to estimate the detection function, by applying martingale theory, and (2) a new framework for generating candidate solutions and checking their quality to select the final solution. Denote by \( {\mathsf {Cov}}_{{{\mathcal {R}}}}(A)=\sum _{R_j \in {{\mathcal {R}}}}\min \{1,|A\cap R_j| \} \) the number of \({\mathsf {DS}}\) sets in \({{\mathcal {R}}}\) covered by A. From Lemma 1 we obtain an estimate of \({{\mathbb {D}}}(A)\) from \({{\mathcal {R}}}\) as follows:

$$\begin{aligned} {\hat{{\mathbb {D}}}}(A)= \frac{\rho (S)}{|{{\mathcal {R}}}|}{\mathsf {Cov}}_{{{\mathcal {R}}}}(A) \end{aligned}$$
(15)

Since \({\mathsf {Cov}}_{{{\mathcal {R}}}}(\cdot )\) is monotone and submodular, \({\hat{{\mathbb {D}}}}(\cdot )\) is also monotone and submodular.
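For concreteness, a minimal sketch of the estimator in Eq. (15) is given below; the function and variable names are illustrative.

```cpp
// A minimal sketch of Eq. (15): \hat{D}(A) equals rho(S)/|R| times the
// number of DS sets in the collection R that intersect A.
#include <unordered_set>
#include <vector>

double estimateDetection(const std::vector<std::vector<int>>& R,
                         const std::unordered_set<int>& A, double rhoS) {
    if (R.empty()) return 0.0;
    long long covered = 0;
    for (const auto& Rj : R)
        for (int v : Rj)
            if (A.count(v)) { ++covered; break; }   // min{1, |A ∩ R_j|}
    return rhoS * (double)covered / (double)R.size();
}
```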

Detection sampling algorithm We first devise an algorithm for generating a \({\mathsf {DS}}\) set, inspired by Breadth-First Search (BFS) and formally described as Algorithm 2. It first selects a source node u with probability \(\frac{\rho (u)}{\rho (S)}\) (line 1), then uses a queue Q to store the visited nodes and initiates a \({\mathsf {DS}}\) set \(R_j=\{u\}\). The rest of the algorithm operates in several iterations. In each iteration, it pops a node u from Q and adds u to \(R_j\), then selects each unvisited neighbor v with probability p(u, v) according to the live-edge model (line 8). If v is selected, it is put into Q; otherwise, the algorithm moves on. This process repeats until Q becomes empty.

[Algorithm 2: Detection Sampling]
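A minimal C++ sketch of Algorithm 2 is shown below, reusing the Graph and Arc types from the \({\mathsf {IC}}\) simulation sketch in Sect. 3.1; it assumes \(\rho (u)=0\) for nodes outside the suspected set S, so the source distribution can be given as a weight vector over all of V.

```cpp
// A minimal sketch of Algorithm 2. We assume rho[u] = 0 for nodes
// outside the suspected set S; Graph/Arc come from the earlier sketch.
#include <queue>
#include <random>
#include <vector>

std::vector<int> generateDS(const Graph& adj, const std::vector<double>& rho,
                            std::mt19937& rng) {
    // line 1: pick the source u with probability rho(u)/rho(S)
    std::discrete_distribution<int> pickSource(rho.begin(), rho.end());
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    int src = pickSource(rng);
    std::vector<char> visited(adj.size(), 0);
    std::vector<int> Rj;                       // the DS set under construction
    std::queue<int> q;
    visited[src] = 1; q.push(src);
    while (!q.empty()) {                       // BFS over live edges
        int u = q.front(); q.pop();
        Rj.push_back(u);
        for (const Arc& a : adj[u])            // line 8: keep edge w.p. p(u, v)
            if (!visited[a.to] && coin(rng) < a.p) {
                visited[a.to] = 1;
                q.push(a.to);
            }
    }
    return Rj;
}
```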

4.3.1 Description of \({\mathsf {SMD}}\) algorithm

[Algorithm 3: the \({\mathsf {SMD}}\) algorithm]

This algorithm first generates a set \({{\mathcal {R}}}\) containing \(N=\frac{(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln (n/\delta )\) \({\mathsf {DS}}\) sets, which ensures a \((\delta , \epsilon )\)-approximation for the optimal solution \(A^*\) (Lemma 4), i.e.,

$$\begin{aligned} \Pr [(1+\epsilon ){{\mathbb {D}}}(A^*) \ge {\hat{{\mathbb {D}}}}(A^*) \ge (1-\epsilon ) {{\mathbb {D}}}(A^*)] \ge 1-\delta \end{aligned}$$
(16)

The algorithm initiates an empty candidate solution A, and its main phase runs in several iterations (lines 4-18) to select the final solution from multiple candidate solutions.

Firstly, we observe that the candidate solutions may have different sizes and we do not know the size of the final solution in advance. Therefore, in each iteration i, we maintain a set \({{\mathcal {R}}}\) of size at least:

$$N_i(\delta , \epsilon ) = \frac{(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln (\left( {\begin{array}{c}n\\ i\end{array}}\right) /\delta )$$

which guarantees the bi-criteria approximation (Theorem 2) for a candidate solution A of size \(|A|=i\). The algorithm then adds to the candidate solution A the node u that provides the largest increment of the estimated detection function, \({\hat{\delta }}(A, u)= {\hat{{\mathbb {D}}}}(A \cup \{u\})- {\hat{{\mathbb {D}}}}(A)\) (line 5).

The algorithm then checks the quality of the candidate solution A in line 7. If A satisfies \({\hat{{\mathbb {D}}}}(A)\ge \gamma -\epsilon \gamma -\epsilon \), the algorithm returns A. If not, it checks whether the current number of samples is enough for the next iteration, in which the candidate solution size increases by 1 (line 12). If so, it moves to the next iteration. If not, it generates \(N_i-N\) more \({\mathsf {DS}}\) sets, adds them to \({{\mathcal {R}}}\) (line 13), and resets the current solution A to the empty set (line 15); the algorithm then constructs another candidate solution from scratch. The algorithm terminates only when it meets the condition \({\hat{{\mathbb {D}}}}(A) \ge \gamma -\epsilon \gamma -\epsilon \).
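Computing \(N_i(\delta ,\epsilon )\) requires \(\ln \left( {\begin{array}{c}n\\ i\end{array}}\right) \), which would overflow if the binomial coefficient were evaluated directly; a minimal sketch using lgamma is shown below (the function names are ours).

```cpp
// A minimal sketch of the sample bound N_i(delta, eps). ln C(n, i) is
// evaluated via lgamma to avoid overflowing the binomial coefficient.
#include <cmath>

double lnChoose(int n, int i) {                  // ln C(n, i)
    return std::lgamma(n + 1.0) - std::lgamma(i + 1.0)
         - std::lgamma(n - i + 1.0);
}

long long sampleBound(double rhoS, double gamma, double eps, double delta,
                      int n, int i) {
    double num = (2.0 + (2.0 / 3.0) * eps) * rhoS;
    double den = eps * eps * (gamma - eps * gamma);
    // ln( C(n, i) / delta ) = ln C(n, i) - ln delta
    return (long long)std::ceil((num / den) * (lnChoose(n, i) - std::log(delta)));
}
```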

4.3.2 Theoretical analysis

We now analyze the approximation guarantee of \({\mathsf {SMD}}\) using martingale theory [11], which has been used to study information propagation problems [36,37,38, 47, 49].

Definition 4

(Martingale) A sequence of random variables \(T_1, T_2, T_3, \ldots , T_l\) is a martingale if and only if \({{\mathbb {E}}}[|T_i|] < + \infty \) and \({{\mathbb {E}}}[T_i|T_1, T_2, T_3, \ldots , T_{i-1}]=T_{i-1}\) for any \(i=2,\ldots, l\).

The following concentration inequality [11] for martingales has a similar flavor to the Chernoff bounds.

Lemma 3

([11], Theorem 6.1) Let \(T_1, T_2, \ldots , T_l\) be a martingale satisfying

  1. \(|T_1| \le a\) and \(|T_j -T_{j-1}|\le a\) for \(2\le j\le l\);

  2. \({\mathsf {Var}}[T_j| T_1, T_2, \ldots , T_{j-1}]\le \sigma _j^2\) for \(1\le j \le l\),

where \({\mathsf {Var}}[\cdot ]\) denotes the variance of a random variable. Then, for any \(\lambda > 0\), we have:

$$\begin{aligned} \Pr [T_l - {{\mathbb {E}}}[T_l] \ge \lambda ] \le \exp \left( -\frac{\lambda ^2}{\frac{2}{3}a \lambda + 2\sum _{j=1}^l\sigma _j^2} \right) \end{aligned}$$
(17)

Given a collection of \({\mathsf {DS}}\) sets \({{\mathcal {R}}}\), we consider the sequence of random variables \(\{X_j(A)\}, j=1,\ldots , |{{\mathcal {R}}}|\). Since \(X_j(A) \in \{0, 1\}\), we define the random variables \(M_i=\sum _{j=1}^i (X_j(A) - \mu _X), \forall i \ge 1\), where \(\mu _X={{\mathbb {E}}}[X_j(A)]\). For the sequence \(M_1, M_2, \ldots , M_{|{{\mathcal {R}}}|}\), we have

$$\begin{aligned} {{\mathbb {E}}}[M_i| M_1, \ldots , M_{i-1}]= {{\mathbb {E}}}[M_{i-1}]+ {{\mathbb {E}}}[X_i(A) - \mu _X]={{\mathbb {E}}}[M_{i-1}], \forall i=2, \ldots , |{{\mathcal {R}}}| \end{aligned}$$

Hence, \(M_1, M_2, \ldots , M_{|{{\mathcal {R}}}|}\) form a martingale [11]. Applying Lemma 3 with \(a=1\), \({\mathsf {Var}}[M_j| M_1, M_2, \ldots , M_{j-1}]={\mathsf {Var}}[X_j(A)-\mu _X]={\mathsf {Var}}[X_j(A)] \le \mu _X\) (since \(X_j(A)\) is a Bernoulli random variable), \(l=|{{\mathcal {R}}}|\) and \(\lambda =\epsilon |{{\mathcal {R}}}|\mu _X\), we have

$$\begin{aligned} \Pr [ \sum _{j=1}^{|{{\mathcal {R}}}|} X_j(A) \ge (1+\epsilon ) \mu _X |{{\mathcal {R}}}|] \le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| \mu _X }{2+ \frac{2}{3} \epsilon } \right) \end{aligned}$$
(18)

Similarly, \(- M_1, \ldots , -M_i, \ldots , - M_{|{{\mathcal {R}}}|}\) also form a martingale, and applying Lemma 3 gives the following inequality:

$$\begin{aligned} \Pr [\sum _{j=1}^{|{{\mathcal {R}}}|} X_j(A) \le (1- \epsilon ) \mu _X |{{\mathcal {R}}}|] \le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| \mu _X }{2} \right) \end{aligned}$$
(19)

By applying the two inequalities above, we obtain the following lemma, which indicates an important property of the optimal solution.

Lemma 4

Given \(\epsilon , \delta \in (0,1)\). If \(|{{\mathcal {R}}}| \ge \frac{(2+ \frac{2}{3} \epsilon )\rho (S)}{\epsilon ^2(\gamma -\epsilon \gamma )} \ln \frac{1}{\delta }\), we have

$$\begin{aligned} \Pr [{\hat{{\mathbb {D}}}}(A^*) \ge \gamma -\epsilon \gamma ] \ge 1-\delta \end{aligned}$$
(20)

where \({\hat{{\mathbb {D}}}}(A^*)\) is calculated by Eq. (15) and \(A^*\) is an optimal solution.

Proof

Denote \(\mu ={{\mathbb {D}}}(A^*)/\rho (S)\) and \({\hat{\mu }}={\hat{{\mathbb {D}}}}(A^*)/\rho (S)\). Applying (19), we have

$$\begin{aligned} \Pr [{\hat{{\mathbb {D}}}}(A^*) \le \gamma -\epsilon \gamma ]&\le \Pr [{\hat{{\mathbb {D}}}}(A^*) \le (1-\epsilon ){{\mathbb {D}}}(A^*)] \end{aligned}$$
(21)
$$\begin{aligned}&= \Pr [{\hat{\mu }} \le (1-\epsilon )\mu ] \le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| \mu }{2} \right) \end{aligned}$$
(22)
$$\begin{aligned}&\le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| {\hat{\mu }} }{2(1-\epsilon )} \right) \end{aligned}$$
(23)
$$\begin{aligned}&\le \exp \left( - \frac{(2+\frac{2}{3} \epsilon ) {\hat{{\mathbb {D}}}}(A^*) }{2(1-\epsilon )(\gamma -\epsilon \gamma )} \ln \frac{1}{\delta } \right) \le \delta \end{aligned}$$
(24)

This completes the proof \(\square \)

Theorem 2

Given \(\epsilon , \delta \in (0, 1)\), Algorithm 3 returns a solution A with

  a) \(\Pr [|A|\le 1+ |A^*| \cdot \ln \frac{\gamma -\gamma \epsilon }{\epsilon }] \ge 1-\frac{\delta }{n} \).

  b) \(\Pr \left( {{\mathbb {D}}}(A) \ge \gamma \cdot \frac{1-\epsilon }{1+\epsilon }-\epsilon \right) \ge 1-\delta \).

Proof

We consider the case when the while-loop (lines 4-18) terminates. Assume that the algorithm returns a solution \(A=\{a_1, a_2, \ldots , a_p\}\) and denote \(A_i=\{a_1, a_2, \ldots , a_i\}, i\le p\). The number of samples is \(|{{\mathcal {R}}}| = \frac{(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln (\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) /\delta )\), where

$$\begin{aligned} i_{max} =\arg \max _{i: 1\ldots p} \frac{(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln \frac{\left( {\begin{array}{c}n\\ i\end{array}}\right) }{\delta } \end{aligned}$$
(25)

Proof a) Let \(B=\{b_1, b_2, \ldots , b_l\}\) be a set of minimum size satisfying \({\hat{{\mathbb {D}}}}(B) \ge \gamma -\epsilon \gamma \). Since \(a_i\) is chosen greedily to maximize the marginal gain over \(A_{i-1}\), we have:

$$\begin{aligned} \gamma - \epsilon \gamma - {\hat{{\mathbb {D}}}}(A_{i-1})&\le {\hat{{\mathbb {D}}}}(A_{i-1} \cup B ) -{\hat{{\mathbb {D}}}}(A_{i-1}) \\&= \sum _{j=1}^l \left( {\hat{{\mathbb {D}}}}(A_{i-1}\cup \{b_1, b_2, \ldots , b_j\})- {\hat{{\mathbb {D}}}}(A_{i-1}\cup \{b_1, b_2, \ldots , b_{j-1}\})\right) \\&\le \sum _{j=1}^l ({\hat{{\mathbb {D}}}}(A_{i-1} \cup \{b_j\})- {\hat{{\mathbb {D}}}}(A_{i-1})) \ \ (\text{ since } {\hat{{\mathbb {D}}}}(\cdot ) \text{ is } \text{ submodular}) \\&\le l \cdot ({\hat{{\mathbb {D}}}}(A_i)-{\hat{{\mathbb {D}}}}(A_{i-1})) \end{aligned}$$

Therefore,

$$\begin{aligned} \gamma - \epsilon \gamma - {\hat{{\mathbb {D}}}}(A_i)&\le \left( 1-\frac{1}{l}\right) \left( \gamma - \gamma \epsilon - {\hat{{\mathbb {D}}}}(A_{i-1})\right) \end{aligned}$$
(26)
$$\begin{aligned}&\le \left( 1-\frac{1}{l} \right) ^{i}(\gamma -\gamma \epsilon ) \le e^{-i/l} (\gamma -\gamma \epsilon ) \end{aligned}$$
(27)

Because the candidate solution A meets the condition in line 7, and by the definition of \(A_i\), we have \({\hat{{\mathbb {D}}}}(A_p) \ge \gamma -\epsilon \gamma -\epsilon \) and \({\hat{{\mathbb {D}}}}(A_{p-1}) < \gamma -\epsilon \gamma -\epsilon \). Combining with (27), we have:

$$\begin{aligned}&(\gamma - \gamma \epsilon ) e^{-(p-1)/l} \ge (\gamma - \gamma \epsilon ) - {\hat{{\mathbb {D}}}}(A_{p-1}) \ge (\gamma -\epsilon \gamma )-(\gamma -\epsilon \gamma -\epsilon ) = \epsilon \end{aligned}$$
(28)
$$\begin{aligned} \Longrightarrow \&p \le 1 + l \cdot \ln \frac{\gamma -\gamma \epsilon }{\epsilon } \end{aligned}$$
(29)

From Lemma 4, we have \( \Pr [{\hat{{\mathbb {D}}}}(A^*) \ge \gamma -\epsilon \gamma ] \ge 1-\delta / \left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) \). By the definition of B, the following event happens with probability at least \(1-\delta / \left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) \):

$$\begin{aligned} |A| \le 1+ |B|\ln \frac{\gamma -\gamma \epsilon }{\epsilon }\le 1+ |A^*| \ln \frac{\gamma -\gamma \epsilon }{\epsilon } \end{aligned}$$
(30)

Hence, \(\Pr [|A| \le 1+ |A^*| \cdot \ln \frac{\gamma -\gamma \epsilon }{\epsilon }] \ge 1-\delta /\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) \ge 1-\delta /n\).

Proof b) The solution A returned by Algorithm 3 satisfies \({\hat{{\mathbb {D}}}}(A) \ge \gamma -\epsilon \gamma -\epsilon \). Therefore,

$$\begin{aligned} \Pr \left( {{\mathbb {D}}}(A) \le \gamma \frac{1-\epsilon }{1+\epsilon }-\epsilon \right)&\le \Pr \left( {{\mathbb {D}}}(A) \le \frac{\gamma -\gamma \epsilon -\epsilon }{1+\epsilon }\right) \le \Pr \left( {{\mathbb {D}}}(A) \le \frac{{\hat{{\mathbb {D}}}}(A)}{1+\epsilon }\right) \\&= \Pr [{\hat{{\mathbb {D}}}}(A) \ge (1+\epsilon ){{\mathbb {D}}}(A)] = \Pr [{\hat{\mu }} \ge (1+\epsilon )\mu ] \\&\le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| \mu }{2+\frac{2}{3}\epsilon }\right) \le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| {\hat{\mu }} }{(2+\frac{2}{3}\epsilon ) (1+\epsilon )}\right) \\&=\exp \left( -\frac{\ln \frac{\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) }{\delta }}{(1+\epsilon )} \right) \le \frac{\delta }{\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) } \le \frac{\delta }{\left( {\begin{array}{c}n\\ p\end{array}}\right) } \end{aligned}$$

Since \(|A|=p\), there are at most \(\left( {\begin{array}{c}n\\ p\end{array}}\right) \) possible solutions. Therefore,

$$\begin{aligned} \Pr \left( \exists A, {{\mathbb {D}}}(A) \le \gamma \cdot \frac{1-\epsilon }{1+\epsilon } -\epsilon \right) \le \delta \end{aligned}$$
(31)

Hence,

$$\begin{aligned} \Pr \left( \forall A, {{\mathbb {D}}}(A) \ge \gamma \cdot \frac{1-\epsilon }{1+\epsilon }- \epsilon \right) \ge 1- \delta \end{aligned}$$
(32)

The proof is completed \(\square \)

Complexity of \({\mathsf {SMD}}\) algorithm The number of required samples in the worst case is \(\frac{(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln (\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) /\delta )=O(\rho (S)\ln (\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) /\delta )\epsilon ^{-2})\). Letting M denote the expected time for generating one sample, the time complexity of Algorithm 3 is

$$O\left( i_{max}\rho (S)\ln (\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) /\delta )\epsilon ^{-2}M \right) $$

4.4 Importance sampling-based misinformation detection algorithm

We next introduce the \({\mathsf {ISMD}}\) algorithm, an improved version of \({\mathsf {SMD}}\) that provides the same approximation guarantee as \({\mathsf {SMD}}\) but requires fewer samples. The main idea is to use an importance detection sample (\({\mathsf {IDS}}\)) concept to estimate the function \({{\mathbb {D}}}(\cdot )\) instead of \({\mathsf {DS}}\).

4.4.1 Importance detection sampling

We observe that \({\mathsf {DS}}\) sets containing only one node contribute insignificantly to the calculation of the detection function. Therefore, we only consider generating \({\mathsf {DS}}\) sets that contain more than one node, called \({\mathsf {IDS}}\) sets. We show that the detection function can be estimated through the \({\mathsf {IDS}}\) sets (Lemma 5).

We now describe how to generate an \({\mathsf {IDS}}\). For a source node u, \(\varOmega _u\) denotes the set of all \({\mathsf {DS}}\) sets that have a source u. We divide \(\varOmega _u\) into two components:

  • Trivial samples: the set contains only a source node u, called \(\varOmega _u^0\).

  • Importance samples: \(\varOmega _u^i = \varOmega _u \setminus \varOmega _u^0\).

For a source node u, let \(E_0\) be the event that none of the nodes in \(N_{out}(u)\) is activated by u. We have:

$$\begin{aligned} \Pr [E_0]=\prod _{v \in N_{out}(u)}(1-p(u, v)) \end{aligned}$$
(33)

The probability that at least one node in \(N_{out}(u)\) is influenced by u, which equals the probability of generating an importance detection sample from u, is:

$$\begin{aligned} \varphi (u)= 1- \Pr [E_0]=1- \prod _{v \in N_{out}(u)}(1-p(u, v)) \end{aligned}$$
(34)

To generate an \({\mathsf {IDS}}\) set, we sample from \(\varOmega _u^i\) according to the following analysis. Assume that \(N_{out}(u)=\{v_1, v_2, \ldots , v_{l(u)}\}\) with \(|N_{out}(u)|=l(u)\), and let \(E_i\) be the event that \(v_i\) is the first node in \(N_{out}(u)\) influenced by u. We have:

$$\begin{aligned} \Pr [E_i]= p(u, v_i) \cdot \prod _{j=1}^{i-1}(1-p(u, v_j)) \end{aligned}$$
(35)

By definition of \(E_i\), we have:

$$\begin{aligned} \sum _{i=1}^{l(u)}\Pr [E_i]/\varphi (u)=1 \ \ \ \text{ and }\ \ E_i \cap E_j =\emptyset \ \text{ for } i \ne j \end{aligned}$$
(36)

Let \(\varOmega _n\) denote the probability space of all \({\mathsf {IDS}}\) samples. The probability that an \({\mathsf {IDS}}\) set \(R_j\) with source node u is generated from \(\varOmega _n\) is:

$$\begin{aligned} \Pr [R_j(u) \sim \varOmega _n] =\frac{1}{\varphi (u)} \Pr [R_j(u) \sim \varOmega ] \end{aligned}$$
(37)

The probability that a node u is the source node of an \({\mathsf {IDS}}\) set \(R_j\) in \(\varOmega \) is \(\frac{\rho (u)}{\rho (S)} \varphi (u)\). Normalizing these values to form a probability distribution over all \({\mathsf {IDS}}\) samples, the probability that u is the source node of an \({\mathsf {IDS}}\) set \(R_j\) in \(\varOmega _n\) is:

$$\begin{aligned} \Pr [src(R_j)=u]= \frac{\rho (u) \varphi (u)}{\sum _{v \in V} \rho (v)\varphi (v)}=\frac{\rho (u) \varphi (u)}{\varPhi } \end{aligned}$$
(38)

where \(\varPhi =\sum _{v \in V} \rho (v)\varphi (v)\). For any \({\mathsf {IDS}}\) set \(R_j\), we have:

$$\begin{aligned} \Pr [R_j \sim \varOmega ]&= \sum _{u \in V} \Pr [u \text{ is } \text{ source } \text{ of } R_j \text{ in } \varOmega ] \Pr [R_j(u) \sim \varOmega ] \end{aligned}$$
(39)
$$\begin{aligned}&= \sum _{u \in V} \frac{\rho (u)}{\rho (S)} \cdot \varphi (u) \Pr [R_j(u) \sim \varOmega _n] \end{aligned}$$
(40)
$$\begin{aligned}&=\frac{\varPhi }{\rho (S)} \cdot \sum _{u \in V} \frac{\rho (u) \varphi (u)}{\varPhi } \Pr [R_j(u) \sim \varOmega _n] \end{aligned}$$
(41)
$$\begin{aligned}&= \frac{\varPhi }{\rho (S)} \cdot \sum _{u \in V} \Pr [u \text{ is } \text{ source } \text{ of } R_j \text{ in } \varOmega _n] \Pr [R_j(u) \sim \varOmega _n] \end{aligned}$$
(42)
$$\begin{aligned}&= \frac{\varPhi }{\rho (S)}\cdot \Pr [R_j \sim \varOmega _n] \end{aligned}$$
(43)

We define two random variables \(Z_j(A)\) and \(Y_j(A)\) as follows:

$$\begin{aligned} Z_j(A) ={\left\{ \begin{array}{ll} 1, \ \text{ If } R_j \cap A \ne \emptyset \\ 0, \text{ Otherwise } \end{array}\right. } \end{aligned}$$
(44)

And,

$$\begin{aligned} Y_j(A) =\frac{\ \varPhi \cdot Z_j(A)+ \sum _{v \in A}(1-\varphi (v))\rho (v)}{\rho (S)} \end{aligned}$$
(45)

We have \(Y_j(A) \in [Y_{min}, Y_{max}]\), where \(Y_{min}=\frac{\sum _{v \in A}(1-\varphi (v))\rho (v)}{\rho (S)}\) and \(Y_{max}=\frac{\varPhi + \sum _{v \in A}(1-\varphi (v))\rho (v)}{\rho (S)}\). We have the following lemma:

Lemma 5

For any set of nodes \(A \subseteq V\), we have:

$$\begin{aligned} {{\mathbb {D}}}(A)= \varPhi \cdot {{\mathbb {E}}}[Z_j(A)] + \sum _{v \in A}(1-\varphi (v))\rho (v)=\rho (S) \cdot {{\mathbb {E}}}[Y_j(A)] \end{aligned}$$
(46)

Proof

From Lemma 1, we have

$$\begin{aligned} {{\mathbb {D}}}(A)&= \rho (S) \cdot \sum _{R_j \in \varOmega } \Pr [R_j \sim \varOmega ] X_j(A) \end{aligned}$$
(47)
$$\begin{aligned}&= \rho (S) \cdot \left( \sum _{R_j \in \varOmega _0} \Pr [R_j \sim \varOmega ] X_j(A)+ \sum _{R_j \in \varOmega _n} \Pr [R_j \sim \varOmega ] X_j(A)\right) \end{aligned}$$
(48)

Since each \(R_j \in \varOmega _0\) contains only a source node, we have:

$$\Pr [R_j \sim \varOmega ]=\frac{\rho (u)}{\rho (S)}(1-\varphi (u))$$

where \(u=src(R_j)\). Substituting this into (48), we have:

$$\begin{aligned} {{\mathbb {D}}}(A)&= \rho (S) \sum _{u\in A} \frac{\rho (u)}{\rho (S)}(1-\varphi (u)) +\rho (S) \sum _{R_j \in \varOmega _n} \Pr [R_j \sim \varOmega ] {\mathsf {Cov}}(A, R_j) \\&= \sum _{u\in A} \rho (u)(1-\varphi (u))+ \rho (S) \sum _{R_j \in \varOmega _n} \frac{\varPhi }{\rho (S)} \Pr [R_j \sim \varOmega _n] {\mathsf {Cov}}(A, R_j) \\&= \sum _{u\in A} \rho (u)(1-\varphi (u))+ \varPhi \sum _{R_j \in \varOmega _n} \Pr [R_j \sim \varOmega _n] Z_j(A) \\&=\sum _{u\in A} \rho (u)(1-\varphi (u))+ \varPhi {{\mathbb {E}}}[Z_j(A)] \\&= \rho (S) \cdot {{\mathbb {E}}}[Y_j(A)]\ \ (\text{ Due } \text{ to } \text{ the } \text{ definition } \text{ of } Y_j(A)) \end{aligned}$$

which completes the proof. \(\square \)

From Lemma 5, we obtain another estimate of \({{\mathbb {D}}}(A)\) by utilizing a collection \({{\mathcal {R}}}\) of \({\mathsf {IDS}}\) sets:

$$\begin{aligned} {\hat{{{\mathbb {D}}}}}(A)= \frac{\varPhi }{|{{\mathcal {R}}}|}\sum _{R_j \in {{\mathcal {R}}}}{\mathsf {Cov}}(A, R_j) + \sum _{v \in A}(1-\varphi (v))\rho (v) \end{aligned}$$
(49)
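A minimal sketch of the estimator in Eq. (49), with illustrative names, is given below; it mirrors the \({\mathsf {DS}}\)-based estimator but adds the correction term for trivial samples.

```cpp
// A minimal sketch of Eq. (49): the coverage term over the IDS sets plus
// the correction term for trivial samples. Names are ours.
#include <unordered_set>
#include <vector>

double estimateDetectionIDS(const std::vector<std::vector<int>>& R,
                            const std::unordered_set<int>& A,
                            const std::vector<double>& rho,
                            const std::vector<double>& phi, double Phi) {
    long long covered = 0;
    for (const auto& Rj : R)
        for (int v : Rj)
            if (A.count(v)) { ++covered; break; }
    double correction = 0.0;             // sum_{v in A} (1 - phi(v)) rho(v)
    for (int v : A) correction += (1.0 - phi[v]) * rho[v];
    if (R.empty()) return correction;
    return Phi * (double)covered / (double)R.size() + correction;
}
```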

From the above analysis, we propose an Importance Detection Sampling algorithm to generate an \({\mathsf {IDS}}\) set by modifying Algorithm 2. The details of this algorithm are described in Algorithm 4.

The algorithm first selects a source of the \({\mathsf {IDS}}\) set with probability according to Eq. (38) (line 1). Then, it calculates the probabilities \(\Pr [E_i], i=1, \ldots , l(u)\) and selects the first activated out-neighbour \(v_i \in N_{out}(u)\) with probability \(\Pr [E_i]/\varphi (u)\) (line 3). This guarantees that at least one of the out-neighbors of u is selected. Similar to Algorithm 2, this algorithm also uses a queue Q to store the visited nodes and initiates an \({\mathsf {IDS}}\) set \(R_j=\{u\}\). At this point, \(Q=R_j=\{u, v_i\}\) (lines 4-5). The nodes \(v_1, v_2, \ldots , v_{i-1}\) are ignored, and the nodes \(v_{i+1}, \ldots , v_{l(u)}\) are then selected independently with probabilities \(p(u, v_j), j=i+1, \ldots, l(u)\) (line 7) and added to Q and \(R_j\) (line 9). The rest of this algorithm is identical to the while-loop (lines 3-14) of Algorithm 2, since \({\mathsf {IDS}}\) and \({\mathsf {DS}}\) generation coincide from this step onward.

[Algorithm 4: Importance Detection Sampling]
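The step that distinguishes Algorithm 4 from Algorithm 2 is the selection of the first activated out-neighbour; a minimal sketch of this step, reusing the Graph and Arc types from the earlier sketches, is given below. It assumes \(N_{out}(u)\) is non-empty, i.e., \(\varphi (u)>0\).

```cpp
// A minimal sketch of line 3 of Algorithm 4: pick the first activated
// out-neighbour v_i of the source u with probability Pr[E_i]/phi(u)
// (Eq. 35). Assumes phi(u) > 0, i.e., N_out(u) is non-empty.
#include <random>
#include <vector>

int pickFirstNeighbour(const Graph& adj, int u, std::mt19937& rng) {
    const std::vector<Arc>& out = adj[u];
    std::vector<double> w(out.size());
    double survive = 1.0;                  // prod_{j<i} (1 - p(u, v_j))
    for (std::size_t i = 0; i < out.size(); ++i) {
        w[i] = out[i].p * survive;         // Pr[E_i], Eq. (35)
        survive *= 1.0 - out[i].p;
    }
    // discrete_distribution normalizes by sum_i Pr[E_i] = phi(u)
    std::discrete_distribution<int> pick(w.begin(), w.end());
    return out[pick(rng)].to;              // v_i joins Q and R_j; the nodes
                                           // v_{i+1}, ..., v_{l(u)} are then
                                           // sampled as in Algorithm 2
}
```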

4.4.2 Description of \({\mathsf {ISMD}}\) algorithm

The details of \({\mathsf {ISMD}}\) are presented in Algorithm 5. This algorithm works similarly to the \({\mathsf {SMD}}\) algorithm; the main differences lie in two factors. Firstly, \({\mathsf {ISMD}}\) generates \({\mathsf {IDS}}\) sets instead of \({\mathsf {DS}}\) sets (line 2) and uses them to estimate the detection function by Eq. (49). Secondly, the number of required samples of \({\mathsf {ISMD}}\) in each iteration is lower than that of \({\mathsf {SMD}}\): \({\mathsf {ISMD}}\) needs \(\frac{q(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln (\left( {\begin{array}{c}n\\ i\end{array}}\right) /\delta )\) samples, which is smaller than the requirement of \({\mathsf {SMD}}\) by a factor of \(q<1\), where \(q=\varPhi /\rho (S)\).

[Algorithm 5: the \({\mathsf {ISMD}}\) algorithm]

4.4.3 Theoretical analysis

We show that \({\mathsf {ISMD}}\) has the same approximation guarantee as \({\mathsf {SMD}}\) but needs fewer samples. From this, we also show that the complexity of \({\mathsf {ISMD}}\) is lower than that of \({\mathsf {SMD}}\). Firstly, by applying Lemma 3, we have the following lemma.

Lemma 6

For any \(T=|{{\mathcal {R}}}|>0\) and \(\lambda >0\), let \(\mu \) be the mean of \(Y_j(A)\) and \({\hat{\mu }}=\frac{1}{T}\sum _{j=1}^T Y_j(A)\) its estimate. Let \(q=Y_{max}-Y_{min}= \frac{\varPhi }{\rho (S)}\). We have:

$$\begin{aligned}&\Pr \Big [\sum _{j=1}^T Y_j(A) - T \cdot \mu \ge \lambda \Big ] \le \exp \left( -\frac{\lambda ^2}{\frac{2}{3}q\lambda + 2 q \mu T}\right) \end{aligned}$$
(50)
$$\begin{aligned}&\Pr \Big [\sum _{j=1}^T Y_j(A) - T \cdot \mu \le - \lambda \Big ] \le \exp \left( -\frac{\lambda ^2}{ 2q\mu T}\right) \end{aligned}$$
(51)

Proof

For any set \(A\subseteq V\), since \(Y_j(A) \in [Y_{min}, Y_{max}]\), we have

$$\begin{aligned} {\mathsf {Var}}[Y_j(A)] \le ( \mu - Y_{min}) (Y_{max}- \mu ) \le (Y_{max}-Y_{min})\mu =q \cdot \mu \end{aligned}$$
(52)

Define the random variables \(M_i=\sum _{j=1}^i (Y_j(A) - \mu ), \forall i \ge 1\), where \(\mu ={{\mathbb {E}}}[Y_j(A)]\). We can easily show that \(M_1, M_2, \ldots \) form a martingale [11]. Applying Lemma 3 with \(a=q\), \({\mathsf {Var}}[M_j| M_1, M_2, \ldots , M_{j-1}]={\mathsf {Var}}[Y_j(A)-\mu ]={\mathsf {Var}}[Y_j(A)] \le q\mu \), \(l=T\) and \(\lambda =\epsilon T\mu \), we have

$$\begin{aligned} {\mathsf {Var}}[M_1]+ \sum _{j=2}^i {\mathsf {Var}}[M_j| M_1, M_2, \ldots , M_{j-1}] = \sum _{j=1}^T {\mathsf {Var}}[Y_j(A)]\le q \mu T \end{aligned}$$
(53)

Applying Lemma 3 with \(a=q\) and \(\sum _{j=1}^l \sigma _j^2 = T q \mu \), and substituting back into (17), we obtain (50). Similarly, \(-M_1,\ldots , -M_i, \ldots \) also form a martingale, and applying (17) we obtain (51). \(\square \)

Lemma 7

Given \(\epsilon , \delta \in (0,1)\), if \(|{{\mathcal {R}}}| \ge \frac{q(2+ \frac{2}{3} \epsilon )\rho (S)}{\epsilon ^2(\gamma -\epsilon \gamma )} \ln \frac{1}{\delta }\), then \( \Pr [{\hat{{\mathbb {D}}}}(A^*) \ge \gamma -\epsilon \gamma ] \ge 1-\delta \), where \({\hat{{\mathbb {D}}}}(A^*)\) is calculated by (49).

Proof

Applying Lemma 6 with \(\lambda =\epsilon T \mu \) and \(q=\frac{\varPhi }{\rho (S)}\), we have

$$\begin{aligned} \Pr [{\hat{{\mathbb {D}}}}(A^*) \le \gamma -\epsilon \gamma ]&\le \Pr [{\hat{{\mathbb {D}}}}(A^*) \le (1-\epsilon ){{\mathbb {D}}}(A^*)] \end{aligned}$$
(54)
$$\begin{aligned}&= \Pr [{\hat{\mu }} \le (1-\epsilon )\mu ] \le \exp \left( \frac{-\epsilon ^2 |{{\mathcal {R}}}| \mu }{2q} \right) \end{aligned}$$
(55)
$$\begin{aligned}&\le \exp \left( - \frac{(2+\frac{2}{3} \epsilon ) {\hat{{\mathbb {D}}}}(A^*) }{2(1-\epsilon )(\gamma -\epsilon \gamma )} \ln \frac{1}{\delta } \right) \le \delta \end{aligned}$$
(56)

The proof is completed. \(\square \)

Theorem 3

Given \(\epsilon , \delta \in (0, 1)\), Algorithm 5 returns a solution A satisfying:

  a) \(\Pr [|A|\le 1+ |A^*| \cdot \ln \frac{\gamma -\gamma \epsilon }{\epsilon }] \ge 1-\frac{\delta }{n} \).

  b) \(\Pr \left( {{\mathbb {D}}}(A) \ge \gamma \cdot \frac{1-\epsilon }{1+\epsilon }-\epsilon \right) \ge 1-\delta \).

The proof of Theorem 3 applies Lemma 7 and is similar to the proof of Theorem 2.

Complexity of \({\mathsf {ISMD}}\) algorithm Letting M denote the expected time for generating an \({\mathsf {IDS}}\) set, Algorithm 5 requires \(\frac{q(2+ \frac{2}{3} \epsilon )\rho (S) }{\epsilon ^2(\gamma -\epsilon \gamma )} \ln (\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) /\delta )\) samples in the worst case, and thus the complexity of Algorithm 5 is:

$$O\left( qi_{max}\rho (S)\ln (\left( {\begin{array}{c}n\\ i_{max}\end{array}}\right) /\delta )\epsilon ^{-2}M\right) $$

The \({\mathsf {ISMD}}\) and \({\mathsf {SMD}}\) algorithms provide the same theoretical guarantees; however, the sample complexity of \({\mathsf {ISMD}}\) is smaller than that of \({\mathsf {SMD}}\) by a factor of \(q<1\), so the running time of \({\mathsf {ISMD}}\) is lower than that of \({\mathsf {SMD}}\). This observation is consistent with the experimental results on real social networks in Sect. 5.

5 Experiment

In this section, we conduct comprehensive experiments to compare the performance of our proposed algorithms with state-of-the-art ones in three aspects: solution quality (size of the monitor set), running time, and memory usage.

5.1 Experimental settings

Datasets For comprehensive evaluation, we select a diverse set of five datasets of different sizes, described in Table 2.

Table 2 Datasets
  • Email-Eu-Core The network was generated using email data from a large European research institution. The e-mails only represent communication between institution members (the core); the dataset does not contain incoming or outgoing messages outside the institution.

  • Wiki-Vote The network contains all the Wikipedia voting data from the inception of Wikipedia until January 2008. Nodes in the network represent Wikipedia users, and a directed edge from node i to node j indicates that user i voted on user j.

  • CA-HepPh The Arxiv HEP-PH (High Energy Physics - Phenomenology) collaboration network is from the e-print arXiv.org and covers scientific collaborations between authors of papers submitted to the High Energy Physics - Phenomenology category. The data covers papers in the period from January 1993 to April 2003 (124 months).

  • CA-AstroPh The Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv.org and covers scientific collaborations between authors of papers submitted to the Astro Physics category. The data also covers papers in the period from January 1993 to April 2003 (124 months).

  • Email-Eu-All The network was generated using email data from a large European research institution. It contains anonymized information about all incoming and outgoing email of the institution for the period from October 2003 to May 2005 (18 months).

Algorithms compared We compare \({\mathsf {Greedy}}\), \({\mathsf {SMD}}\), and \({\mathsf {ISMD}}\) with \({\mathsf {OPIM}}\) [47], the state-of-the-art Reverse Reachable (RR) sampling algorithm for the Influence Maximization problem, and several common baselines for information diffusion problems [7, 20, 42, 53, 54]. For the baseline algorithms, we use the Monte Carlo method with 10,000 simulations to estimate the detection function. We run each algorithm 10 times and report the average results. Details of these algorithms are as follows:

  • \({\mathsf {OPIM}}\) [47]: The current state-of-the-art algorithm that uses the sampling technique to solve the \(\mathsf {IM}\) problem, where the number of seeds k is an input. Because of the similarity between \({\mathsf {DS}}\) and RR sets, we use \({\mathsf {OPIM}}\) for the \({\mathsf {MBD}}\) problem in comparison with our algorithms. Since our problem asks to minimize the number of monitor nodes, this algorithm cannot be applied directly; we therefore adapt it by performing a binary search on k in the interval [1, n], rather than starting at \(k = 1\) and incrementing. At each iteration of the binary search, \({\mathsf {OPIM}}\) runs with the value of k in question, until we find the minimum k such that the estimated value of \({{\mathbb {D}}}\) is at least \(\gamma \); there are at most \(\log _2 n\) iterations (a sketch of this adaptation is given after this list).

  • Degree: A heuristic based on node degree. We keep adding the highest-degree node until the detection function of the selected nodes exceeds \(\gamma \).

  • Random: We randomly select nodes until the detection function of the selected nodes exceeds \(\gamma \).

  • PageRank [40]: A link analysis algorithm to rank the importance of pages in a Web graph. We implement the power method with a damping factor of 0.85 and keep adding the highest-ranked node until the detection function exceeds \(\gamma \).
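The following is a minimal sketch of the binary-search adaptation described above; runOPIM is a hypothetical wrapper around the \({\mathsf {OPIM}}\) implementation that returns the estimated detection value of its k-seed solution. The search relies on this value being (approximately) non-decreasing in k, which holds since \({{\mathbb {D}}}(\cdot )\) is monotone.

```cpp
// A minimal sketch of the binary-search adaptation of OPIM. runOPIM(k)
// is an assumed wrapper returning the estimated detection value of the
// k-seed solution produced by OPIM.
#include <functional>

int minSeedsViaBinarySearch(int n, double gamma,
                            const std::function<double(int)>& runOPIM) {
    int lo = 1, hi = n, best = n;
    while (lo <= hi) {                     // at most log2(n) iterations
        int k = lo + (hi - lo) / 2;
        if (runOPIM(k) >= gamma) { best = k; hi = k - 1; }
        else                     { lo = k + 1; }
    }
    return best;
}
```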

Weight settings We use the Trivalency model [7, 20, 36, 55] to choose the edge weights. In this model, instead of assuming all nodes are equally influential, influence probabilities are drawn uniformly at random from a predetermined set of probabilities; here we use \(\{0.001, 0.01, 0.1\}\). The idea is that nodes whose (outgoing) influence probability is 0.001 can be thought of as low-influence nodes, with 0.01 corresponding to medium influence and 0.1 to high influence.
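A minimal sketch of this weight assignment is given below.

```cpp
// A minimal sketch of the Trivalency weight setting: each edge
// probability is drawn uniformly at random from {0.001, 0.01, 0.1}.
#include <random>

double trivalencyWeight(std::mt19937& rng) {
    static const double probs[3] = {0.001, 0.01, 0.1};
    std::uniform_int_distribution<int> pick(0, 2);
    return probs[pick(rng)];
}
```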

Parameters In all experiments, we keep \(\delta =1/n\) as a general setting [36, 47,48,49]. Suspected nodes are randomly chosen with size n/2, and the probability \(\rho (u)\) is chosen uniformly at random in [0, 1]. We choose \(\epsilon \) and \(\gamma \) depending on the size of the network. Denote by \(\varPsi =\gamma /\rho (S)\) the ratio reflecting the relation between \(\gamma \) and \(\rho (S)\). The values of these parameters are given in Table 3. The running time of each algorithm is limited to 24 hours.

Table 3 Values of \(\varPsi \) and \(\epsilon \) for each network

Environment All our experiments are carried out on a Linux machine with 2 x Intel(R) Xeon(R) E5-2697 v4 CPUs @ 2.30GHz and 8 x 16 GB DIMM ECC DDR4 @ 2400MHz memory. Our implementation is written in C++ and compiled with GCC 4.7. We use the OpenMP library for parallel programming.

Fig. 1 Size of the solutions returned by the algorithms

5.2 Experiment results

Solution quality We first compare the solution quality of the algorithms, measured by the size of the monitor set. The results are presented in Fig. 1, where a better algorithm returns a smaller monitor set. We observe that \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) have the same performance in all cases and outperform the other algorithms by a large margin. The larger the value of \(\varPsi \), the greater the gap between our proposed algorithms and the others. Specifically, for the same value of \(\varPsi \), \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) are up to 3.9 times better than \({\mathsf {OPIM}}\) and 2.3 times better than \({\mathsf {Greedy}}\). Our algorithms are also several times better than the baseline methods. This shows that the proposed framework behind \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) is more efficient than the other algorithms: it not only selects a smaller set of nodes but also ensures the approximation guarantees of the solutions. \({\mathsf {OPIM}}\) selects too many vertices because the binary-search framework may not work well in the setting of the \({\mathsf {MBD}}\) problem. In addition, estimating the detection function via the \({\mathsf {DS}}\) and \({\mathsf {IDS}}\) concepts gives better and more efficient results than the Monte Carlo simulation used in the \({\mathsf {Greedy}}\) algorithm.

Running time Figure 2 shows the running time of the tested algorithms. Overall, both \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) significantly outperform the rest of the algorithms in terms of running time. Our algorithms are faster than \({\mathsf {OPIM}}\) on most networks (up to 1.5 times faster); \({\mathsf {OPIM}}\) is faster only on the Email-Eu-Core network. This is because \({\mathsf {OPIM}}\) spends a long time on the binary search to obtain a good solution. Compared with \({\mathsf {Greedy}}\), \({\mathsf {SMD}}\) is up to 10.2 times faster and \({\mathsf {ISMD}}\) is up to 12.4 times faster. For large networks such as Email-Eu-All and CA-AstroPh, \({\mathsf {Greedy}}\) cannot finish within the time limit, while \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) still work and give good results. This indicates that estimating the detection function via \({\mathsf {DS}}\) and \({\mathsf {IDS}}\) is faster than the traditional Monte Carlo simulation of \({\mathsf {Greedy}}\). Comparing \({\mathsf {SMD}}\) with \({\mathsf {ISMD}}\), the average running time of \({\mathsf {ISMD}}\) is up to 1.4 times faster, mainly because \({\mathsf {ISMD}}\) requires fewer samples than \({\mathsf {SMD}}\). Unsurprisingly, the baseline algorithms have small running times since they are simple heuristics with low complexity.

Fig. 2 Running time of the algorithms

Memory usage and number of samples The memory usage of the \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) algorithms is shown in Table 4, and the number of samples they generate is shown in Fig. 3. We do not report the memory usage of \({\mathsf {Greedy}}\) and the baseline algorithms because their memory usage is small and fixed regardless of \(\varPsi \). \({\mathsf {SMD}}\), \({\mathsf {ISMD}}\) and \({\mathsf {OPIM}}\) consume more memory than the others because they must store the generated samples. The results show that \({\mathsf {ISMD}}\) requires the fewest samples and the least memory: the number of required samples of \({\mathsf {ISMD}}\) is up to 5.14 and 8.6 times smaller than that of \({\mathsf {SMD}}\) and \({\mathsf {OPIM}}\), respectively. These results confirm our theoretical analysis in Sect. 4 that the sample complexity of \({\mathsf {ISMD}}\) is less than that of \({\mathsf {SMD}}\) by a factor of \(q<1\). \({\mathsf {OPIM}}\) needs more samples than our algorithms because it does not reuse samples generated in previous steps. Accordingly, the memory usage of \({\mathsf {ISMD}}\) is lower than that of \({\mathsf {SMD}}\) and \({\mathsf {OPIM}}\); however, the gap between \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) is negligible because an \({\mathsf {IDS}}\) sample contains more nodes than a \({\mathsf {DS}}\) sample, so more memory is needed to store each one. These results, together with the monitor-set sizes and running times, clearly show the superiority and efficiency of \({\mathsf {ISMD}}\) compared with \({\mathsf {SMD}}\) and \({\mathsf {OPIM}}\).

Fig. 3 Number of samples generated by \({\mathsf {SMD}}\), \({\mathsf {ISMD}}\) and \({\mathsf {OPIM}}\)

Table 4 Memory usage (\(\times \) 1000 MB) of \({\mathsf {SMD}}\), \({\mathsf {ISMD}}\) and \({\mathsf {OPIM}}\)

6 Conclusion

In this paper, we propose the \({\mathsf {MBD}}\) problem, which aims at finding the smallest set of nodes at which to place monitors in a social network to detect misinformation from suspected nodes, so that the expected detection function is at least a threshold \(\gamma >0\). Besides showing the challenges of solving \({\mathsf {MBD}}\), we propose three algorithms, Greedy, \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\), in which \({\mathsf {SMD}}\) and \({\mathsf {ISMD}}\) are randomized approximation algorithms that outperform the other algorithms. In the future, we will further improve the running time of these algorithms to make them applicable to billion-scale networks.