1 Introduction

We are all familiar with the existence of people who believe that the Earth is flat [1], or those who, without solid scientific evidence, firmly believe that we have been visited by extraterrestrial beings [2]. Furthermore, we know that in an era of social networks like Facebook, Instagram, Twitter, Twitch or YouTube, ideas of this type propagate more easily [3, 4]. Most of us probably do not mind these conspiracy theories. After all, it is a matter of beliefs: there is no amount of scientific evidence that could help to change the mind of a believer.

However, conspiracy theories may sometimes become dangerous. As an example, many governments and non-governmental organizations around the world strongly fought against anti-vaccination movements in the midst of the worldwide COVID-19 pandemic [5,6,7,8,9,10]. Although education and publicity campaigns based on verified information may help to oppose the spread of misinformation on the much needed vaccines, as in the case of the flat-earthers, believers are difficult to convince even in the face of abundant scientific data.

Some tech companies that run social networking applications, such as Facebook [11, 12] and Google [13], have announced several measures to combat false news. There is also a large body of research in this area. However, most of the scientific literature and company-based initiatives focus on either the automatic or man-in-the-loop detection of false news in social media [14,15,16,17]. In this work, in contrast, we focus on a different aspect of misinformation: the network that facilitates its spread.

At the beginning of the year 2021, we witnessed a very important event: a large social network, Twitter, banned the account of the United States president [18, 19]. In spite of the political content of this measure, it points to an alternative in the combat against misinformation and false news: the ban of misinformation super-spreaders. ‘Super-spreader’ is a term that has been introduced, at least for the general public, in the last few years in relation to the COVID-19 pandemic. A super-spreader is a person with a high viral load who comes into contact with a large number of people, for example, at parties or other social gatherings (weddings or religious events). By analogy, the concept has been transferred to the realm of information. Indeed, some presidents of large countries have recently been called super-spreaders of disinformation [20, 21]. The concept has caught on and is becoming part of common language in the media [22, 23].

As a matter of fact, the idea of super-spreader is not new. It is related to the well-known concept of an opinion leader or ‘influential’ [24]. Opinion leaders are defined as ‘the individuals who are likely to influence other persons in their immediate environment’ [24]. As it is clearly explained in Watts and Dodds [24], an opinion leader is not necessarily a leader, such as the president of a country. A simple example that comes to mind is that of a company that wants to sell plastic containers. Opinion leaders in this case can be certain well-connected housewives, and not a high-profile person. In this sense, the concept of an influential is different from the nowadays common concept of ‘influencer’ [25,26,27,28,29], that is, someone ‘who built a large network of followers, and are regarded as trusted tastemakers in one or several niches’ [28].

Super-spreaders, influentials or opinion leaders, and influencers are all related, but slightly different concepts. A common factor in all of them is that of the network of connections. Super-spreaders ‘spread’ misinformation or false news because they have a large number of connections in propitious environments (like those who spread COVID-19 infections at large gatherings in closed rooms). Influentials can be characterized by their special place in a network of contacts [24, 30,31,32,33]. Influencers are defined by their large network of ‘followers’ [28]. All in all, the network topology is at the core of the definition of these types of characters relevant in opinion formation dynamics.

A careful reading of the previous lines reveals a detail which is of the utmost relevance. Although the topology of connections in a social network is important in the definition of either a super-spreader or an influencer, it is not the only aspect that characterizes them. For example, a super-spreader not only needs a large number of connections, but also a propitious ‘environment.’ Besides being followed by a large number of people, an influencer needs to be ‘regarded as a trusted tastemaker’ [28]. In summary, there is an attribute of the super-spreader or the influencer that goes beyond their place in the social network.

The main goal of this work is to investigate whether there may exist attributes of individuals that can be more relevant than their place in the network topology when evaluating the importance of their influence. If there are such attributes, they can help to detect and combat super-spreaders, which may be more efficient than fighting the fake news they divulge. A related idea was treated in Andrade et al. [34], who studied how to efficiently choose which elements of a criminal network should be targeted to cause the largest damage. It was found that not only the connections in the network were important, but also attributes of the criminals. In a similar manner, we propose to identify which news-spreaders (influencers) are the most convenient to combat in order to cause the greatest damage to a misinformation network.

A misinformation network is different from a criminal network. There is a large literature on opinion formation and its relation to a network of contacts [24, 35,36,37,38,39,40,41,42,43,44,45,46]. Oftentimes, fake news is associated with the idea of a spreading virus. Indeed, it is common to speak of a false news item becoming ‘viral.’ A classical model for the propagation of a disease is known as the susceptible–infected–recovered (SIR) model [24, 47, 48]. The basic idea of this model is to classify people into three categories: ‘susceptible,’ as they can become infected; ‘infected,’ as they are sick and can infect others; and ‘recovered,’ as they were but are no longer sick. Susceptible individuals can be infected if they come into contact with infected individuals. There is a probability of infection which may depend on several factors, but mainly on the type of contact (for example, how long a susceptible individual was in contact with an infected person). Sick (infected) people recover at a constant mean rate.

It is most interesting to study the behavior of the SIR model in the context of a social network [24]. There have been several proposals on how the probability of contagion is related to the type of connection between a susceptible individual and an infected person. During the COVID-19 pandemic, we have grown used to terms such as ‘close contact.’ The closeness of the contact is an attribute of the link between the two individuals. However, we argue that the contagion probability also depends on characteristics of the infected and susceptible individuals beyond the properties of the connection. As in the example of presidents of large countries as super-spreaders of (dis)information, there are attributes of the individuals themselves which are relevant. Indeed, Katz [49] already recognized three dimensions related to an influential or opinion leader (we follow Nisbet and Kotcher [50]): (i) ‘who one is’: certain personality traits; (ii) ‘what one knows’: expertise about a particular subject; and (iii) ‘whom one knows’: number of contacts. While the last dimension can be related to topological properties of the network of contacts, the first two are connected to personal attributes of the particular node. In relation to dimension (i), we can mention, for example, personality strength [51] and communication abilities [52]. Authoritative individuals are not always characterized by their knowledge (dimension (ii)), but also by their social standing in a given community [53, 54]. Another important attribute that has recently been studied is personal motivation [55, 56]. For more information on individual attributes that may be relevant, the interested reader is referred to the works of Valente and Pumpuang [57] and Bamakan et al. [58] and references therein. In this work, we study news propagation in a Twitter network, where the number of mentions is used as a node attribute.

Our aim is to find a way of identifying the misinformation super-spreaders (influencers) that are the most efficient to combat in order to cause the greatest damage to the propagation of misinformation. In order to evaluate this, we propose a novel approach to the analysis of the propagation of fake news which, on the basis of a variation of the SIR model, takes into account both the topology of the social network and the attributes of the nodes.

The remainder of the paper is organized as follows. Section 2 presents a review of some of the most common (network-structure related) properties of nodes considered in the literature. In Sect. 3, we put forth a model of information diffusion that accounts for the influence power of the agents. We also discuss the relation of our model to the literature on complex contagions. Section 4 studies the behavior of the model and the impact of each of its parameters by means of simulations on both synthetic and real-world networks. Finally, we close the paper with some conclusions in Sect. 5.

2 Topological properties

The relevance of nodes is usually appraised through different centrality metrics which only consider their network-structure related characteristics. In this section, we review some of the most common centrality metrics. However, as we shall show in Sect. 4, they are not always appropriate in the context of our proposed information diffusion model.

Let us consider a social network \(G = (V,E)\) where V is the set of vertices or nodes and \(E\subseteq V\times V\) is the set of edges between the nodes. For the sake of clarity, in what follows we shall assume that all nodes are uniquely identified by a natural number in \(\{1,2,\ldots ,n\}\), where n is the number of vertices in the network. Edges may be either directed or undirected, although for our application directed edges are usually more appropriate. Indeed, an online network such as Instagram has many important influencers that are ‘followed’ by many users whom they do not necessarily follow back. In the context of a graph, following the flow of information, we call the nodes that a given node follows its predecessors and its followers its successors. We must note that since \(E\subseteq V\times V\), only one-to-one interactions are considered in our modeling approach. However, higher-order interactions can be of great interest in the propagation of information or diseases [59, 60].

The simplest node centrality measure is its degree, i.e., the number of links to other nodes. In a directed network, we can distinguish between in (\(d^\textrm{in}\)) and out degree (\(d^\textrm{out}\)) depending on whether we account for incoming or outgoing links, respectively. In the context of information propagation, \(d^\textrm{out}\) appears to be more adequate as it refers to the news-spreading capacity of the node.
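As a minimal sketch of the two degree definitions, the following Python function counts incoming and outgoing links from a list of directed edges. The function name `degrees` and the edge-list representation are illustrative assumptions; an edge `(i, j)` is taken to be a link from node i to node j, so that `d_out[i]` counts the successors of i.

```python
def degrees(n, edges):
    """Return (d_in, d_out) for nodes 1..n given directed edges (i, j).

    Assumption: an edge (i, j) is a link from i to j, i.e., j is a
    successor of i and i is a predecessor of j.
    """
    d_in = {v: 0 for v in range(1, n + 1)}
    d_out = {v: 0 for v in range(1, n + 1)}
    for i, j in edges:
        d_out[i] += 1  # one more outgoing link for the source
        d_in[j] += 1   # one more incoming link for the target
    return d_in, d_out


# Node 1 links to nodes 2 and 3, and node 2 links to node 3.
d_in, d_out = degrees(3, [(1, 2), (1, 3), (2, 3)])
```

In this toy network node 1 has the highest out degree and node 3 the highest in degree.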

A path from node i to j in a network consists of a sequence of edges \((i_{k-1},i_k)\in E\), for \(k=1,2,\ldots , p\), where \(i_0=i\) and \(i_p=j\). A geodesic path from node i to node j is a path between them with the smallest number of edges in the sequence and this smallest number is called geodesic distance between i and j and denoted by \(d_{ij}\). For a given node i, let \(V_i\) be the set of nodes \(j\ne i\) in the network for which there exists a path from i to j. Let \(n_i\) be the number of elements in \(V_i\). Closeness centrality [61] measures the reciprocal of the mean distance from a node to other nodes in the network that it can reach by a path:

$$\begin{aligned} c_{cl}(i) = \frac{n_i}{\sum _{j\in V_i} d_{ij}}. \end{aligned}$$
(1)

We must remark that distance can be measured in other ways, e.g., incorporating edge-related properties. However, in the context of the model presented in the following section, we consider the geodesic distance. Intuitively, we may expect that the closer a node is to all other nodes, the faster a piece of news will reach all nodes that can be reached from that node.
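The closeness centrality of Eq. (1) can be sketched with a breadth-first search, which yields geodesic distances in an unweighted directed network. The function name `closeness`, the successor-list representation, and the convention of returning 0 for a node that reaches no other node are assumptions made for this illustration.

```python
from collections import deque


def closeness(successors, i):
    """Closeness of node i: n_i divided by the sum of geodesic distances
    from i to the n_i nodes it can reach.

    `successors` maps each node to an iterable of its successors.
    Assumption: a node that reaches no other node gets closeness 0.
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:  # breadth-first search gives geodesic distances
        u = queue.popleft()
        for v in successors.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    n_i = len(dist) - 1        # reachable nodes, excluding i itself
    total = sum(dist.values())  # dist[i] = 0 does not contribute
    return n_i / total if n_i > 0 else 0.0


# Directed path 1 -> 2 -> 3: node 1 reaches node 2 at distance 1
# and node 3 at distance 2.
chain = {1: [2], 2: [3], 3: []}
c1 = closeness(chain, 1)
```

For the path above, node 1 reaches two nodes with total distance 3, node 2 reaches one node at distance 1, and node 3 reaches none.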

While the degree only considers how many nodes can be reached in a single hop, the rapid spread of news also depends on the number of nodes connected by the first layer of neighbors. In this sense, eigenvector centrality [62] appears to be an adequate measure as the relevance of the node is proportional to the sum of the centralities of its neighbors. Let \(\textbf{A}\) be the adjacency matrix of the network G, i.e., a matrix such that \(\textbf{A}_{ij} = 1\) if there is a link from node i to node j, and \(\textbf{A}_{ij} = 0\) otherwise. The eigenvector centrality is defined by the vector that satisfies

$$\begin{aligned} \lambda _1\textbf{c}_{eg}^T = \textbf{c}_{eg}^T\textbf{A}, \end{aligned}$$
(2)

where \(\lambda _1\) is the largest left eigenvalue of matrix \(\textbf{A}\) and the eigenvector centrality of node i, \(c_{eg}(i)\), is given by the ith element of vector \(\textbf{c}_{eg}\). Observe that we are using left eigenvectors as the centrality of a node depends on the centralities of its successors, which is more relevant in the news-spreading context.

A measure related to eigenvector centrality is PageRank [63], originally devised to evaluate the importance of webpages. In terms of the adjacency matrix, we can write the vector of PageRank centralities as [61]

$$\begin{aligned} \textbf{c}_{pr}^T = (1-\alpha ) \textbf{1}^T\left( \textbf{I}-\alpha \textbf{A}\textbf{D}\right) ^{-1}, \end{aligned}$$
(3)

where \(\textbf{I}\) is the \(n\times n\) identity matrix, n is the number of nodes in the network, and \(\textbf{D}\) is a diagonal matrix such that \(\textbf{D}_{ii} = \max \{1,d^\textrm{out}(i)\}\). The parameter \(\alpha \in (0,1)\) is conventionally assumed equal to 0.85.

Betweenness [64, 65] is a centrality measure which quantifies the importance of a node for the flow of information in a network. In particular, the betweenness centrality of node i is defined by

$$\begin{aligned} c_{be}(i) = \sum \limits _{j,k \ne i} \frac{\# \text { of shortest paths from } j \text { to } k \text { that go through } i}{\# \text { of shortest paths from } j \text { to } k}. \end{aligned}$$
(4)
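Equation (4) can be evaluated without enumerating paths. The sketch below uses Brandes' accumulation scheme, a standard technique that is our choice for illustration and is not prescribed by the text. It assumes an unweighted directed network given as a successor dictionary in which every node appears as a key; the function name `betweenness` is hypothetical.

```python
from collections import deque


def betweenness(successors):
    """Betweenness of every node over ordered pairs (j, k), j, k != i.

    Assumption: `successors` has one key per node, and every listed
    successor is itself a key.
    """
    nodes = list(successors)
    c_be = {v: 0.0 for v in nodes}
    for s in nodes:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in nodes}    # shortest-path predecessors
        sigma = {v: 0 for v in nodes}     # number of shortest paths from s
        dist = {v: -1 for v in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in successors[u]:
                if dist[v] < 0:           # first time v is reached
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in nodes}   # dependency of s on each node
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                c_be[w] += delta[w]
    return c_be


# Diamond: two shortest paths from node 1 to node 4, via 2 and via 3.
diamond = {1: [2, 3], 2: [4], 3: [4], 4: []}
scores = betweenness(diamond)
```

In the diamond network, nodes 2 and 3 each carry half of the shortest paths from node 1 to node 4, so each gets a betweenness of 0.5.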

Although there are many other centrality measures, we shall focus only on these ones, as they are the most common. For metrics with application to information diffusion, we refer the interested reader to, e.g., Kitsak et al. [30], Pei and Makse  [31], Pei et al.  [32], and Taha [33].

3 Proposed model

We introduce a new information diffusion model which is based on the well-known susceptible–infected–recovered (SIR) contagion model [24, 47, 48]. In particular, we consider that each node can be in any of three states:

  • Susceptible (S): it may or may not have received the information, but the vertex still does not believe in it.

  • Infected (I): it is aware and believes the information, but it still does not propagate it to its connections.

  • Spreader (P): it actively propagates the news to all of its successors in the network.

The distinction between an infected and a spreader node has already been proposed in the literature of information diffusion (see, e.g., Xiong et al. [66]). Note that we are not considering the possibility of a recovery, i.e., in this model an infected vertex cannot return to the susceptible state. This implies that we are not allowing an opinion change; that is, once a node is infected and believes some information, it cannot disbelieve it in the future. Even though this choice might seem unrealistic, it simplifies the analysis. Thus, we leave modeling opinion changes as a subject of future work.

Let us consider an online social network such as Instagram or Twitter. A spreader posts a comment or shares another post commenting on a piece of news. When will its followers become aware of the post? The answer to this question is complex, as it depends on many factors such as details of the online social network (e.g., the order in which posts are presented), habits of the successors, etc. For this reason, in this paper we assume, as a first approximation, a constant mean rate of reads of the news. In particular, the time between the moment a spreader posts a message and a follower reads it is modeled by an exponential random variable with constant parameter \(\lambda \).

A person may have the potential to be very influential. For example, he/she may be very charismatic, a great speaker or writer, or an excellent producer of memes. We characterize the spreading ability of node i by the parameter \(\beta _i^{{\max }}\in \mathbb {R}^{\ge 0}\). However, that particular person may not use his/her full potential; for example, this may depend on his/her current involvement with the subject. The influence capacity of node i at time t is modeled by \(\beta _i(t)\in \mathbb {R}^{\ge 0}\), which is always smaller than or equal to \(\beta _i^{{\max }}\). The actual current influence capacity of node i is determined by the influence power of those who have contacted it:

$$\begin{aligned} \beta _i(t) = \beta _i^{{\max }}\times \Psi _i\left( \sum \limits _{t_{ji}\le t} \beta _j\left( t_{ji}\right) \right) , \end{aligned}$$
(5)

where \(\Psi _i:\mathbb {R}\rightarrow [0,1]\) is a non-decreasing function and \(t_{ji}\) is the time when node i read the information published by its predecessor node j. Intuitively, the involvement of vertex i with the news and, hence, its spreading power depends on the strength with which node i received the information from its predecessors. Although more complex alternatives are possible, for the sake of simplicity we assume that

$$\begin{aligned} \Psi _i(x) = {\left\{ \begin{array}{ll} 0 &{} x<\psi _i,\\ \frac{x-\psi _i}{\Delta _i} &{} \psi _i\le x\le \psi _i+\Delta _i,\\ 1 &{} x > \psi _i+\Delta _i, \end{array}\right. } \end{aligned}$$
(6)

where \(\psi _i,\Delta _i\in \mathbb {R}^{\ge 0}\) are two parameters. While the value of \(\psi _i\) quantifies the ‘resistance’ of node i to become a spreader, \(\Delta _i\) specifies how quickly it achieves its maximum influence power. Melnik et al. [67] also considered nodes with different influence capacities. However, their influence was discretized in a finite set of levels.
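To make Eqs. (5) and (6) concrete, the following sketch evaluates the piecewise-linear response and the resulting influence capacity. The function names and the list `received`, which holds the values \(\beta _j(t_{ji})\) of the messages read so far, are illustrative assumptions. The case \(\Delta _i = 0\) is not defined by Eq. (6); here it is treated as a step at \(\psi _i\), which is our assumption.

```python
def psi_response(x, psi_i, delta_i):
    """Piecewise-linear response of Eq. (6), with values in [0, 1]."""
    if x < psi_i:
        return 0.0
    # Assumption: delta_i == 0 is treated as a step function at psi_i.
    if delta_i == 0 or x > psi_i + delta_i:
        return 1.0
    return (x - psi_i) / delta_i


def influence_capacity(beta_max_i, psi_i, delta_i, received):
    """Influence capacity of Eq. (5).

    `received` lists the influence beta_j(t_ji) carried by each message
    that node i has read up to the current time.
    """
    return beta_max_i * psi_response(sum(received), psi_i, delta_i)


# A node with beta_max = 2, psi = 1 and Delta = 0.5 that has read two
# messages of strength 0.75 and 0.5 is halfway up the linear ramp.
beta_now = influence_capacity(2.0, 1.0, 0.5, [0.75, 0.5])
```

With an accumulated exposure of 1.25 the response is 0.5, so the node spreads with half of its maximum influence.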

Susceptible nodes become infected if they have been exposed to sufficiently strong influences. In particular, node i goes from state S to state I after being informed of the news by node k if

$$\begin{aligned} \sum \limits _{t_{ji}\le t_{ki}} \beta _j\left( t_{ji}\right) \ge \phi _i, \end{aligned}$$
(7)

where \(\phi _i\) quantifies the resistance of node i to believe any piece of information. Since we assume that a node must be infected, i.e., it must believe in the information, before spreading it, we require \(\psi _i\ge \phi _i\). It must be noted, though, that there may be situations in which it might be convenient to assume that a spreader does not necessarily believe the news. A related approach was proposed by Huang et al. [68] that distinguished between infected and persuader nodes, with different thresholds to enable the transition from the susceptible state. In their setting, all infected nodes are propagators, but persuader agents have a larger influence power as they ‘immediately’ infect (persuade) all their neighbors.
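The thresholds \(\phi _i\) and \(\psi _i\) can be read as a simple classification of a node by its accumulated exposure. The sketch below is one possible reading: it assumes that a node counts as a spreader once its exposure strictly exceeds \(\psi _i\) (so that its influence capacity is positive when \(\Delta _i > 0\)); the text does not fix this boundary case, and the function name `node_state` is hypothetical.

```python
def node_state(exposure, phi_i, psi_i):
    """Classify a node as 'S', 'I' or 'P' from its accumulated exposure.

    `exposure` is the sum of beta_j(t_ji) over the messages read so far.
    Assumption: the node is a spreader only when exposure > psi_i.
    """
    if psi_i < phi_i:
        raise ValueError("the model requires psi_i >= phi_i")
    if exposure < phi_i:
        return "S"  # condition (7) is not yet met
    if exposure <= psi_i:
        return "I"  # believes the news but does not propagate it
    return "P"      # actively propagates the news


states = [node_state(x, 1.0, 1.2) for x in (0.5, 1.0, 1.3)]
```

Since exposure only accumulates, a node can move from S to I to P but never backwards, in line with the absence of a recovery state.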

It is important to observe that a node might need to be exposed to the same news from different sources in order to believe it or to start spreading it. However, this number depends on the influence capacity of its predecessors (\(\beta _j(t_{ji})\)) and the resistance of the node to believe (\(\phi _i\)) and propagate the news (\(\psi _i\)). Contagion paradigms where exposures from more than a single neighbor are needed in order to become infected have been dubbed in the literature complex contagion models. The seminal work by Centola and Macy [69] considered the case where a node became infected only if the fraction of infected neighbors was greater than a threshold. While Centola and Macy [69] focused on the structural properties of networks that facilitate or hinder contagion cascades, they also studied the impact of network heterogeneity and even the existence of some higher-status nodes (in the context of our model, nodes with high \(\beta _i^{{\max }}\)). In this sense, mathematical details aside, our proposal introduces a new concept, that of the involvement of a node in the propagation of a piece of news as dependent on the status of those from whom it received it. This approach can be related to the observation that the propagation of a piece of information depends on the strength of the link on which that information was received [70]. However, let us observe that, in our model, the influence power is a property of the node and not of the link between nodes.

There is another significant difference between this paper and most of the literature on complex contagions. Indeed, the vast majority of the papers are focused on the modeling side [68,69,70,71,72,73,74,75,76,77,78] or, in social and behavioral change applications, concerned with the best seeding strategies to diffuse a new cultural paradigm [79,80,81,82,83,84,85]. On the contrary, we are interested in how to stop or hinder the diffusion of a piece of news. A profound analysis of how information propagates can shed light on how to combat the spread of news [86]. However, we turn to a more straightforward approach by looking for the nodes that should be combated in order to most efficiently thwart diffusion. In the context of complex contagions, this approach has already been explored by Centola [87], who studied the robustness of diffusion in scale-free and exponential networks. However, Centola only analyzed the size of the infection cascade and the influence of removing either the highest degree or randomly chosen nodes. Kuhlman et al. [88] extended the work of Centola by analyzing other criteria for node selection. They showed that, under their modeling framework, selecting a minimal set of nodes to block infection is NP-hard and proposed efficient heuristics. Their work was further extended for more than one ‘illness’ in Carscadden et al. [89]. We must remark that none of these works considers the influence potential of the nodes as we do in this paper. In general, the problem of finding an optimal set of nodes to be removed is dubbed the critical node detection problem and there is a vast literature on the subject. The interested reader is referred to the review by Lalou, Tahroui and Khedouci [90]. It is interesting to note, however, that most papers consider that nodes can actually be removed from the network. Cavallaro et al. [91] argue that this may not always be the case, as some nodes may resist removal. While we do not pursue this idea further in this work, it deserves to be studied in the future.

A simple alternative to stop news diffusion is to attack the first infected nodes, but in practical settings those nodes might be discovered only when the infection has started to propagate and, thus, it would be inefficient as an immediate reaction. Moreover, in an online social network, new infections may come from external sources [70, 92] out of the control of the social network itself. Therefore, if it does not make sense to look for the first news-spreaders, what other nodes should be attacked or even banned from the network? This is the fundamental question we try to answer in this work.

4 Numerical simulations

Since the time between the moment a spreader posts a message and a follower reads it is modeled by an exponential random variable with constant parameter \(\lambda \), our simulations are based on Gillespie’s algorithm [93]. In particular, the time to the next infection, \(\delta T\), is simulated as

$$\begin{aligned} \delta T = -\frac{\ln (r_1)}{\sum _{j=1}^n \sum _{i\in \mathcal {P}_j} \lambda }, \end{aligned}$$
(8)

where \(r_1\) is a random number in (0, 1) and \(\mathcal {P}_j\) is the set of predecessors of j that are spreaders and have not yet infected j. We determine which node k is infected by generating a new random number \(r_2\) with a uniform distribution in (0, 1) and looking for the minimum k such that

$$\begin{aligned} \sum \limits _{j=1}^k \sum \limits _{i\in \mathcal {P}_j} \lambda \ge r_2\times \sum \limits _{j=1}^n \sum \limits _{i\in \mathcal {P}_j} \lambda . \end{aligned}$$
(9)

Finally, we assume an ordering of \(\mathcal {P}_k\) and choose the infecting node as the first one such that the sum on the left is greater than or equal to that on the right.
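A single step of the procedure described by Eqs. (8) and (9) can be sketched as follows. This is only an illustration under stated assumptions: the function name `gillespie_step`, the dictionary `pending` (mapping each node j to an ordered list holding \(\mathcal {P}_j\)), and the handling of rounding errors are ours. Updating the node states and the sets \(\mathcal {P}_j\) after each step is left to the caller.

```python
import math
import random


def gillespie_step(pending, lam, rng):
    """Draw the waiting time and the next (spreader, reader) pair.

    `pending[j]` is an ordered list of the spreader predecessors of j
    that have not yet informed j (the set P_j). Returns (dt, i, k), or
    None when no contact is pending.
    """
    total = lam * sum(len(p) for p in pending.values())
    if total == 0:
        return None
    r1 = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    r2 = rng.random()        # uniform in [0, 1)
    dt = -math.log(r1) / total         # Eq. (8)
    target = r2 * total                # right-hand side of Eq. (9)
    cumulative = 0.0
    last = None
    for k in sorted(pending):          # nodes in increasing index order
        for i in pending[k]:           # assumed ordering of P_k
            cumulative += lam
            last = (i, k)
            if cumulative >= target:
                return dt, i, k
    # Floating-point rounding only: fall back to the last pending pair.
    return dt, last[0], last[1]


rng = random.Random(0)
step = gillespie_step({1: [], 2: [5], 3: [5, 6]}, 1.0, rng)
```

Because all pending contacts share the same rate \(\lambda \), each pending pair is selected with equal probability and the total rate is \(\lambda \) times the number of pending pairs.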

Table 1 Simulation parameters—Barabási–Albert network

In order to understand the workings of the proposed model, we begin by studying its behavior in synthetic Barabási–Albert networks [94]. In particular, we start with a network with \(n = 1000\) nodes where each newly added node attaches to \(m = 5\) other vertices according to the preferential attachment mechanism. Since we shall focus on networks where most of the nodes are similar and only a few of them stand out as great potential influencers, we assign attributes randomly to each vertex according to the following rules: \(\beta ^{\max }_i\sim U(0.9,1.1)\), \(\phi _i\sim U(0.9,1.1)\), \(\psi _i\sim U(\phi _i,1.1)\), where \(U(a,b)\) corresponds to the uniform distribution between a and b. The choice of the distribution is not significant, and it only introduces a small heterogeneity among nodes. For the sake of reference, we summarize all simulation parameters in Table 1.
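The attribute assignment for regular nodes can be sketched with the standard library alone. The function name `sample_attributes` and the dictionary layout are assumptions for illustration; the sketch covers only the three uniform rules stated above and leaves out the network generation and the high-status nodes.

```python
import random


def sample_attributes(n, rng):
    """Draw beta_max, phi and psi for nodes 1..n of a regular population.

    Rules: beta_max ~ U(0.9, 1.1), phi ~ U(0.9, 1.1), psi ~ U(phi, 1.1),
    so that psi >= phi holds for every node by construction.
    """
    attrs = {}
    for i in range(1, n + 1):
        phi = rng.uniform(0.9, 1.1)
        attrs[i] = {
            "beta_max": rng.uniform(0.9, 1.1),
            "phi": phi,
            "psi": rng.uniform(phi, 1.1),
        }
    return attrs


attrs = sample_attributes(1000, random.Random(42))
```

Drawing \(\psi _i\) from an interval that starts at \(\phi _i\) enforces the requirement that a node must believe the news before it can spread it.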

Our interest is to find which nodes are the most convenient to remove from the network to either stop or alleviate the propagation of the news. We shall evaluate the convenience of removing nodes according to some topological metric such as the out degree, the closeness centrality, the eigenvector centrality, the PageRank centrality, and the betweenness centrality. We also study the benefit of removing nodes with the largest influence potential \(\beta _i^{\max }\) and, for the sake of comparison, randomly chosen vertices. After removing 10 nodes based on one of these criteria, we choose 10 other nodes at random to become initial spreaders with their maximum influence power, i.e., \(\beta _i(0) = \beta _i^{\max }\). While the choice of the number of removed nodes is, in principle, arbitrary, it must be observed that it agrees with \(n_\textrm{gr}\), the number of nodes with \(\beta ^{\max }_i = \beta _\textrm{gr}\). That is, if we remove some of the more influential nodes, we remove all of them. At the end of this section, we consider the influence of this decision by varying \(n_\textrm{gr}\). We must also mention that the set of initially infected nodes may or may not intersect the set of more influential nodes. This choice is in agreement with previous studies [95].
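The removal step is the same for every criterion: rank the nodes by a score and delete the top ones together with their links. A minimal sketch follows, where the function name `remove_top_k`, the successor-dictionary representation, and the tie-breaking by smallest node index are assumptions.

```python
def remove_top_k(successors, score, k):
    """Remove the k highest-scoring nodes and all links touching them.

    `score` maps each node to the chosen criterion (e.g., out degree or
    beta_max). Assumption: ties are broken by the smallest node index.
    Returns the pruned successor dictionary and the set of removed nodes.
    """
    ranked = sorted(score, key=lambda v: (-score[v], v))
    removed = set(ranked[:k])
    pruned = {
        u: [v for v in vs if v not in removed]
        for u, vs in successors.items()
        if u not in removed
    }
    return pruned, removed


graph = {1: [2, 3], 2: [3], 3: [1], 4: [1]}
out_degree = {u: len(vs) for u, vs in graph.items()}
pruned, removed = remove_top_k(graph, out_degree, 1)
```

Passing a dictionary of \(\beta _i^{\max }\) values instead of `out_degree` gives the influence-potential criterion with no other change.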

Fig. 1 Boxplots for the number of infected and spreader nodes after 2 time units (left) and after an infinite amount of time (right). Outliers are not presented for the sake of clarity. Each boxplot corresponds to the result after the removal of randomly chosen nodes (Random), vertices with the highest out degree (\(d^\textrm{out}\)), closeness (\(c_{cl}\)), eigenvector centrality (\(c_{eg}\)), PageRank (\(c_{pr}\)), betweenness (\(c_{be}\)), and influence potential (\(\beta ^{\max }\))

Fig. 2 Temporal evolution of the mean (left) and standard deviation (right) of the number of infected and spreader nodes in Barabási–Albert networks with \(m=5\) connections made by each newly added node

Fig. 3 Temporal evolution of the mean (left) and standard deviation (right) of the number of infected and spreader nodes in Barabási–Albert networks with \(m=2\) connections made by each newly added node

Figure 1 presents boxplots showing the first three quartiles of the number of infected and spreader nodes. The results correspond to 10,000 simulations with different networks. We used \(\lambda = 1\); the left panel shows the number of I and P nodes after 2 time units, while the right panel shows the numbers after an infinite time period. We do not present the outliers in the boxplot graph for the sake of clarity. In principle, there does not appear to be any clear advantage in using any of the proposed metrics and, in general, random removal of nodes is the worst option. Thus, in the remainder of the paper and for the sake of clarity, we shall focus our attention on the removal of nodes with the highest out degree \(d^\textrm{out}_i\) and with the highest influence potential \(\beta ^{\max }_i\).

The temporal evolution of the mean and the standard deviation of the number of infected and spreader nodes is shown in Fig. 2. As can be observed, removing the vertices with the highest \(\beta ^{\max }_i\) is not only the best option on average, but also exhibits the lowest standard deviation. Nonetheless, the spread of values is such that, in a number of situations, removing the nodes with the largest degree might turn out to be advantageous. Since the large standard deviation appears to be inherent to the Poissonian contagion processes, we shall focus on the average behavior in the remainder of this work.

It is interesting to observe the influence of the network density in the temporal evolution determined by our model. Figures 3 and 4 show results for Barabási–Albert networks where each new added node links to \(m = 2\) and \(m = 10\) existing nodes, respectively. As it can be observed, removing nodes with higher degree is more effective than removing vertices with higher influence potential when the density of the network is low. The intuition behind this result comes from the fact that super-spreaders need their neighbors to be well connected so that their influence spreads through the network. Therefore, networks with higher density are more susceptible to damage caused by nodes with high influence power.

Fig. 4 Temporal evolution of the mean (left) and standard deviation (right) of the number of infected and spreader nodes in Barabási–Albert networks with \(m=10\) connections made by each newly added node

Fig. 5 Influence of \(\beta _\textrm{gr}\) (left) and \(\Delta _i\) (right). A logarithmic scale is used in the second figure in order to better appreciate the differences between the three curves

Fig. 6 Influence of \(\overline{\phi }\) (left) and \(\overline{\psi }\) (right). A logarithmic scale is used in the first figure in order to better appreciate the differences between the three curves

In order to understand the impact of each parameter in the model, we vary them while showing the results after 2 time units for the case of the Barabási–Albert graph with \(m = 5\) new links with each new node. The left panel of Fig. 5 shows that, as expected, the removal of the nodes with higher \(\beta ^{\max }_i\) becomes more effective as the value of \(\beta _\textrm{gr}\) increases. We can arrive at a similar conclusion in relation to changes to \(\Delta _i\), as shown in the right panel of Fig. 5.

Fig. 7 Influence of \(\overline{\beta }^{\max }_i\) (left) and of the number of nodes with \(\beta ^{{\max }}_i = \beta _\textrm{gr}\) (right)

The left panel of Fig. 6 presents results when the mean value of \(\phi _i\) is changed. These results correspond to assigning the attributes \(\phi _i\sim U(\overline{\phi }-0.1,\overline{\phi }+0.1)\), \(\psi _i\sim U(\phi _i,\overline{\phi }+0.1)\). It is immediate to observe that as the resistances of the nodes to become infected and spreader increase, the number of infected vertices decreases. For small values of \(\phi _i\) (\(\overline{\phi } < 0.5\)), nodes are easily infected and, thus, it is convenient to remove higher degree nodes as there is no need for a large influence power for the spread of misinformation. However, when resistance to infection increases (\(\overline{\phi }\in (0.5,1.0)\)), the influence power of spreading nodes becomes more relevant and it is more effective to remove nodes with higher \(\beta ^{\max }\). Finally, for large resistance values (\(\overline{\phi } > 1\)), there is not a significant difference between the behavior when the nodes of higher degree or the nodes with the largest \(\beta ^{\max }\) are removed. Since the initially infected nodes are a small portion of the population and the resistance to infection is high, it may be the case that the infection does not reach nodes with high degree or large \(\beta ^{\max }\) by the time considered in Fig. 6 (2 time units).

Results where only the value of \(\psi _i\) is changed are shown in the right panel of Fig. 6. In particular, nodes received the value \(\psi _i\sim U(\max (\overline{\psi }-0.1,\phi _i),\overline{\psi }+0.1)\). Observe that, for higher values of \(\overline{\psi }\), it becomes more effective to remove nodes with higher degree. As a matter of fact, once the nodes with higher \(\beta ^{{\max }}\) have spread the information as much as they can (if they are present), most of the contagions are due to ‘regular’ nodes. If the resistance to become a spreader increases, a larger number of contagions is needed to transform an infected node into a spreader. A larger number of contagions is possible if the node has many incoming links, that is, a high in degree. Since the Barabási–Albert network is, in effect, undirected, removing the nodes with high degree hinders this process.
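The two attribute assignments used for Fig. 6 can be illustrated with a short Python sketch. It is not the authors' code: the function names, the `half_width` argument and the explicit check on the interval are assumptions introduced here, and the distribution of \(\phi _i\) in the second experiment is taken as an input because the text does not restate it.

```python
import random


def sample_phi_psi(phi_bar, half_width=0.1, rng=random):
    """Draw (phi_i, psi_i) for the experiment that varies the mean of phi.

    phi_i ~ U(phi_bar - 0.1, phi_bar + 0.1) and psi_i ~ U(phi_i, phi_bar + 0.1),
    so that psi_i >= phi_i always holds.
    """
    phi = rng.uniform(phi_bar - half_width, phi_bar + half_width)
    psi = rng.uniform(phi, phi_bar + half_width)
    return phi, psi


def sample_psi(phi, psi_bar, half_width=0.1, rng=random):
    """Draw psi_i ~ U(max(psi_bar - 0.1, phi_i), psi_bar + 0.1) for a given phi_i."""
    lower = max(psi_bar - half_width, phi)
    upper = psi_bar + half_width
    if lower > upper:
        # Assumption of this sketch: reject an empty interval instead of guessing.
        raise ValueError("phi_i exceeds psi_bar + half_width; interval is empty")
    return rng.uniform(lower, upper)


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_phi_psi(0.7, rng=rng))
    print(sample_psi(phi=1.0, psi_bar=1.5, rng=rng))
```

Passing a seeded `random.Random` instance keeps the draws reproducible across realizations.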

We vary the influence potential of regular nodes by assigning \(\beta ^{\max }_i\sim U(\overline{\beta }^{\max }-0.1,\overline{\beta }^{\max }+0.1)\) and changing the mean value \(\overline{\beta }^{\max }\). As can be observed in the left panel of Fig. 7, the removal of the nodes with higher \(\beta _i^{\max }\) becomes less effective as \(\overline{\beta }^{\max }\) increases. Intuitively, as the regular nodes become potentially better spreaders, the importance of nodes with large influence power diminishes in comparison.

We also evaluate the influence of \(n_\textrm{gr}\), the number of higher-status nodes with large \(\beta ^{{\max }}_i = \beta _\textrm{gr}\). Keeping the number of removed nodes and the number of initially infected nodes fixed at 10, as the number of nodes with \(\beta ^{{\max }}_i = \beta _\textrm{gr}\) increases, it becomes less convenient to remove them, in comparison with the removal of nodes with high degree. As can be observed in the right panel of Fig. 7, when \(n_\textrm{gr} > 80\), the removal of high-degree nodes is more efficient. This suggests that the removal of nodes with high \(\beta _i^{\max }\) is more effective when a large portion of them is removed from the network. If this is not possible, then it is better to focus on the removal of high-degree nodes.

In previous simulations, the value of the influence power \(\beta ^{\max }_i\) was uncorrelated with the degree. Although this is a convenient starting point of analysis, in real-world scenarios a relation between both quantities is to be expected. In order to understand the consequences of a correlation between them, we set up networks like those used in Fig. 2. However, the nodes with \(\beta ^{\max }_i=\beta _\textrm{gr}\) were not chosen uniformly at random from the set of all vertices, but with a probability proportional to \((d^\textrm{out}_i)^\alpha \). The value of \(\alpha \in \mathbb {R}\) changes the correlation between the out degree and the influence power, but it does not alter their marginal distributions. Figure 8 shows the mean (left panel) and the standard deviation (right panel) of the number of infected nodes at time \(t=5\), for 10,000 simulations and \(\alpha \in [-10,+10]\). As can be readily observed, there is a range of values of \(\alpha \), corresponding to both negative and positive correlations, for which the deletion of the nodes with higher influence capacity is the more convenient option. For \(\alpha < -2\), nodes with large \(\beta ^{\max }_i\) are poorly connected and, thus, it is more convenient to isolate vertices with higher degree. On the contrary, for \(\alpha > 3\), nodes with higher degree are also (with high probability) nodes with high influence power. Hence, isolating nodes according to either \(d^\textrm{out}_i\) or \(\beta ^{\max }_i\) yields almost the same results.
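The degree-biased choice of the high-influence nodes can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pick_high_influence_nodes`, the dictionary input, and the use of sequential draws without replacement are choices made here, since the text only states that the selection probability is proportional to \((d^\textrm{out}_i)^\alpha \). The sketch also assumes every out degree is positive, as is the case in a Barabási–Albert graph.

```python
import random


def pick_high_influence_nodes(out_degree, n_gr, alpha, rng=random):
    """Pick n_gr distinct nodes, each draw weighted by out_degree ** alpha.

    out_degree: dict mapping node -> out degree (assumed > 0).
    alpha = 0 recovers a uniform choice; alpha > 0 favours hubs,
    alpha < 0 favours poorly connected nodes.
    """
    if n_gr > len(out_degree):
        raise ValueError("cannot pick more nodes than the graph contains")
    candidates = list(out_degree)
    weights = [out_degree[v] ** alpha for v in candidates]
    chosen = set()
    for _ in range(n_gr):
        # One weighted draw among the nodes that have not been picked yet.
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.add(candidates[idx])
        # Swap-remove the picked node so it cannot be drawn again.
        candidates[idx] = candidates[-1]
        weights[idx] = weights[-1]
        candidates.pop()
        weights.pop()
    return chosen


if __name__ == "__main__":
    degrees = {"a": 50, "b": 5, "c": 5, "d": 7, "e": 12}
    print(pick_high_influence_nodes(degrees, n_gr=2, alpha=1.0,
                                    rng=random.Random(1)))
```

Each draw costs time linear in the number of remaining nodes, which is acceptable for a small \(n_\textrm{gr}\); a different sampling scheme may be preferable for large selections.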

Fig. 8 Mean (left) and standard deviation (right) of the number of infected and spreader nodes in Barabási–Albert networks with \(m=5\) connections made by each new added node. Nodes with \(\beta ^{\max }_i=\beta _\textrm{gr}\) were chosen with a probability proportional to \((d^\textrm{out}_i)^\alpha \)

Fig. 9 Complementary cumulative distribution function of the spreading capacity (left) and capacity as a function of out degree (right) in the Twitter network

4.1 Twitter network

The use of synthetic networks has the advantage of allowing control over their characteristics; however, their structures may be somewhat different from those of networks found in the wild. Moreover, the Barabási–Albert graph is intrinsically undirected and we are more interested in the directed propagation of information. For these reasons, in this section we apply our model to real network data. In particular, we use data from Twitter that was first presented in De Domenico et al. [96]. The dataset was built by following the messages posted on Twitter about the discovery of the Higgs boson between 1 and 7 July 2012, and it contains the network of Twitter followers as well as statistics on retweets, replies and mentions.

From the social graph, we kept only the largest strongly connected component, consisting of 360,210 vertices and 14,102,583 directed edges. In order to represent the flow of information, edges were directed as going from accounts to each of their followers. One very interesting characteristic of the dataset is that there are several non-topological properties that may be considered as measures of influence power; for example, the number of times an account was mentioned or the number of retweets of the account’s posts. Thus, we take the number of mentions \(n_i^{\textrm{men}}\) that node i received as a measure of its spreading capacity \(\beta _i^{{\max }}\). However, given the short period of time (7 days) that was surveyed, not every node was mentioned. To give every node a nonzero spreading capacity, we set \(\beta _i^{\max } = n_i^{\textrm{men}}+1\). The left panel of Fig. 9 shows the complementary cumulative distribution function of the spreading capacity, showing that a large fraction of the nodes has a small value of \(\beta ^{\max }\).
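The preprocessing described above can be illustrated with a minimal Python sketch. The input formats are assumptions made here, not a description of the dataset files: follower relations are taken as `(follower, followed)` pairs and mentions as a flat list of mentioned accounts. The extraction of the largest strongly connected component is not shown, and the complementary cumulative distribution is computed with the convention \(P(X \ge x)\), which is also an assumption.

```python
from collections import Counter


def information_flow_edges(follow_pairs):
    """Turn (follower, followed) pairs into edges account -> follower."""
    return [(followed, follower) for follower, followed in follow_pairs]


def spreading_capacity(nodes, mentioned_accounts):
    """beta_max_i = (number of mentions of node i) + 1, for every node."""
    mentions = Counter(mentioned_accounts)
    return {v: mentions[v] + 1 for v in nodes}


def ccdf(values):
    """Return [(x, P(X >= x))] for each distinct value x, in increasing order."""
    values = list(values)
    n = len(values)
    counts = Counter(values)
    remaining = n
    points = []
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points


if __name__ == "__main__":
    follows = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
    beta = spreading_capacity({"alice", "bob", "carol"},
                              ["alice", "alice", "bob"])
    print(information_flow_edges(follows))
    print(beta)
    print(ccdf(beta.values()))
```

Adding one to the mention count means that an account that was never mentioned still has capacity 1 rather than 0.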

Table 2 Simulation parameters—Twitter network
Fig. 10 Temporal evolution of the mean (left) and standard deviation (right) of the number of infected and spreader nodes in the Twitter network

It is interesting to analyze the relation between the spreading capacity of each node in the resulting network and some of its topological properties. The right panel of Fig. 9 presents a scatter plot of the spreading capacity as a function of the out degree. The correlation coefficient between both quantities is \(\sim 0.37\); its positive sign reflects the observed tendency of higher \(\beta ^{\max }\) to correspond to larger \(d^{\textrm{out}}\). However, the correlation coefficient is not close to unity: the popularity of a node, measured here by the number of mentions, is not perfectly correlated with a topological property such as its degree. This fact underlies our proposal, i.e., that there may exist some non-topological properties of vertices which are important in the dissemination of news.
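As an illustration of how such a coefficient can be computed, the following Python sketch evaluates the Pearson correlation between two lists of node attributes. The text does not state which correlation coefficient was used, so the choice of Pearson's is an assumption, as are the function name and the toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values, not taken from the Twitter dataset.
    out_degree = [1, 2, 3, 4]
    beta_max = [1, 3, 2, 4]
    print(pearson(out_degree, beta_max))
```

A value well below 1 indicates, as in the Twitter data, that ranking nodes by out degree and ranking them by spreading capacity do not select the same accounts.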

Since some nodes reach values of \(\beta ^{{\max }}>10^4\), we set a large value of \(\Delta _i\), namely, \(\Delta _i = 100\). The remaining parameters are randomly assigned: \(\phi _i\sim U(0.9,1.1)\), and \(\psi _i\sim U(\phi _i,1.1)\). For our simulations, we arbitrarily set \(\lambda = 1\) (arbitrary units). Simulation parameters are summarized in Table 2. Figure 10 shows the results for 100 realizations when the number of removed nodes and the number of initially infected vertices were each set to 1% of the graph size, that is, 3602 nodes. As can be readily observed, it is more advantageous to remove nodes with higher influence power than, for example, vertices with higher degree, as we have already observed in the synthetic networks.
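The two removal strategies compared here differ only in the score used to rank the nodes. The following Python sketch shows one way to build both removal sets; the function names, the rounding down of the 1% target and the handling of ties (which follows the iteration order of the input dictionary) are assumptions of this sketch and are not specified in the text.

```python
import math


def removal_size(n_nodes, fraction=0.01):
    """Number of nodes to remove, rounded down (assumed rounding rule)."""
    return math.floor(n_nodes * fraction)


def top_k_nodes(score, k):
    """Return the k nodes with the largest score (dict node -> score)."""
    ranked = sorted(score, key=score.get, reverse=True)
    return set(ranked[:k])


if __name__ == "__main__":
    # Toy attributes, not taken from the Twitter dataset.
    beta_max = {"a": 3, "b": 10, "c": 1, "d": 7}
    out_degree = {"a": 40, "b": 2, "c": 35, "d": 8}
    k = 2
    print("by influence power:", top_k_nodes(beta_max, k))
    print("by out degree:     ", top_k_nodes(out_degree, k))
    print("nodes removed at 1% of 360,210:", removal_size(360210))
```

In the toy example the two sets are disjoint, which is the situation in which the choice of strategy matters most.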

Fig. 11 Influence of \({\Delta }_i\) in the Twitter network. A logarithmic scale is used in order to better appreciate the differences between the three curves. The parameters \(\phi _i\) and \(\psi _i\) were randomly assigned as \(\phi _i \thicksim U(0.9, 1.1)\) and \(\psi _i\thicksim U(\phi _i,1.1)\)

As we did in the previous section, we study the influence of model parameters on the results. Figure 11 shows the mean number of infected and spreader nodes at time \(t=5\) as the value of \(\Delta _i\) varies. The results correspond to 100 realizations where \(\phi _i\) and \(\psi _i\) were randomly chosen as described in the previous paragraph. It can be observed that the strategy that removes nodes according to \(\beta ^{{\max }}\) is more convenient for intermediate values (\(\Delta _i \in (30,300)\)), while it offers no real advantage over selection by out degree for either smaller or larger values. This behavior is different from that observed in the synthetic graphs (cf. right panel of Fig. 5). A plausible cause for this difference may be the positive correlation between \(\beta ^{{\max }}\) and \(d^{\textrm{out}}\) in the Twitter network. This hypothesis is supported by the results in Fig. 8 for synthetic networks.

Fig. 12 Influence of \(\overline{\phi }\) (left) and \(\overline{\psi }\) (right) in the Twitter network. While \(\Delta _i = 100\) was kept constant, \(\phi _i\) and \(\psi _i\) were varied in all realizations

The right panel of Fig. 12 shows the influence of the mean value of \(\psi _i\). The results correspond to 100 realizations, where \(\Delta _i = 100\) was fixed and \(\phi _i\) and \(\psi _i\) were varied in each simulation according to \(\phi _i\sim U(0.9,1.1)\), \(\psi _i\sim U(\max (\overline{\psi }-0.1,\phi _i),\overline{\psi }+0.1)\). As in the synthetic networks, the benefit of using the removal strategy based on the spreading capacity diminishes as the value of \(\overline{\psi }\) increases. However, a much larger change is needed in order to observe significant differences (compare the doubly logarithmic scales in the right panel of Fig. 12 to the linear ones in the right panel of Fig. 6).

The left panel of Fig. 12 exhibits the influence of the mean value of \(\phi _i\). However, it is difficult to disentangle its influence from that of \(\psi _i\) as it must be true that \(\psi _i\ge \phi _i\). The results correspond to 100 realizations, where \(\Delta _i = 100\) was fixed and \(\phi _i\) and \(\psi _i\) were varied in each simulation according to \(\phi _i\sim U(\overline{\phi }-0.1,\overline{\phi }+0.1)\), \(\psi _i\sim U(\phi _i,\overline{\phi }+0.1)\).

5 Conclusions

We studied the problem of mitigating the propagation of misinformation in networks. While most of the literature focuses on the characterization of influential agents based only on the topological properties related to their position in the social network, we introduced a new characteristic which is an intrinsic attribute of the node: its influence power.

To evaluate the impact of the influence power on the spread of fake news in the network, we put forth a new model of information diffusion. In this model, we related this attribute to an inherent potential of the agent and to the intensity of its commitment to the news propagation which is, in turn, due to the strength with which such information was received. By means of numerical simulations on both synthetic and real-world networks, we showed that, under certain conditions, the influence power of the agent can be more important than its topological properties.

Our results suggest that the removal of nodes with high influence power is more effective in combating the diffusion of misinformation in denser networks. Moreover, this method of mitigating the spread of fake news also seems to be more robust, since it yielded the smallest standard deviation of the results. That strategy is also more convenient when the influence power of a few nodes is much larger than that of the general population. On the other hand, for the method to be effective it is necessary to remove a large portion of such super-spreaders.

As an important real-world example, we presented simulations based on a large Twitter dataset (> 300 thousand nodes). We posited that the number of times that a Twitter account is mentioned may represent its actual spreading capacity. Building on this idea, we showed that removing the nodes with higher influence power is more convenient than removing those with higher degree in order to curtail the propagation of a news item.

We must remark that our main objective is to demonstrate that there can exist node attributes that are more important than structural characteristics, such as the degree, for the propagation of information. Since the main application of our results is the obstruction of the viral propagation of fake news or misinformation by selective removal of agents from a social network, we believe that we have made a case for accounting for node attributes when choosing which nodes to remove.

Finally, we call attention to certain practical difficulties related to the removal of nodes from a social network. Indeed, the subject of government policies and legal actions in relation to misinformation is a delicate matter which is very actively discussed worldwide (see, e.g., [97,98,99,100,101,102] and references therein). Banning a user from a social network can be considered an act contrary to the fundamental right of free speech. Moreover, legislation can be misused by totalitarian governments or ill-intentioned corporations. All in all, the methodology of removing nodes with high influence capacity, if this capacity is appropriately defined and quantified, may be more transparent for lay people than other automatic or algorithmic approaches and, hence, more desirable [101].