Keywords

1 Introduction

Nowadays, the use of social media networks has become a necessary daily activity for people to interact with family and friends, access news, information and make decisions. Besides being a handy means for keeping in touch with friends and family, social media is more of a platform spreading the tremendous influence. Social media has great impact on businesses, politics, disease control and others. To this end, researchers have studied various practical problems in social media to better understand how the social media behaves and propagates the information. The problems include buzz prediction [7], volume prediction [24], infection prediction [3, 17], source prediction [21], link detection [19], target set selection [6, 8, 9, 14, 26] and firefighter problem [2].

1.1 Motivation and Problem Description

Here we focus on investigating the problem of target set selection. Formally, we define a social media network as a directed graph \(G=(V,E)\), where users are defined as all nodes V and their friendships are defined as all edges E. Users are active when they repost the messages. Target set selection problem refers to find a minimum subset \(S \subset V\), then all the nodes will be activated by S. The target set selection problem can be applied in the areas of viral marketing [11] and cyber security [4].

In the target set selection model, the seed nodes spread the influence until the diffusion process stops. The goal is to activate all the nodes. The influenced users refer to activated users who repost the messages. However, in reality, influenced users sometimes will repost the messages. But in most cases, even they are convinced or influenced by the message, users will not repost the message for certain reasons. In this case, influence not only refers to activation (repost) but also refers to the belief in the messages. Thus we build the influence and activation thresholds target set selection models to describe the situation. Here we introduce two thresholds, one is activation threshold \((\phi )\) and another one is influence threshold \((\theta )\). Users will be influenced first before be activated, thus we define \(\theta <=\phi \). The goal of the model is to influence all the nodes.

Our models are time-indexed integer program models, which can be divided into two parts, the first part is the information propagation. There are two widely used propagation models, namely Independent Cascade Model (IC) [14] and Linear Threshold Model (LT) [12]. Independent Cascade Model assumes every node has a single chance to activate its neighbors. In Linear Threshold Model, each node will be influenced by each neighbor according to a weight. When the total weights from its neighbors is larger than a threshold \(\phi \), then the node will be activated. In this paper, we propose all the mathematical models based on the Linear Threshold Propagation Model. Here we set the weights as 1 for all the nodes. Thus an inactive node will become active if at least \(\phi \) of its neighbors are active in the previous step. The second part is the influence dominating part, which means the users should be either active or influenced through having at least \(\theta \) of activated neighbors at time T.

1.2 Literature Review

To investigate the influence and activation thresholds target set selection problem, the research of target set selection problem offers us some good insights. Kempe et al. [14] study the maximum active set (under the name target set selection) and show that it is NP-hard. They also provide a greedy algorithm within provable approximation guarantees based on the submodularity property of the objective function. Chen et al. [9] study the minimum target set selection problem and show the problem is hard to approximate within a polylogarithmic factor. Besides, he comes up a polynomial-time algorithm to find an optimal solution when the underlying graph is a tree. Ackerman et al. [1] propose a combinatorial model for the minimum target set selection and prove the combinatorial bounds for the perfect target set selection problem. Shakarian et al. [22] present a time-indexed formulation to find the minimum seeds for the target set selection and come up with a scalable heuristic based on the idea of shell decomposition. Spencer et al. [23] consider the problem of how to target individuals with subsidy in the network in order to promote pro-environmental behavior. It is also a target set selection problem and they use a time-indexed integer program formulation with as many time periods as the number of nodes in the network to tackle the problem. Günneç et al. [13] study the variation of the target set selection problem called least cost target set selection on social networks, and they propose greedy algorithm and dynamic programming algorithm to solve the problem for the tree structure network. Raghavan et al. [18] develop and implement a branch-and-cut approach to solve the weighted target set selection problem on arbitrary graphs.

1.3 Contribution and Organization

To our knowledge, the previous research involving target set selection focuses on the single threshold (activation threshold) target set selection. In this paper, we propose practical influence and activation thresholds target set selection mathematical model and its computational algorithms correspondingly.

The rest of the paper is organized as follows. Section 2 presents the novel Minimum Influence and Activation Thresholds Target Set Selection Model. In Sect. 3, we propose several computational algorithms for solving large scale networks. Section 4 shows the experimental results of the proposed model and its corresponding computational algorithms. Section 5 concludes the article.

2 Minimum Target Set Selection Model with Influence and Activation Thresholds

In this section, we introduce a time-indexed integer program to find the minimum number of influential seeds for the influence and activation thresholds target set selection problem. An artificial time index t taking values from 0 to T is introduced to model the order in which nodes become active. The messages could propagate at varying distances through different forms of social media. Cha et al. [5] observe that even for popular photos, only 19% of fans are more than 2 hops away from uploaders on Flickr.com. Ye et al. [25] find that, on Twitter, 37.1 percent message flows spread more than 3 hops away from the originators. Thus here we set T as 0,1,2,3, which means we only consider the cascades less than or equal to three time steps. The formulation uses a binary variable \(x_{i,t}\) to represent the status of node i at time t, which is 1 if node i is active at time t and 0 otherwise. Here \(\theta \) represents the influence threshold and \(\phi \) represents the activation threshold. Nodes should always be influenced before activated, so we set \(\theta <= \phi \). N(i) represents the neighborhood of node i. The Minimum Influential Seeds Model is as follows:

$$\begin{aligned} \min _{\mathbf {x}} \quad&\sum _{i \in V} x_{i,0} \end{aligned}$$
(1)
$$\begin{aligned} s.t. \quad&\theta x_{i,T}+\frac{1}{|N(i)|}\left\{ \sum _{j \in N(i)} x_{j,T}\right\} -\theta \ge 0&\forall \, i \in V \end{aligned}$$
(2)
$$\begin{aligned}&\frac{1}{|N(i)|}\left\{ \sum _{j \in N(i)} x_{j,t}\right\} +\phi x_{i,t} \ge \phi x_{i,t+1}&\forall \, i \in V, t<T \end{aligned}$$
(3)
$$\begin{aligned}&\frac{1}{|N(i)|}\left\{ \sum _{j \in N(i)} x_{j,t}\right\} +\phi x_{i,t} \le \phi -\epsilon +(1+\epsilon )x_{i,t+1}&\forall \, i \in V, t<T \end{aligned}$$
(4)
$$\begin{aligned} \begin{array}{rcll} \theta &{}\in &{} (\,0,1\,] &{} \leftarrow \text {influence threshold}\\ \phi &{}\in &{} (\,0,1\,] &{} \leftarrow \text {activation threshold}\\ \theta &{}\le &{} \phi &{} \leftarrow \text {node is influenced before activated}\\ N(i) &{}=&{} \{\;j: (i,j) \in E\;\} &{} \leftarrow \text {Neighborhood, adjacent nodes}\\ \end{array} \end{aligned}$$

The objective function (1) aims to minimize the seed nodes activated at time 0. Constraints (2) are influential constraints, making sure that all the nodes should be either active or be influenced by at least \(\theta \) of active neighbors at time T. Constraints (3) refer that a node i will stay inactive at time \(t+1\) when it is not activated at time t. Constraints (4) restrict that a node will stay active if it is originally active, which means when \(x_{i,t}=1\), \(x_{i,t+1}\) should be 1 as well. In addition, the constraints make sure that a node will become active at time \(t+1\) when it is activated at time t, which means when \(x_{i,t}=0\), the influence from its neighbors is larger or equal than \(\phi \), then \(x_{i,t+1}=1\). Here we introduce two \(\epsilon \), the first \(\epsilon \) restricts that node i should be active at time \(t+1\) even if the influence from its’ neighbors is \(\phi \). The second \(\epsilon \) confirms that when \(x_{i,t}=1\) and all the neighbors of i are active, the node i being active at time \(t+1\) still holds.

3 Computational Algorithms

The time-indexed integer program model proposed in Sect. 2 is computationally intractable unless in very small instances because of the large number of binary variables. However, social media networks are usually in an extremely large scale. Thus we apply multiple computational algorithms to tackle the influence and activation thresholds target set selection model for larger scale networks in this manuscript. More details will be discussed in the rest of this section.

3.1 Graph Partition

When the social media network is large-scale, solving the models exactly through Gurobi is very difficult. The most intuitive way is to solve multiple smaller subgraphs instead of one large graph. Here we use techniques from Modularity and Community Structure [10] in networks to divide the large graph into several smaller subgraphs. Then we solve the models exactly separately for each subgraph.

3.2 Nodes Selection

When we’re dealing with influence and activation thresholds target set selection problem for large-scale network, the large number of binary decision variables, which are |V||T| in total, makes the problem difficult to solve. In order to accelerate the computational speed, we could reduce the decision variables through adding some constraints to restrict that some of the nodes are not selected or some of the nodes should be selected. Here we come up with two methods, one is to delete the leaf nodes, and another is to choose the nodes with high degree.

Leaf Nodes Deletion. Leaf nodes in a connected graph may not be seeded because they’ll influence or activate at most one neighbor directly. Thus we add the constraints (5) to remove the option of activating leaf nodes. In other words, all the leaf nodes will not be seeded using the method.

$$\begin{aligned} x_{i,0} +1&\le |N(i)| \end{aligned}$$
(5)

Degree Centrality Selection. Nodes with high degree have more potential to influence and activate other nodes. Therefore, we assume the high degree nodes must be seeded. Here \(\frac{1}{|V|} \ll \rho < 1\) is defined as the criteria for choosing the seed nodes. When the total neighbors |N(i)| of node i is larger than \(\rho |V|\), the node i will be seeded. Thus we add the constraints (6) to the original model in order to choose the nodes with more than \(\rho |V|\) neighbors as seed nodes.

$$\begin{aligned} x_{i,0} + \rho&\ge \frac{|N(i)|}{|V|} \end{aligned}$$
(6)
$$\begin{aligned} \rho = \epsilon&\text { implies that } N(i) = 1 \text { for a connected graph}\\ \rho = 1-\epsilon&\text { implies that } N(i) = V-1 \end{aligned}$$

The larger the \(\rho \), the nodes with higher degree will be selected as seed nodes. When \(\rho \) is \(\epsilon \), the nodes having neighbors will all be selected. When \(\rho \) is \(1-\epsilon \), only the node connecting to all the other nodes will be selected.

3.3 Greedy Algorithm

We propose the greedy algorithm for the Minimum Influence and Activation Thresholds Target Set Selection problem. The greedy algorithm selects the seed node with the largest number of inactive neighbors in each iteration and adds it to the seed node set S until the stop conditions have been met. Then we update the nodes threshold and active set in each iteration considering the propagation process. Here we update the threshold and active set based on the Breadth First Search (BFS).

For the BFS Search Greedy Algorithm shown in Algorithm 1, firstly we choose the seed node with the largest number of inactive neighbors and add it to the seed set S. Then we update the threshold and activation time step (T) of an inactive neighbor node by adding the influence sent from the activated seed node. BFS search starts at the tree root and explores all of the neighbor nodes at the present depth prior to moving on to the nodes at the next depth level. Here we use a queue Q to store the parent nodes which will spread the influence within propagation step P.

figure a

4 Computational Algorithms Comparison

In this subsection, we assess and draw comparisons between different computational algorithms introduced in the Sect. 3 for the minimum influence and activation thresholds target set selection model.

We consider a subset of real-life social networks as datasets in our experiment: Karate Club [27], Hamster Friendships Network [15], Facebook Network Dataset [16] and LastFM Social Network Dataset [20]. We set the time limit of 3600 s for bold methods in the following experiments. For the method of degree centrality, we set the \(\rho \) as 0.2.

The Karate network is a network of very small size and the result is shown in Table 1. For the Karate Club network, we could solve the model directly. However, the Leaf Node method could accelerate the computation slightly without sacrificing the performance. For larger size network of Hamster Dataset, the Graph Partition method couldn’t generate a solution in one hour, so we don’t include here in Table 2. For Leaf Node method, the model is not feasible which means we couldn’t exclude all the leaf nodes as seed nodes for Minimum Influential Seeds model. We could see from the table the Original Model even performs better than the Degree Centrality method within the time limit of 3600 s. In addition, BFS Greedy method offers good solutions within much less time compared to the Original Model. The results of Facebook1 dataset are shown in Table 3. Facebook 1 is a dataset of low density. The Facebook 1 network has a large number of nodes with few friends. Thus it is easy for Gurobi to solve it directly. However, the Leaf node method has the shortest computation time for this dataset. The results of LastFM Asia Dataset are shown in Table 4, here Leaf Node performs the best compared with Original Model and Degree Centrality methods within the time limit of 3600 s.

Table 1. Karate network
Table 2. Hamster dataset
Table 3. Facebook 1 Dataset
Table 4. LastFM Asia Dataset

In summary, for the small size datasets, we could solve the problem directly using Gurobi. For the network of low density, especially when large portion of the nodes have few neighbors(long-tailed network), we could consider the Leaf Node method and Degree Centrality method. For the larger size datasets, normally the BFS Greedy will have better performance. The Graph Partition has poor performance and long computational time for the selected social media networks. For the Graph Partition method, it could result from the structure of network which is hard to divide into subgraphs. Furthermore, even it is divided properly, sometimes the size of the subgraph is still hard to solve directly.

5 Conclusion

The increasing popularity of social media networks has created the need for businesses, politicians and organizations to find influential users in social media to spread the influence. In this work, we have addressed the problem through developing the minimum influence and activation thresholds target set selection model. Our model allows us to find the minimum seed nodes that influence all the nodes at time T. In addition, we provide different computational algorithms to tackle the various datasets as well. They are Graph Partition, Leaf Node, Degree Centrality and BFS Greedy computational algorithms. Experiments in various datasets show that BFS Greedy is much more efficient than the other methods for large size datasets. Besides, leaf node deletion and degree centrality selection perform better in terms of long-tailed network.