1 Introduction

With the popularity of online social networks such as Facebook, Twitter and WeChat, these networks play an increasingly important role in daily communication among people. Many researchers have studied diffusion phenomena in social networks, such as the diffusion of news and opinions [1, 2], the adoption of products [3], the spread of infectious diseases [4–6], etc. Influence maximization is a fundamental problem of diffusion in social networks, and one of its applications is viral marketing [3, 7, 8]. There have been many successful commercial instances of viral marketing in real life, such as Nike Inc., which used orkut.com and facebook.com to market products successfully [9], and the Hotmail phenomenon [10].

Focusing on how to model the diffusion process, researchers have proposed various diffusion models for the diffusion of innovations, ideas, etc. [12–15]. Randomness, the cumulative effect and decay are the main characteristics of propagation. Most existing models describe the first two characteristics, but few researchers focus on the decay of influence diffusion. In short, diffusion decay refers to the decay of influence during the diffusion. For example, Tom reads an interesting piece of news and may forward it to his friends with probability p on the first day. If he does not, he may still forward it with probability p′ on the second day (p′ < p), and p′ decreases with time. That is, in the real diffusion process, influence decreases as time goes on, which is reflected by the decrease of the activation probability of a node. Thus, a model that does not consider diffusion decay cannot simulate the actual spread process well. Furthermore, modeling the spread process with diffusion decay is critical for analyzing the influence maximization problem in social networks, which is exactly the problem we solve in this paper.

In our study, we focus on the problem of selecting the seeds to maximize the influence spread considering diffusion decay in a social network. For this purpose, we consider the following problems:

  (1) How to model the spread process with diffusion decay?

  (2) How to select k seeds to maximize the influence spread?

For problem (1), it is natural to extend the classic independent cascade model (IC model) [8] to incorporate an influence probability that decays with time. We call the resulting model the cascade model with diffusion decay, abbreviated as CMDD. In the CMDD, the activation probability is influenced by the following three factors: the previous cumulative effects, the influence power of newly activated neighbor nodes and the decay factor. The activation probability of node v is a function of these three factors, which reflects realistic characteristics of influence spread in a social network well.

For problem (2), selecting k seeds to maximize the influence spread under the CMDD is NP-hard. Whether the CMDD, which is defined upon the IC model, still keeps monotonicity and submodularity is the key and difficult part of our work. We prove the monotonicity and submodularity of the objective function, and thus the greedy algorithm can be used to approximate the optimal result with a ratio of (1 − 1/e), based on the theoretical conclusion given by Nemhauser et al. [11].

To test the feasibility of the method proposed in this paper, we implement our algorithms and conduct corresponding experiments.

The remainder of this paper is organized as follows. In Sect. 2, we introduce related work. In Sect. 3, we give the CMDD to model the influence spread of nodes. In Sect. 4, we obtain the objective function of influence maximization under the CMDD, and prove the monotonicity and submodularity of this objective function. In Sect. 4.2, we exploit the approximation algorithm to maximize the influence spread. In Sect. 5, we show the experimental results and performance studies. Finally, in Sect. 6, we conclude and discuss future work.

2 Related Work

Domingos et al. [7] were the first to discuss influence maximization as an algorithmic problem. They modelled the customer network as a graph and used a Markov random field to calculate the influence probabilities among customers. In the aspect of modeling the diffusion process of influence, many researchers proposed various methods of influence maximization from various perspectives [8, 12–15].

Kempe et al. [8] formulated the problem of selecting a set of influential individuals to maximize the influence spread as a discrete optimization problem and proposed the independent cascade model (IC model) and the linear threshold model (LT model) based on earlier works [16–19]. The key feature of the IC model is that diffusion events along every arc in the social graph are mutually independent [20]. The LT model reflects the cumulative effect of influence during the process of propagation, and the IC model reflects the randomness of node activation. In this paper, our CMDD is based on the IC model, and it not only retains the cumulative effect of the LT model, but also describes the influence decay during the diffusion. Our CMDD retains the monotonicity and submodularity of both the LT and IC models.

Saito et al. [12] presented a method for predicting diffusion probabilities by using the Expectation Maximization algorithm based on the IC model. Yang and Leskovec [14] presented the linear influence model (LIM) to model the global influence of nodes. Goyal et al. [13] proposed three models, namely a static model, a continuous time model and a discrete time model, in which the influence probabilities are relative to the action log instead of the discrete time step. A dynamic activation probability is not discussed in these works. In our work, we employ Inf to describe the influence power of a node, which can reflect not only the graph characteristics but also some actual factors. The activation probability of a node changes with its recently activated neighbours and the decay factor during the diffusion.

In the time-critical influence maximization problem, Chen et al. [23] extended the IC model and the LT model to incorporate the time delay aspect of influence diffusion, but the diffusion decay is not considered. Liu et al. [24] defined a time-constrained activation probability, which is an assumed value at different times. In the CMDD, we mainly consider that the influence diffusion probabilities of nodes decay over time steps. Specifically, the probability at time t is a function of the probability at time t−1, the decay factor and the influence of the neighbours activated at t−1, and is therefore dynamic.

In the aspect of how to select seeds, many researchers proposed heuristics and tried to solve the influence maximization problem more efficiently [8, 21, 22]. In terms of algorithm design, our work follows the idea given in [8]. Selecting the optimal seeds is NP-hard under the CMDD, so the greedy algorithm is used to approximate the optimal result based on the mathematical theory given in [11].

3 Cascade Model with Diffusion Decay

A social network is denoted as an undirected graph G = (V, E), where V is the set of nodes representing individuals and E is the set of edges representing the relationships among individuals. There are two classic diffusion models. One model describing how influence spreads in a social network is the LT model [8], which considers the influence accumulation of diffusion over time steps. The other is the IC model, in which an activated node u tries to activate its neighbor v with an initial probability \( p_{uv} \) only once [8].

In this paper, we propose the CMDD based on the IC model. CMDD combines the time step characteristics of influence diffusion and the influence accumulation. In this model, each node is either active or inactive. At step t, the node v is activated with probability \( p_{v}^{t} \), which can be described as follows:

$$ p_{v}^{t} = \alpha \times p_{v}^{t - 1} + \frac{{\sum\nolimits_{{w \in A_{t - 1} \cap N(v)}} {Inf_{w} } }}{{Inf_{v} + \sum\nolimits_{u \in N(v)} {Inf_{u} } }} $$
(1)

where \( A_{t-1} \), N(v) and α denote the set of nodes activated at step t−1, the neighbors of node v and the decay parameter of influence respectively, with 0 ≤ α ≤ 1. The decay parameter can be a constant or an exponential function with parameters depending on time. To facilitate the discussion, we employ a constant. For α = 0, this model is similar to the IC model. For α > 0, this model reflects both the random property and the influence accumulation of the LT model. The greater the value of α, the slower the process of influence decay. \( Inf_{w} \) denotes the influence power of node w, such as the node’s importance degree.
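As a minimal sketch of how Eq. (1) can be evaluated, the following Python function updates the activation probability of one node for one step. The function name, the dictionary-based graph representation and the four-node star graph are assumptions made for illustration only; as in Example 1, Inf is taken to be the node degree.

```python
def activation_probability(v, prev_prob, newly_active, neighbors, inf, alpha):
    """Evaluate Eq. (1) for node v.

    prev_prob    : p_v^{t-1}
    newly_active : A_{t-1}, the nodes activated at step t-1
    neighbors    : dict mapping each node to the set N(node)
    inf          : dict mapping each node to its influence power Inf
    alpha        : decay parameter, 0 <= alpha <= 1
    """
    gained = sum(inf[w] for w in neighbors[v] if w in newly_active)
    total = inf[v] + sum(inf[u] for u in neighbors[v])
    return alpha * prev_prob + gained / total


# Hypothetical star graph: centre "b" with leaves "a", "c" and "d".
neighbors = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
inf = {v: len(ns) for v, ns in neighbors.items()}  # Inf_v = degree (assumption)

# Step 1: only "a" was activated at the previous step, and p_b^0 = 0.
p1 = activation_probability("b", 0.0, {"a"}, neighbors, inf, alpha=0.8)
# Step 2: "b" stayed inactive and "c" was activated in the meantime.
p2 = activation_probability("b", p1, {"c"}, neighbors, inf, alpha=0.8)
print(p1, p2)
```

Here the denominator for node b is 3 + (1 + 1 + 1) = 6, so the first step gives 1/6 and the second step gives 0.8 × 1/6 + 1/6 = 0.3, which shows how the decayed previous probability and the newly gained influence accumulate.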

Example 1.

Figure 1 shows an example of the diffusion process of the CMDD. We assume α = 0.8 and \( Inf_{v} \) is the degree of node v. Initially at t = 0, one seed \( v_{6} \) is activated. At step t = 1, \( v_{6} \) tries to activate its inactive neighbors with probabilities \( p_{{v_{1} }}^{1} = 0.308 \), \( p_{{v_{3} }}^{1} = 0.267 \), \( p_{{v_{7} }}^{1} = 0.4 \) and \( p_{{v_{8} }}^{1} = 0.4 \) respectively. At step t = 2, \( v_{1} \) and \( v_{7} \) are randomly activated, but \( v_{3} \) and \( v_{8} \) remain inactive, and then the activation probabilities of \( v_{2} \), \( v_{3} \) and \( v_{8} \) are \( p_{{v_{2} }}^{2} = 0.375 \), \( p_{{v_{3} }}^{2} = 0.547 \) and \( p_{{v_{8} }}^{2} = 0.32 \). Similarly, we can obtain the activation probabilities of nodes at step t = 3.

Fig. 1. The diffusion process of the CMDD (α = 0.8). In (a), (b) and (c), grey nodes denote the newly activated nodes, black nodes denote the activated nodes that have lost the ability to activate others, and the other nodes are inactive.
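The process illustrated in Fig. 1 can be sketched as a single random run in Python. This is only an illustrative simulation under stated assumptions, not the authors' implementation: every probability starts at \( p_{v}^{0} = 0 \), probabilities are clipped at 1, each inactive node is tested once per step, and nodes whose denominator in Eq. (1) is zero are skipped. The function name and the small graph are hypothetical.

```python
import random


def simulate_cmdd(neighbors, inf, seeds, alpha, steps, rng):
    """Run one random CMDD diffusion and return the set of activated nodes."""
    active = set(seeds)                  # all nodes activated so far
    frontier = set(seeds)                # A_{t-1}: nodes activated in the previous step
    prob = {v: 0.0 for v in neighbors}   # p_v^{t-1}, assumed to start at 0
    for _ in range(steps):
        newly = set()
        for v in neighbors:
            if v in active:
                continue
            total = inf[v] + sum(inf[u] for u in neighbors[v])
            if total <= 0:               # isolated node with zero influence: skip
                continue
            gained = sum(inf[w] for w in neighbors[v] if w in frontier)
            # Eq. (1), clipped at 1 so that it stays a probability (assumption).
            prob[v] = min(1.0, alpha * prob[v] + gained / total)
            if rng.random() < prob[v]:
                newly.add(v)
        active |= newly
        frontier = newly                 # older active nodes no longer spread
    return active


# Hypothetical star graph with Inf_v = degree.
neighbors = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
inf = {v: len(ns) for v, ns in neighbors.items()}
result = simulate_cmdd(neighbors, inf, {"a"}, alpha=0.8, steps=3, rng=random.Random(0))
print(sorted(result))
```

Averaging the size of the returned set over many runs gives a Monte Carlo estimate of the spread of a seed set, which is one possible way to evaluate candidate seeds.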

4 Maximizing Influence Spread Under CMDD

In this section, we define the objective function of influence maximization problem under the CMDD, which is NP-hard. Then, we show that the objective function is monotone and submodular, which leads to a greedy approximation based on the theory given by Nemhauser et al. [11].

4.1 Objective Function of Influence Maximization Problem

The influence maximization problem is an optimization problem: given a graph G = (V, E) and the number of seeds k, we want to find a seed set S of size k such that the expected number of activated nodes is maximized. We first consider the objective function of the influence maximization problem.

At step t = 0, \( A_{0}(S) = S \), and the expected activated value of influence maximization under the CMDD is \( E_{0}(S) = |A_{0}(S)| \). The expected activated value at step t is obtained as follows:

$$ E_{t} (S) = \alpha \times E_{t - 1} (S) + \sum\limits_{{i \in V\backslash A_{t - 1} (S)}} {\frac{{\sum\limits_{{k \in A_{t - 1} (S) \cap N(i)}} {Inf_{k} } }}{{Inf_{i} + \sum\limits_{j \in N(i)} {Inf_{j} } }}} $$
(2)

The overall expected activated value over t steps is the sum of the expected activated values of the individual steps, that is,

$$ E(S) = \sum\limits_{\tau = 0}^{t} {E_{\tau } } (S) = E_{0} (S) + \sum\limits_{\tau = 1}^{t} {\left( {\alpha \times E_{\tau - 1} (S) + \sum\limits_{{i \in V\backslash A_{\tau - 1} (S)}} {\frac{{\sum\limits_{{k \in A_{\tau - 1} (S) \cap N(i)}} {Inf_{k} } }}{{Inf_{i} + \sum\limits_{j \in N(i)} {Inf_{j} } }}} } \right)} $$
(3)
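To make the recursion in Eqs. (2) and (3) concrete, the sketch below evaluates the per-step values and their sum for a given sequence of activated sets. It assumes that this sequence is supplied by the caller (for example from one simulated run) and that the value at step 0 is the size of the seed set; the function names and the toy graph are hypothetical.

```python
def step_gain(neighbors, inf, frontier):
    """Second term of Eq. (2): summed influence ratio over nodes outside A_{t-1}."""
    gain = 0.0
    for i in neighbors:
        if i in frontier:
            continue
        total = inf[i] + sum(inf[j] for j in neighbors[i])
        if total > 0:
            gain += sum(inf[k] for k in neighbors[i] if k in frontier) / total
    return gain


def expected_values(neighbors, inf, frontiers, alpha):
    """Return [E_0, E_1, ..., E_T] for frontiers = [A_0, ..., A_{T-1}].

    frontiers[0] is the seed set S, and E_0 is taken as |S| (assumption).
    """
    values = [float(len(frontiers[0]))]
    for frontier in frontiers:
        values.append(alpha * values[-1] + step_gain(neighbors, inf, frontier))
    return values


# Hypothetical star graph with Inf_v = degree and seed set S = {"a"}.
neighbors = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
inf = {v: len(ns) for v, ns in neighbors.items()}
values = expected_values(neighbors, inf, [{"a"}], alpha=0.8)
overall = sum(values)  # Eq. (3): sum of the per-step values
print(values, overall)
```

For this toy input the only node gaining influence is b, with ratio 1/6, so the value at step 1 is 0.8 × 1 + 1/6.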

Selecting the optimal seed set that maximizes this objective function under the CMDD is NP-hard. We next show that the objective function is monotone and submodular.

Obviously, we have

$$ E_{t} (S \cup \left\{ u \right\}) \ge E_{t} \left( S \right) $$
(4)

Thus, the objective function E(S) is monotone.

We now prove the submodularity of objective function E(S).

Theorem 1.

The objective function E(S) is submodular, that is, for all subsets \( S_{1} \subseteq S_{2} \subseteq V \) and \( u \in V\backslash S_{2} \), we have \( E(S_{1} \cup \{ u\} ) - E(S_{1} ) \ge E(S_{2} \cup \{ u\} ) - E(S_{2} ) \).

Proof.

We prove Theorem 1 by mathematical induction.

  • At step t = 1, the objective function E(S) is obviously submodular.

  • At step t − 1, if the objective function is submodular, then we have

$$ E_{t - 1} \left( {S_{1} \cup \left\{ u \right\}} \right) - E_{t - 1} \left( {S_{1} } \right) \ge E_{t - 1} \left( {S_{2} \cup \left\{ u \right\}} \right) - E_{t - 1} \left( {S_{2} } \right) $$
(5)

At step t, we have

$$ {\begin{aligned} & E_{t} (S_{1} \cup \{ u\} ) - E_{t} (S_{1} ) \\ & = \alpha (E_{t - 1} (S_{1} \cup \{ u\} ) - E_{t - 1} (S_{1} )) + \sum\limits_{{i \in V\backslash A_{t - 1} (u)}} {\frac{{\sum\limits_{{k \in A_{t - 1} (u) \cap N(i)}} {Inf_{k} } }}{{Inf_{i} + \sum\limits_{j \in N(i)} {Inf_{j} } }}} - \left( {\sum\limits_{{i \in V\backslash A_{t - 1} (u \cap S_{1} )}} {\frac{{\sum\limits_{{k \in A_{t - 1} (u \cap S_{1} ) \cap N(i)}} {Inf_{k} } }}{{Inf_{i} + \sum\limits_{j \in N(i)} {Inf_{j} } }}} } \right) \\ \end{aligned}} $$
(6)

We have a similar expression for \( E_{t} (S_{2} \cup \{ u\} ) - E_{t} (S_{2} ) \). We can view the activation process as flipping coins. Based on Inequality (5) and Eq. (6), we have

$$ E_{t} (S_{ 1} \cup \left\{ u \right\}) - E_{t} \left( {S_{ 1} } \right) \, \ge E_{t} (S_{ 2} \cup \left\{ u \right\}) - E_{t} \left( {S_{ 2} } \right) $$
(7)

A non-negative linear combination of submodular functions is also submodular, so we have \( E(S_{1} \cup \{ u\} ) - E(S_{1} ) \ge E(S_{2} \cup \{ u\} ) - E(S_{2} ) \).
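The two properties used above can be checked numerically on small instances by brute force. The sketch below is such a checker for an arbitrary set function; it is exponential in the size of the ground set and is meant only as a sanity check, not as a substitute for the proof. The set function used in the example is a simple neighbourhood coverage function, a stand-in chosen for illustration and not the CMDD objective; all names are hypothetical.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotone_submodular(f, ground, tol=1e-12):
    """Check f(S1) <= f(S2) and diminishing marginal gains for all S1 ⊆ S2."""
    ground = frozenset(ground)
    all_sets = list(subsets(ground))
    for s1 in all_sets:
        for s2 in all_sets:
            if not s1 <= s2:
                continue
            if f(s2) < f(s1) - tol:                       # monotonicity violated
                return False
            for u in ground - s2:
                gain1 = f(s1 | {u}) - f(s1)
                gain2 = f(s2 | {u}) - f(s2)
                if gain1 < gain2 - tol:                   # submodularity violated
                    return False
    return True


# Stand-in set function: number of nodes covered by S and its neighbours.
neighbors = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}


def coverage(s):
    return len(set(s).union(*(neighbors[v] for v in s)))


print(is_monotone_submodular(coverage, neighbors))               # coverage function
print(is_monotone_submodular(lambda s: len(s) ** 2, neighbors))  # |S|^2 is not submodular
```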

4.2 Greedy Algorithm for Influence Maximization Problem

We have proven that the objective function of the influence maximization problem under the CMDD is monotone and submodular. According to the result proposed in [11], the greedy algorithm given in Algorithm 1 can be used to approximate the optimal result with an approximation ratio of 1 − 1/e. In each iteration, the algorithm selects the node that provides the largest marginal gain and adds it to the seed set, so one node is selected as a seed each time.

The running time of Algorithm 1 is determined by the greedy part at step 3. The time complexity of Algorithm 1 is \( O(k n d_{1} d_{2}) \), where k is the number of seeds, n is the number of nodes in the network, \( d_{1} \) is the average degree of nodes and \( d_{2} \) is the maximum distance from other inactive nodes to node v when we calculate the contribution of v.
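The greedy selection described in this section can be sketched generically as follows. This is an illustration of the standard greedy scheme for a monotone submodular function and not a reproduction of Algorithm 1: the spread function is passed in as a callable, and in the example it is replaced by a simple coverage function rather than the CMDD objective E(S). The names greedy_seeds and coverage are hypothetical.

```python
def greedy_seeds(nodes, k, spread):
    """Pick up to k seeds, each time adding the node with the largest marginal gain."""
    seeds = set()
    for _ in range(k):
        base = spread(seeds)
        best, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = spread(seeds | {v}) - base   # marginal gain of adding v
            if gain > best_gain:                # ties keep the earlier node
                best, best_gain = v, gain
        if best is None:                        # no candidate left
            break
        seeds.add(best)
    return seeds


# Hypothetical graph: a star centred at "b" plus a separate edge "e"-"f".
neighbors = {
    "a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"},
    "e": {"f"}, "f": {"e"},
}


def coverage(s):
    """Stand-in spread: number of nodes covered by s and its neighbours."""
    return len(set(s).union(*(neighbors[v] for v in s)))


chosen = greedy_seeds(list(neighbors), 2, coverage)
print(sorted(chosen))
```

With this stand-in function the first pick is the star centre b, which covers four nodes, and the second pick comes from the separate edge, since every remaining star node adds nothing.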

5 Experimental Results

To test the feasibility and effectiveness of selecting seeds under the CMDD, we implemented our method and made corresponding performance studies.

5.1 Experimental Setup

The ca-HepTh and ca-GrQc datasets are collaboration networks extracted from the e-print arXiv (http://arXiv.org/). The former covers the “High Energy Physics-Theory” (HEP-TH) category and the latter covers the “General Relativity” category. The nodes in these two networks are authors, and an edge between two nodes means that the two authors coauthored at least one paper. The p2p-Gnutella08 dataset records the Gnutella peer-to-peer network from August 8 2002, where nodes represent hosts in the Gnutella network topology and edges represent connections between the Gnutella hosts (Table 1).
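As a rough sketch of how such a network could be loaded into the adjacency structure used in Sect. 3, the function below parses a plain edge list. It assumes that each non-comment line holds two whitespace-separated node identifiers and that comment lines start with "#"; the actual file layout of the datasets should be checked before use. Edges are stored in both directions because the graph is treated as undirected, and the sample lines are made up.

```python
from collections import defaultdict


def read_undirected_edges(lines):
    """Build {node: set of neighbours} from an iterable of edge-list lines."""
    neighbors = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blank and comment lines
            continue
        u, v = line.split()[:2]
        if u == v:                             # ignore self-loops
            continue
        neighbors[u].add(v)
        neighbors[v].add(u)
    return dict(neighbors)


# Made-up sample in the assumed format.
sample = [
    "# FromNodeId\tToNodeId",
    "1\t2",
    "2\t1",
    "2\t3",
    "3\t3",
]
graph = read_undirected_edges(sample)
inf = {v: len(ns) for v, ns in graph.items()}  # Inf_v = degree, as in the experiments
print(graph, inf)
```

For a file on disk, the same function can be called with an open file object, since it only iterates over lines.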

Table 1. Statistics of the two real-world networks in resulting graph

5.2 Performance Studies

First, we tested the convergence rate of influence spread in ca-HepTh. In this experiment, we tested the influence spread under the CMDD with α = 0.4 and α = 0.8 respectively, where the number of spread time steps was t = 5 and the node with userID = 1441 was chosen because its high degree gives a clear experimental result. Figure 2 shows that the convergence with α = 0.4 was faster than that with α = 0.8, and the number of the convergence of influence spread with α = 0.4 and α = 0.8 were 1000 and 3000 respectively. This is because the accumulated influence of a node decreases more slowly when the value of α is greater.

Fig. 2. The convergence rate of influence spread with different α values

Fig. 3. The expectation value of active nodes with different α values

Then, we tested how the expectation value of influence spread varies with α in ca-HepTh. We compared the expectation values of influence spread with α = 0, α = 0.2, α = 0.4, α = 0.6 and α = 0.8, where the spread time step ranged from 1 to 10 and the node had userID = 63113. The comparison is shown in Fig. 3. It can be seen that the greater the value of α, the greater the expectation value under the CMDD, since the influence probabilities of nodes decrease more slowly when α is greater.

The max-degree algorithm [25] is regarded as an effective algorithm for networks with power-law distributions: it sorts the nodes by degree and selects the k nodes with the largest degree as seeds. The random algorithm selects k seeds randomly. Finally, we tested the effectiveness of Algorithm 1. In this experiment, we selected 20 seeds with Algorithm 1 with Depth = 1 and Depth = 2, the max-degree algorithm (denoted as Max-degree) and the random algorithm (denoted as Random) to maximize the influence spread in ca-GrQc and p2p-Gnutella08, and set α = 0.4 and t = 3. The depth in the greedy algorithm is the maximum node distance we consider. If Depth = 1, we only consider the neighbors of active nodes. If Depth = 2, we consider not only the neighbors of active nodes but also the neighbors of their neighbors. Figure 4(a) shows that the greedy algorithm (denoted as Greedy) is better than Max-degree and outperforms Random. In Fig. 4(b) and (c), however, the greedy algorithm is close to Max-degree, since the \( Inf_{v} \) of node v is calculated by degree in our experiments. These results verify that our proposed CMDD and the corresponding algorithm are feasible.
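The two baselines described above are simple enough to sketch directly. The code below is an illustration under stated assumptions: the graph is the dictionary-of-neighbour-sets structure, ties between nodes of equal degree are broken by iteration order, and the function names and toy graph are hypothetical.

```python
import random


def max_degree_seeds(neighbors, k):
    """Return the k nodes with the largest degree (ties keep iteration order)."""
    ranked = sorted(neighbors, key=lambda v: len(neighbors[v]), reverse=True)
    return ranked[:k]


def random_seeds(neighbors, k, rng):
    """Return k distinct nodes chosen uniformly at random."""
    return rng.sample(sorted(neighbors), k)


# Hypothetical graph: a star centred at "b" plus a separate edge "e"-"f".
neighbors = {
    "a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"},
    "e": {"f"}, "f": {"e"},
}
top = max_degree_seeds(neighbors, 1)
picked = random_seeds(neighbors, 2, random.Random(0))
print(top, picked)
```

Because Max-degree ranks nodes by exactly the quantity used as \( Inf_{v} \) in the experiments, it is plausible that its seed sets overlap heavily with those of the greedy algorithm, which matches the explanation given for Fig. 4(b) and (c).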

Fig. 4. Expectation value of active nodes with different algorithms in ca-GrQc, p2p-Gnutella08 and ca-HepTh

6 Conclusions and Future Works

In this paper, we redefined the node activation probability and proposed the CMDD, which is close to the real diffusion process. The CMDD reflects the change of probability with time steps and newly activated nodes, while retaining the cumulative effect and randomness. We then proved the monotonicity and submodularity of the objective function, so that the greedy algorithm can be used to approximate the optimal result.

However, our algorithm is not far superior to the max-degree algorithm on some datasets. This is because the \( Inf_{v} \) of node v is calculated by degree in our experiments. We will extend our experiments to some real networks in which \( Inf_{v} \) is determined by actual factors. Furthermore, employing a constant to describe the diffusion decay parameter has its limitations. A decay factor function that better describes the real spread process in a social network is still worth discussing. These are our next research directions.