1 Introduction

Social networks play an important role as a medium for the spread of various kinds of information. Examples include the diffusion of disease [1], the propagation of viruses and even malicious rumors [2–4], and the diffusion of product information through viral marketing [5, 6]. Understanding the dynamics of these networks may help us to control diseases (or computer viruses), minimize the spread of rumors, and promote products. In this paper, we take product promotion in social networks (i.e., viral marketing) as the background. Viral marketing takes advantage of “word-of-mouth” among the relationships of individuals to spread the influence of products. The spread of viral marketing in a social network can be described as follows. First, we select some initial nodes (i.e., seeds) and give them free samples or product information. Then these initial nodes pass the product information to their friends, who in turn pass it to their friends, and so on; this is called cascade spread. Finally, a large portion of nodes will be influenced by these seeds [5].

The problem of selecting these seeds so as to maximize the influence spread is known as influence maximization. Domingos and Richardson [7] considered influence maximization as an algorithmic problem, where the customer network was modeled as a graph and a Markov random field was used to calculate influence propagation among customers. Kempe et al. [8] formulated influence maximization as a discrete optimization problem and proposed two diffusion models based on earlier work [9–12]: the independent cascade model (ICM) and the linear threshold model (LTM), under which the goal is to select k seeds that maximize the influence spread. The problem is NP-hard, but a greedy algorithm can approximate the optimal result thanks to the submodularity of the objective function.

To design heuristic algorithms that scale easily to large social networks, some researchers have improved the scalability of the greedy algorithm of Kempe et al. for influence maximization [14–19]. A further challenge in the study of influence maximization is that two or more ideas or products frequently compete for influence in a social network [20–23], such as similar products of the competing companies Apple and Samsung, or two political candidates of opposing parties such as Bush and Hillary. Each competitor wants to attract people’s attention and spread its influence as widely as possible in the social network. Thus, it is necessary to select the initial nodes that spread the influence via the relationships among individuals, which is exactly the problem that we address in this paper.

In our study, we focus on the problem of selecting seeds to maximize the competitive influence spread in a social network: given the seed set \( I_{B} \) of a competing product and the budget k of our own product, how can we select k seeds to maximize the competitive influence spread under a certain diffusion model? For this purpose, we consider the following problems:

  (1) How to construct the spread model of competitive influence spread?

  (2) How to select the k seeds?

We first denote the social network as a directed graph G = (V, E), where V is the set of nodes representing individuals and E is the set of directed edges representing relationships among the individuals. Each edge e(u, v) in G is associated with a propagation probability p(u, v), where 0 < p(u, v) ≤ 1.

For problem (1), it is natural to consider the classical IC model, a popular influence diffusion model that describes how influence propagates through the network starting from the initial seed nodes. Chen et al. [25] proved that computing the influence spread of a given seed set under the IC model is #P-hard, where the hardness of calculating the influence is due to the probability P(u, v) of edge e(u, v). If each edge is deterministic (i.e., the probability of each edge is exactly 1), then breadth-first search (BFS) can be used to obtain the nodes influenced by a seed set. Therefore, a linear-time algorithm for computing the influence spread can be obtained in a deterministic graph [26]. In this paper, we take advantage of possible graphs to effectively obtain the active nodes of the competitive influence spread. We first select top possible graphs from all possible graphs to effectively approximate the optimal result. We then give the competitive influence spread model (CISM) to describe the competitive diffusion process in a possible graph. The construction of the CISM can be described as follows. Initially, two sets of nodes in the social network are selected as the seeds of A and B respectively; these nodes are activated and denoted as A-activated and B-activated respectively. At each step, the A-activated and B-activated nodes activate their out-neighbors with probability 1 through the “live-edges” of the possible graph, and when a node is reached by both A and B, the influence of A dominates.

For problem (2), the optimization problem of selecting the most effective k seeds given the seed set of B is NP-hard under the CISM. Because the objective function is monotone and submodular, we adopt the CELF algorithm to approximately solve the competitive influence maximization problem with an approximation ratio of 1 − 1/e. The CELF algorithm is an accelerated greedy algorithm that avoids evaluations when they are not necessary. Starting from an empty set, the CISM with the CELF algorithm iteratively selects the currently best seed from \( V - I_{B} \), i.e., the one that maximizes the competitive influence spread, until k seeds are selected.

To test the efficiency and effectiveness of the CELF algorithm for the CISM under the possible graphs, we implement our algorithms and conduct experiments that show their feasibility.

The remainder of this paper is organized as follows. In Sect. 2, we introduce the idea to obtain the possible graphs. In Sect. 3, we give the competitive influence spread model of the possible graph. In Sect. 4, we exploit the approximate algorithm to maximize the competitive influence spread. In Sect. 5, we show experimental results and performance studies. Finally in Sect. 6, we conclude and discuss further work.

2 Generating Possible Graphs

In a social network, calculating the influence spread of a given seed set under the IC model and LT model is #P-hard. Similarly, in this paper, calculating the influence of A under the IC model and LT model is also #P-hard given the seed sets \( I_{B} \) and \( I_{A} \). The hardness derives from the probabilities \( P_{uv}^{A} \) and \( P_{uv}^{B} \) of edge e(u, v). Therefore, we use an approximate algorithm to calculate the influence of A and that of B simultaneously.

Hu et al. [17] proposed possible graphs, similar to the subgraphs proposed by Chen et al. [24], where the possible graphs are generated by the following idea.

For a given directed graph G = (V, E, P) with n nodes and m edges, each edge is either a “live-edge” or a “block-edge”, so there are \( 2^{m} \) possible graphs of G. Let \( G^{\prime} = (V^{\prime},E^{\prime}) \) denote a possible graph of G, where \( V^{\prime} = V \), \( E^{\prime} \subseteq E \), and \( P^{\prime}(e) = 1 \) for all \( e \in E^{\prime} \). The existence probability of the possible graph \( G^{\prime} \) is as follows [24].

$$ P(G^{\prime}) = \prod\limits_{{e \in G^{\prime}}} {P(e)} \times \prod\limits_{{e^{\prime} \in G\backslash G^{\prime}}} {(1 - P(e^{\prime})} ) $$
(1)
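As a minimal sketch of Eq. (1), the following Python fragment computes the existence probability of one possible graph and draws a possible graph by flipping one independent coin per edge. The function names, the dictionary representation of P, and the toy two-edge graph are assumptions made for illustration; the sampling step is one common way to obtain possible graphs and is not necessarily the selection procedure used in this paper.

```python
import random
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def possible_graph_probability(probs: Dict[Edge, float], live: Set[Edge]) -> float:
    """Eq. (1): product of P(e) over live edges times (1 - P(e')) over blocked edges."""
    result = 1.0
    for edge, p in probs.items():
        result *= p if edge in live else (1.0 - p)
    return result


def sample_possible_graph(probs: Dict[Edge, float], rng: random.Random) -> Set[Edge]:
    """Draw one possible graph: each edge is live independently with probability P(e)."""
    return {edge for edge, p in probs.items() if rng.random() < p}


# Hypothetical two-edge graph, used only for illustration.
toy_probs = {("u", "v"): 0.5, ("v", "w"): 0.2}
all_live = possible_graph_probability(toy_probs, {("u", "v"), ("v", "w")})
none_live = possible_graph_probability(toy_probs, set())
sampled = sample_possible_graph(toy_probs, random.Random(0))
```

Summing `possible_graph_probability` over all \( 2^{m} \) edge subsets gives 1, which is a quick sanity check on any implementation of Eq. (1).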

Based on the work in [17, 24], we propose the method for calculating the probability of each possible graph of competitive influence spread, in which each edge e(u, v) has two diffusion probabilities, \( P_{uv}^{A} \) and \( P_{uv}^{B} \). In order to generate the possible graphs of competitive influence spread, we consider the following two situations: \( P_{uv}^{A} \ne P_{uv}^{B} \) and \( P_{uv}^{A} = P_{uv}^{B} \).

If \( P_{uv}^{A} \ne P_{uv}^{B} \), then the existence probability of the possible graph \( G^{\prime}_{A} \) (resp. \( G^{\prime}_{B} \)) equals the product of the probabilities of all its live edges multiplied by the product of the complements of the probabilities of all blocked edges, where P(e) stands for the diffusion probability of A (resp. B) on edge e. Formally,

$$ P(G^{\prime}_{A} ) = \prod\limits_{{e \in G^{\prime}_{A} }} {P(e)} \times \prod\limits_{{e^{\prime} \in G\backslash G^{\prime}_{A} }} {(1 - P(e^{\prime})} ) $$
(2)
$$ P(G^{\prime}_{B} ) = \prod\limits_{{e \in G^{\prime}_{B} }} {P(e)} \times \prod\limits_{{e^{\prime} \in G\backslash G^{\prime}_{B} }} {(1 - P(e^{\prime})} ) $$
(3)

If \( P_{uv}^{A} = P_{uv}^{B} \), then the probability of each possible graph can be described as

$$ P(G^{\prime}_{A} ) = P(G^{\prime}_{B} ) = \prod\limits_{e \in G^{\prime}_{A}} P(e) \times \prod\limits_{e^{\prime} \in G\backslash G^{\prime}_{A}} (1 - P(e^{\prime})) = \prod\limits_{e \in G^{\prime}_{B}} P(e) \times \prod\limits_{e^{\prime} \in G\backslash G^{\prime}_{B}} (1 - P(e^{\prime})) $$
(4)

In this paper, we assume \( P_{uv}^{A} = P_{uv}^{B} \), which can be easily extended to \( P_{uv}^{A} \ne P_{uv}^{B} \).

Example 1.

In Fig. 1(a), \( v_{1} \) is selected as the seed at step t = 0. At step t = 1, \( v_{1} \) tries to activate \( v_{2} \) and \( v_{3} \); it successfully activates \( v_{3} \) but fails to activate \( v_{2} \). In Fig. 1(b), the edge \( e(v_{1}, v_{3}) \) is called a “live-edge”, and the edge \( e(v_{1}, v_{2}) \) is called a “block-edge”.

Fig. 1.

Example of a “live-edge” or “block-edge” graph and a possible graph. In (a) and (b), green nodes denote active nodes, and white nodes denote inactive nodes. A solid green arc from node \( v_{1} \) to \( v_{3} \) means that \( v_{1} \) successfully activates \( v_{3} \) through this arc. A dotted green arc from node \( v_{1} \) to \( v_{2} \) means that \( v_{1} \) fails to activate \( v_{2} \). (c) is the possible graph of G obtained from (a), where \( V^{\prime} = V \), and \( E^{\prime} = \{ e(v_{ 1} ,v_{ 3} ),e(v_{ 2} ,v_{ 3} )\} \subseteq E \). (Color figure online)

Based on Eq. (1), we can obtain the probability of possible graph \( G^{\prime} \) as follows:

$$ \begin{aligned} P(G^{\prime}) & = P(v_{1},v_{3}) \times P(v_{3},v_{2}) \times (1 - P(v_{1},v_{2})) \times (1 - P(v_{2},v_{3})) \times (1 - P(v_{2},v_{4})) \times (1 - P(v_{3},v_{4})) \\ & = 0.06048 \end{aligned} $$

3 The Competitive Influence Spread Model

We now give the diffusion model of the possible graph for competitive influence spread.

In a possible graph \( G^{\prime} = (V^{\prime},E^{\prime}) \), the diffusion probabilities of the two products along a “live-edge” are \( P_{{(u^{\prime},v^{\prime})}}^{A} \) and \( P_{{(u^{\prime},v^{\prime})}}^{B} \), where \( P_{{(u^{\prime},v^{\prime})}}^{A} = P_{{(u^{\prime},v^{\prime})}}^{B} = 1 \). We can describe the competitive influence spread model (CISM) of a possible graph as follows.

In the CISM, each node has three states: A-activated (i.e., the individual buys product A), B-activated (i.e., the individual buys product B), and inactive. In every step, each activated node tries to activate its out-neighbors through the live-edges of A and B based on the following rules. The discrete time steps t = 0, 1, 2, …, n are used to describe the diffusion process.

At step t = 0, the seed sets \( I_{A}, I_{B} \subseteq V \) are activated, where \( I_{A} \cap I_{B} = \emptyset \).

Let \( I_{A}^{t} \subseteq V \) and \( I_{B}^{t} \subseteq V \) be the sets of nodes activated by I A and I B respectively at step t.

At step t + 1, for any node \( v \in N^{out} (I_{A}^{t} \cup I_{B}^{t} ) \), where \( N^{out} (I_{A}^{t} \cup I_{B}^{t} ) \) denotes the set of out-neighbors of \( I_{A}^{t} \cup I_{B}^{t} \), we consider the following four situations:

  (a) If the node v can only be reached by “live-edges” from \( I_{A}^{t} \), then v is added into \( I_{A}^{t + 1} \).

  (b) If the node v can only be reached by “live-edges” from \( I_{B}^{t} \), then v is added into \( I_{B}^{t + 1} \).

  (c) If the node v can be reached by “live-edges” from both \( I_{A}^{t} \) and \( I_{B}^{t} \), then the influence of A dominates and thus node v is added into \( I_{A}^{t + 1} \).

  (d) If the node v cannot be reached by “live-edges” from either \( I_{A}^{t} \) or \( I_{B}^{t} \), then the influence of A and B on node v is blocked, and thus v remains inactive.

The activation process stops when there are no new active nodes in a time step.

Example 2.

Now we give an example of the CISM in Fig. 2. Figure 2(a) is the possible graph \( G^{\prime} \), where \( e(v_{1}, v_{3}) \), \( e(v_{2}, v_{3}) \), \( e(v_{2}, v_{4}) \), and \( e(v_{3}, v_{4}) \) are the “live-edges”, and \( e(v_{1}, v_{2}) \) and \( e(v_{3}, v_{2}) \) are the “block-edges”. In Fig. 2(b), \( v_{2} \), as the seed of B, can reach nodes \( v_{3} \) and \( v_{4} \) by the “live-edges” following the CISM. In Fig. 2(c), \( v_{2} \), as the seed of B, and \( v_{3} \), as the seed of A, both reach \( v_{4} \) by “live-edges”, and \( v_{4} \) is activated by \( v_{3} \) following the CISM.

Fig. 2.

Example of CISM
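The diffusion rules (a) to (d) can be sketched as a two-color breadth-first search over the live edges. The fragment below is an illustrative sketch, not the authors' implementation: the function name `cism_spread` and the adjacency-list representation are assumptions, and it additionally assumes that a node never changes state once it has been activated. The graph encodes the live edges listed in Example 2.

```python
from typing import Dict, List, Set, Tuple


def cism_spread(live_edges: Dict[str, List[str]],
                seeds_a: Set[str],
                seeds_b: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Run the CISM on one possible graph; return (A-activated, B-activated)."""
    active_a, active_b = set(seeds_a), set(seeds_b)
    frontier_a, frontier_b = set(seeds_a), set(seeds_b)
    while frontier_a or frontier_b:
        # Out-neighbors reached through live edges in this step.
        reach_a = {v for u in frontier_a for v in live_edges.get(u, [])}
        reach_b = {v for u in frontier_b for v in live_edges.get(u, [])}
        # Assumption: nodes that are already active keep their state.
        reach_a -= active_a | active_b
        reach_b -= active_a | active_b
        # Rule (c): a node reached by both products goes to A.
        reach_b -= reach_a
        active_a |= reach_a
        active_b |= reach_b
        frontier_a, frontier_b = reach_a, reach_b
    return active_a, active_b


# Live edges of the possible graph in Example 2 (block-edges are simply omitted).
example_graph = {"v1": ["v3"], "v2": ["v3", "v4"], "v3": ["v4"]}

only_b = cism_spread(example_graph, set(), {"v2"})    # setting of Fig. 2(b)
both = cism_spread(example_graph, {"v3"}, {"v2"})     # setting of Fig. 2(c)
```

With only \( v_{2} \) seeded for B, the sketch marks \( v_{2} \), \( v_{3} \) and \( v_{4} \) as B-activated; with \( v_{3} \) additionally seeded for A, node \( v_{4} \) becomes A-activated, in line with the description of Fig. 2(b) and (c).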

4 Maximizing the Competitive Influence Spread

In this section, we discuss the objective function used to select k seeds of A under the CISM when the seeds of B are known.

4.1 Objective Function for Competitive Influence Spread

We use Eq. (4) to compute the probability \( P(G_{i}) \) of each possible graph \( G_{i} \) obtained in Sect. 2. For each possible graph \( G_{i} \), we compute \( \sigma_{{G_{i} }} (I_{A} ,I_{B} ) \), the number of nodes activated by \( I_{A} \) under the CISM when the seed sets \( I_{A} \) and \( I_{B} \) spread simultaneously. The objective function of graph G is the expectation of this value over the possible graphs, formally described as follows.

$$ \sigma_{G} (I_{A} ,I_{B} ) = \sum\limits_{i = 1}^{m} P(G_{i} ) \times \sigma_{{G_{i} }} (I_{A} ,I_{B} ) $$
(5)

Selecting k optimal nodes to maximize their influence under the CISM when the initial nodes \( I_{B} \) are known is NP-hard. The objective function \( \sigma_{G} (I_{A} ,I_{B} ) \) is monotone and submodular under the CISM, and \( \sigma_{G} (\emptyset ,I_{B} ) = 0 \). Based on the theorem given by Nemhauser et al. [13], we can use a greedy algorithm to approximate the optimal result within a factor of 1 − 1/e (where e is the base of the natural logarithm).
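To make Eq. (5) concrete, the following sketch evaluates the objective as a probability-weighted sum of A-activated node counts over a list of possible graphs. The helper names, the two toy graphs and their probabilities are hypothetical, and the sketch assumes that seeds of A count as A-activated nodes, which is consistent with \( \sigma_{G} (\emptyset ,I_{B} ) = 0 \).

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]


def a_activated(live_edges: Graph, seeds_a: Set[str], seeds_b: Set[str]) -> Set[str]:
    """Nodes that end up A-activated in one possible graph under the CISM."""
    active_a, active_b = set(seeds_a), set(seeds_b)
    frontier_a, frontier_b = set(seeds_a), set(seeds_b)
    while frontier_a or frontier_b:
        taken = active_a | active_b
        reach_a = {v for u in frontier_a for v in live_edges.get(u, [])} - taken
        # A dominates: nodes reached by both products are removed from B's share.
        reach_b = {v for u in frontier_b for v in live_edges.get(u, [])} - taken - reach_a
        active_a |= reach_a
        active_b |= reach_b
        frontier_a, frontier_b = reach_a, reach_b
    return active_a


def sigma(weighted_graphs: List[Tuple[float, Graph]],
          seeds_a: Set[str], seeds_b: Set[str]) -> float:
    """Eq. (5): sum over possible graphs of P(G_i) times the number of A-activated nodes."""
    return sum(p * len(a_activated(g, seeds_a, seeds_b)) for p, g in weighted_graphs)


# Two hypothetical possible graphs with made-up existence probabilities.
graphs = [
    (0.6, {"v1": ["v3"], "v2": ["v3", "v4"], "v3": ["v4"]}),
    (0.4, {"v1": ["v2"], "v2": ["v4"]}),
]
spread = sigma(graphs, {"v3"}, {"v2"})        # 0.6 * 2 + 0.4 * 1
empty_spread = sigma(graphs, set(), {"v2"})   # no seeds of A, so no A-activated nodes
```

In the first toy graph the seed \( v_{3} \) of A also captures \( v_{4} \), while in the second it activates nobody else, so the weighted sum is 0.6 × 2 + 0.4 × 1.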

4.2 Approximate Algorithm for Maximizing the Competitive Influence Spread

According to the conclusion in Sect. 4.1, we adopt the CELF algorithm proposed by Leskovec et al. [14] to select the seeds of A. In Algorithm 1, we select the most effective seed of A from \( V - I_{B} \) in each iteration, until k seeds are selected.

First, we describe the basic idea of the algorithm as follows.

Let \( \sigma_{G} (u|I_{A} ,I_{B} ) \) denote the marginal gain of node u added into the seed set I A when the seed set I B spreads simultaneously, that is, \( \sigma_{G} (u|I_{A} ,I_{B} ) = \sigma_{G} (I_{A} \cup \{ u\} ,I_{B} ) - \sigma_{G} (I_{A} ,I_{B} ) \).

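In the first iteration, we compute the marginal gain of every candidate node, keep the candidates in a queue Q sorted by marginal gain in descending order, and select the first element of Q into the seed set \( I_{A} \).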

In the i-th iteration (1 < i ≤ k), suppose the marginal gain \( \sigma_{G} (v_{j}^{i} |I_{A} ,I_{B} ) \) of a node \( v_{j} \) is not smaller than the marginal gains that all other nodes \( v_{l} \) (\( v_{l} \in V\backslash (\{ v_{j} \} \cup I_{{A^{\prime}}} \cup I_{B} ) \)) obtained when added into a set \( I_{{A^{\prime}}} \) in an earlier iteration, i.e., \( \sigma_{G} (v_{j}^{i} |I_{A} ,I_{B} ) \ge \sigma_{G} (v_{l}^{i = 1:i - 1} |I_{{A^{\prime}}} ,I_{B} ),I_{{A^{\prime}}} \subset I_{A} \). By submodularity, the marginal gain of \( v_{l} \) with respect to \( I_{A} \) cannot exceed its earlier value, so \( v_{l} \) does not need to be recomputed in the i-th iteration, and \( v_{j} \) is added into \( I_{A} \). Here \( \sigma_{G} (v_{l}^{i = 1:i - 1} |I_{{A^{\prime}}} ,I_{B} ) \) denotes the marginal gain of \( v_{l} \) when added into the subset \( I_{{A^{\prime}}} \) in an earlier iteration.

In the i-th iteration (1 < i ≤ k), suppose instead that \( \sigma_{G} (v_{j}^{i} |I_{A} ,I_{B} ) \) is not smaller than the earlier marginal gains of some nodes \( v_{l} \) (\( v_{l} \in V\backslash (\{ v_{j} \} \cup I_{A} \cup I_{B} ) \)) but is smaller than the earlier marginal gains of other nodes \( v_{m} \) (\( v_{m} \in V\backslash (\{ v_{j} \} \cup I_{A} \cup I_{B} ) \)), i.e., \( \sigma_{G} (v_{l}^{i = 1:i - 1} |I_{{A^{\prime}}} ,I_{B} ) < \sigma_{G} (v_{j}^{i} |I_{A} ,I_{B} ) < \sigma_{G} (v_{m}^{i = 1:i - 1} |I_{{A^{\prime}}} ,I_{B} ) \). Then we need to recompute the marginal gains of these nodes \( v_{m} \) with respect to \( I_{A} \) in the i-th iteration, and we select the node with the maximal marginal gain in the i-th iteration as the seed of A.

Leskovec et al. [14] empirically showed that the CELF algorithm can be up to 700 times faster than the plain greedy algorithm. Algorithm 1 can therefore be expected to run much faster than the greedy algorithm in practice. The time complexity of the greedy algorithm is O(Rknm), where R, k, n and m are the numbers of possible graphs, seeds, nodes and edges respectively.
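The lazy evaluation described above can be sketched with a priority queue. The fragment below is a generic CELF-style lazy greedy routine, not a reproduction of Algorithm 1: the function `celf`, its signature, and the set-coverage objective used to exercise it are assumptions for illustration. In the setting of this paper, the candidates would be \( V - I_{B} \) and `gain(u, S)` would return \( \sigma_{G} (u|S,I_{B} ) \).

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Set


def celf(candidates: Iterable[Hashable],
         k: int,
         gain: Callable[[Hashable, Set[Hashable]], float]) -> List[Hashable]:
    """Lazy greedy selection of k seeds for a monotone submodular objective.

    gain(u, S) must return the marginal gain of adding u to the seed set S.
    """
    selected: List[Hashable] = []
    chosen: Set[Hashable] = set()
    # Heap entries: (-marginal gain, tie-breaker, node, iteration the gain was computed in).
    heap = [(-gain(u, chosen), i, u, 0) for i, u in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, order, u, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date and still the largest: no other node needs re-evaluation.
            selected.append(u)
            chosen.add(u)
        else:
            # Stale gain: recompute against the current seed set and push back.
            heapq.heappush(heap, (-gain(u, chosen), order, u, len(selected)))
    return selected


# Stand-in objective (an assumption for illustration): set coverage, which is
# monotone and submodular like the objective in Sect. 4.1.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 2}}


def coverage_gain(u, seeds):
    covered = set().union(*(covers[s] for s in seeds))
    return len(covers[u] - covered)


seeds = celf(covers, k=2, gain=coverage_gain)
```

In this toy run only node a is re-evaluated in the second iteration; the stale gains of b and d are never recomputed because they are already below the refreshed gain of a, which is exactly the saving that lazy evaluation provides.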

5 Experimental Results

To test the feasibility and effectiveness of maximizing the competitive influence spread under the CISM in the possible graphs, we conduct experiments on four real-world datasets.

5.1 Experiment Setup

NetHEPT and ca-GrQc are collaboration networks extracted from the ePrint arXiv (http://www.arXiv.org), which is the same source used in the experimental study in [8]. The former is extracted from the “High Energy Physics-Theory” section and the latter from the “General Relativity” section. The nodes in these two networks are authors, and an edge between two nodes means that the two authors coauthored at least one paper. p2p-Gnutella08 records the Gnutella peer-to-peer network on August 8, 2002, where nodes represent hosts in the Gnutella network topology and edges represent connections between the Gnutella hosts. Wiki-Vote is a directed graph of Wikipedia users voting for administrators, where the nodes represent Wikipedia users and a directed edge from node u to node v represents that user u voted on user v.

We use the trivalency cascade model [16] to generate the influence weights of the edges. For each edge, we select a probability uniformly at random from {0.33, 0.66, 0.99}, corresponding to low, medium, and high influence.
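A minimal sketch of this weight assignment is given below. The function name, the edge-list representation and the fixed random seed are assumptions for illustration only.

```python
import random
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]
TRIVALENCY_VALUES = (0.33, 0.66, 0.99)


def assign_trivalency(edges: Iterable[Edge], rng: random.Random) -> Dict[Edge, float]:
    """Give every edge a probability drawn uniformly from the three fixed values."""
    return {edge: rng.choice(TRIVALENCY_VALUES) for edge in edges}


# Hypothetical edge list; the seed is fixed only to make the run repeatable.
weights = assign_trivalency([("u", "v"), ("v", "w"), ("w", "u")], random.Random(42))
```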

In order to measure the spread effectiveness of influence for different target sets, we compared the CELF algorithm for competitive influence spread with the max-degree heuristic and random heuristic on the above four datasets. The CELF algorithm for competitive influence spread chooses the seeds by Algorithm 1. The max-degree heuristic chooses nodes with the largest degree as the product seeds of A. The random heuristic randomly chooses nodes as the product seeds of A.
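The two baseline heuristics can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not say which degree is used, so out-degree is assumed here, and the seeds of B are excluded from the candidates so that \( I_{A} \) and \( I_{B} \) stay disjoint. The function names and the toy network are hypothetical.

```python
import random
from typing import Dict, List, Set

Graph = Dict[str, List[str]]


def max_degree_seeds(out_neighbors: Graph, k: int, excluded: Set[str]) -> List[str]:
    """Pick the k non-excluded nodes with the largest out-degree."""
    candidates = [u for u in out_neighbors if u not in excluded]
    candidates.sort(key=lambda u: len(out_neighbors[u]), reverse=True)
    return candidates[:k]


def random_seeds(out_neighbors: Graph, k: int, excluded: Set[str],
                 rng: random.Random) -> List[str]:
    """Pick k non-excluded nodes uniformly at random, without replacement."""
    candidates = [u for u in out_neighbors if u not in excluded]
    return rng.sample(candidates, k)


# Hypothetical four-node network; v1 plays the role of a seed of B.
toy = {"v1": ["v2", "v3", "v4"], "v2": ["v3"], "v3": ["v4", "v1"], "v4": []}
by_degree = max_degree_seeds(toy, 2, excluded={"v1"})
at_random = random_seeds(toy, 2, excluded={"v1"}, rng=random.Random(7))
```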

To explore the relationship between the number of A-activated nodes and the ratio of the number of seeds of B (i.e., product B) to the number of seeds of A (i.e., product A), \( |I_{B}|/|I_{A}| \), we measured the A-activated nodes for values of \( |I_{B}|/|I_{A}| \) from 0.1 to 1 with the seed set \( I_{B} \) fixed.

First, we tested the effectiveness of Algorithm 1. In this experiment, we selected 10 seeds with the random heuristic as the initial seeds of B and selected 30 seeds of A with Algorithm 1, the max-degree heuristic, and the random heuristic to maximize the spread of A in the NetHEPT, ca-GrQc, p2p-Gnutella08 and Wiki-Vote networks. Figure 3 shows that the CELF algorithm outperforms the max-degree heuristic and the random heuristic. This is because some of the max-degree seed nodes may be clustered, so selecting all of them as seeds of A cannot spread the influence of A effectively. The random heuristic serves as a baseline, and some of its selected seeds cannot spread the influence effectively.

Fig. 3.

A-activated nodes of different algorithms

Then, we tested how the ratio of the number of seeds in \( I_{B} \) to the number of seeds in \( I_{A} \) (i.e., \( |I_{B}|/|I_{A}| \)) affects the number of A-activated nodes. In this experiment, we chose 3000, 2598, 3000 and 1369 nodes and 7494, 9958, 9014 and 16373 edges from the NetHEPT, ca-GrQc, p2p-Gnutella08 and Wiki-Vote networks respectively to generate four synthetic networks, called NetHEPT-new, ca-GrQc-new, p2p-Gnutella08-new, and Wiki-Vote-new respectively. We selected 5, 10 and 15 seeds by the random heuristic as the initial seeds of B and varied \( |I_{B}|/|I_{A}| \) from 0.1 to 1 while maximizing the influence spread of A with Algorithm 1. Figure 4 shows that the number of A-activated nodes decreases as \( |I_{B}|/|I_{A}| \) increases when the seeds of B are fixed. This is because, with \( I_{B} \) fixed, a smaller ratio corresponds to a larger seed set \( I_{A} \), and a larger \( I_{A} \) activates more nodes.

Fig. 4.

The A-activated nodes with the value of \( |I_{B}|/|I_{A}| \) from 0.1 to 1

6 Conclusions and Future Work

Aiming at the effective selection of seeds for competitive influence maximization, we proposed the CISM under the possible graph, in which the active nodes can be obtained by BFS. The possible graph overcomes the hardness of calculating the influence probability in a social network, and the CISM reflects the process of competitive influence well. Further, we gave a submodular objective function and applied the CELF algorithm to solve the problem of competitive influence maximization, which exploits submodularity to accelerate the greedy algorithm.

The CELF algorithm for possible graphs proposed in this paper can select seeds for competitive influence maximization. However, the CELF algorithm is not effective for large-scale social networks. In future work, we plan to explore more effective algorithms for competitive influence maximization under the possible graphs. Beyond effectiveness, one interesting direction is to consider the influence quality of competitive products, and another is to consider asynchronous product spread in a social network.