Influence maximization algorithm based on reducing search space in the social networks

Aghaee, Zahra; Kianian, Sahar

doi:10.1007/s42452-020-03812-w

Influence maximization algorithm based on reducing search space in the social networks

Research Article
Published: 24 November 2020

Volume 2, article number 2067, (2020)
Cite this article

Download PDF

SN Applied Sciences Aims and scope Submit manuscript

Influence maximization algorithm based on reducing search space in the social networks

Download PDF

2493 Accesses
16 Citations
Explore all metrics

Abstract

The influence maximization problem in social networks is an optimization problem in viral marketing. This problem is concerned with identifying a certain number of people with the most influence in the social network level. Considering the NP-hard of this problem, finding an optimal solution with acceptable accuracy and the low running time is of the high importance. To this purpose, GIN (Group of Influential Nodes) algorithm is presented in this article which creates different groups of graph nodes with more connections than other groups. Then, it selects specific nodes from each group to reduce the search space to find the most influential nodes. Following the greedy method, it selects the seed nodes with the highest expected diffusion value. Experimental results show that the GIN algorithm has provided high influence spread along with low running time in comparison algorithms on all seven real-world datasets.

Identification of influential users in social media network using golden ratio optimization method

Article 28 September 2023

A quick GRASP-based method for influence maximization in social networks

Article Open access 30 September 2021

Layer-Prioritized Influence Maximization in Social Networks

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The social networks like LinkedIn, Facebook, and Twitter provide the grounds for people to interact with no time and space limitation [1]. Consequently, information diffusion occurs across a wide range of social networks at high speed. That’s why business companies are using the speed of information diffusion on social networks to achieve widespread advertising and business optimization. Given the limited advertising resources, the main challenge in this problem is to select a specific set of influential persons that most effectively on other people in a short time. To solve this challenge, the influence maximization problem in social networks has been provided.

The influence maximization problem identifies the active nodes as seed nodes. Algorithms in this problem take the graph G and a number k as input and generate seed set, with the intention that the expected number of nodes influenced by the seed set by the stochastic process of the diffusion model, One major problem in the influence maximization problem to maximize the expected size of the final active set, given some constraints on the seed set. The diffusion process of the seed nodes is performed to maximize the influence spread. The influence of seed nodes is determined by the number of nodes activated. The influence maximization problem under both IC and LT models are NP-hard. In other words, the goal of the influence maximization problem is to find initial active people who have the most influence on other people in a short time under a diffusion model [2, 3].

Viral marketing [1, 4,5,6,7,8,9], terrorist attack prevention [4], and network control [10, 11] constitute the applications of the influence maximization problem that viral marketing has attracted researchers more than other applications. In viral marketing, unlike traditional marketing where information diffusion was through newspaper advertising and the mass media, the focus is on peoples effective through family members, relatives, acquaintances, and friends by word of mouth [12]. In this way, diffusion is information transmission from person to other person, such as virus transmission [1]. Domingos and Richardson were the first who addressed the influence maximization problem [6]. Kempe et al. formulated and proved that this problem is an NP-hard [2], who then proposed an approximation algorithm called General Greedy algorithm to solve the influence maximization problem [2]. This algorithm has very low speed and high running time [2]. Some researchers proposed some algorithms to overcome drawbacks in the General Greedy algorithm. The existing influence maximization algorithms have several major drawbacks:

1. They have no effective methods to reduce the search space for influence spread calculation and do not optimally select the candidate nodes.

2. They are not suitable for large-scale networks due to high computational time.

Handling the NP-Hardness and to optimize the seed selection by reducing search space while keeping the running time faster are important to influence maximization problem. So, to address these issues, we propose a new algorithm named GIN heuristic algorithm. In this algorithm, by the identified nodes with a more common number of connections and neighbors than other nodes of the graph, different groups of nodes are generated. By selecting certain nodes from each group, the search space where the final effective nodes are to be found is reduced. Its goal is to increase the speed of the algorithm. During this process, the nodes with the highest expected diffusion value are selected as the seed nodes.

The contributions of our work are as follows:

• Use a grouping method for graph nodes.

• We use candidate nodes optimally to reduce the search space for influence spread calculation.

• The experimental results on seven real-world networks show that the proposed algorithm GIN performs better than the other algorithms in terms of influence spread and running time.

The structure of this article is organized as follows. In Sect. 2, the related works for influence maximization problem is introduced. In Sect. 3, problem definition is expressed. Our proposed algorithm is presented in Sect. 4. Experiments results and evaluations are described in Sect. 5 and we conclude this article in Sect. 6.

2 Related works

Identifying seed nodes is one of the key issues for influence maximization problem in social networks. This is one of the NP-hard problems [2] that has many challenges like high influence spread and low running time, especially on large-scale graphs. Therefore, researchers have presented different algorithms to solve the influence maximization problem. For the first time, Kempe et al. presented an approximate algorithm called General Greedy algorithm with high influence spread that has an optimal approximation [2]. However, this algorithm has a high running time, especially on large-scale graphs [13]. So they proposed the High Degree algorithm by selecting nodes with the highest degree in graph [2]. The High Degree algorithm is very fast, even on large-scale graphs, but has a lower influence spread than the General Greedy algorithm.

Leskovec et al. proposed the CELF algorithm [14]. This algorithm uses lazy evaluation in functions that have the submodularity, significantly reduces influence spread evaluation. Although the CELF algorithm is faster than the General Greedy algorithm, it does not have acceptable speed. Then, Chen et al. proposed the NewGreedyIC algorithm to improve the CELF algorithm [15]. This algorithm computes the influence spread by creating sample graphs from the original graph. However, the NewGreedyIC algorithm improves the running time of the CELF algorithm, but this algorithm still needs to improve the running time, especially on large-scale graphs.

Chen et al. presented the DegreeDiscount algorithm to improve the running time of the CELF algorithm [15]. This algorithm uses a degree discount method for influence spread computations. The DegreeDiscount algorithm has a better running time than the CELF algorithm but does not guarantee optimal approximation. Also, Jiang et al. presented an evolutionary algorithm SA to improve the DegreeDiscount algorithm [16]. This algorithm has gained more influence spread than the DegreeDiscount algorithm, with a high number of iterations and increased running time. Then presented the SAEDV algorithm to reduce the running time of the SA algorithm [16]. In this algorithm, the seed nodes are selected with the maximum expected diffusion value based on the SA algorithm. The SAEDV algorithm has a lower running time than the SA algorithm.

Goyal et al. introduced the SIMPATH algorithm to improve the running time of the CELF algorithm [17]. This algorithm computes the influence spread without using Monte Carlo simulation and using simple paths count. The SIMPATH algorithm has a low memory overhead but does not guarantee optimal approximation. Another algorithm proposed to solve the influence maximization problem is the MIA algorithm [18]. This algorithm was proposed by Wang et al. to improve the computation of the influence spread of the DegreeDiscount algorithm. In the MIA algorithm, using the Dijkstra algorithm, the shortest maximum influence path for each node of the graph is calculated. Although the MIA algorithm has an acceptable running time by eliminating the number of Monte Carlo simulation, the disadvantage of this algorithm is the high memory overhead.

Jung et al. proposed the IRIE algorithm to improve the MIA algorithm [19]. In this algorithm, the influence spread is calculated recursively based on the influence spread of the neighbors of each node in the graph. In the IRIE algorithm, the memory overhead is very low, but the influence spread is low compared to the General Greedy algorithm. In the following, He et al. presented the CLDAG algorithm to improve the General Greedy algorithm [20]. The focus of this algorithm is on blocking the influence maximization problem. Thus, the most important advantage of this algorithm is not dependent on the number of seed nodes, and its main disadvantage is the high memory overhead.

Tang et al. proposed the TIM algorithm to improve the SIMPATH algorithm [21]. In this algorithm, the first set of RR-Set includes nodes that have paths to each other, and then the seed nodes are selected using graph nodes covering. Benefits of TIM algorithm are low running time and low memory overhead. Then, Lu et al. developed the RR-CIM algorithm to improve the High Degree algorithm [22]. This algorithm solves the influence maximization problem by using complementary products. The parameters of this algorithm close the influence maximization problem to the real-world, but the algorithm has a high running time.

In the following, He et al. presented a new algorithm based on community detection to improve the High Degree algorithm [23]. In this algorithm, each community is considered as one node. Then influential nodes are selected based on the red-black tree. This algorithm avoids the rich-club phenomenon but has a high running time. Also, Morone et al. Presented the CI algorithm based on localization of influence spread computation [24]. In this algorithm, the influence spread computation in a circle is limited to the radius L. The main advantage of the CI algorithm is to provide acceptable influence spread into high-density social networks, however, that the calculation of influence spread and running time of the CI algorithm depends on the parameter L and the number of seed nodes.

Then the topology-based algorithm LIR is presented by Liu et al. to improve DegreeDiscount and NewGreedyIC algorithms [25]. First, in this algorithm is calculated the LI of each node, which is indicated the local index of each node that avoids the Rich-club Effect but does not guarantee optimal approximation. The CoFIM algorithm is also based on community detection proposed by Shang et al. to improve High Degree algorithm [26]. In this algorithm, first, the influence spread is calculated based on diffusion between communities and then diffusion inside communities. The advantage of this algorithm is low running time, but the overlapping nodes are not considered. Then Wu et al. developed the LAIM algorithm to improve CoFIM algorithm [27]. By this algorithm, the influence is estimated locally. The main advantage of the LAIM algorithm is its very low running time, and its main disadvantage is that it does not guarantee optimal approximation.

Cui et al. presented an evolutionary algorithm called DDSE to improve the running time of the CELF algorithm [28]. In this algorithm, using the genetic algorithm and local and global search, the seed nodes are selected with the highest expected diffusion value. The DDSE algorithm has influence spread very close to the CELF algorithm on some small datasets. Xie et al. developed the IRR algorithm based on the MBIC model [29]. This algorithm uses the same idea as the DegreeDiscount algorithm. The main advantage of the IRR algorithm is the use of different states based on real-world criteria for nodes, but this algorithm is dependent on the number of nodes in the graph.

The C2IM algorithm is presented by Singh et al. to improve the time-efficiency of the TIM algorithm for the influence maximization problem [10]. This algorithm uses the user’s interests and community framework to find effective seed nodes. Also, Ahmadi Beni et al. presented an algorithm called TI-SC based on community detection to improve the influence spread of the DegreeDiscount algorithm [30]. This algorithm selects seed nodes by examining the relationships between the core nodes. The TI-SC algorithm is useful to reduce the overlap in selecting the seed nodes, but this algorithm has not an efficient performance in large-scale networks.

3 Problem definition

The social networks’ graph is mapped as G (V, E), where V and E are described as follows:

V = {v₁, v₂, …, v_n} is a set of graph nodes, where v_i is the individuals, and n is the number of nodes in the graph.
E = {e_{1, 2}, e_{1, 3}, …, e_{1, n}, e_{2, 1}, e_{2, 3}, …, e_{n, n − 1}} is a set of graph edges, where, e_{i, j} is the relationship between two individuals or an edge between each pair of nodes (i ≠ j).

Considering G (V, E) to find k seed nodes for influence maximization problem, the subset S select from V = {v₁, v₂, …, v_n} due to the diffusion model. The process of seed nodes selection is done until ∣ S ∣ = k so that the influence spread of S, σ(S) is maximized by Eq. (1) under the given diffusion model [31]:

$${S}^{\ast }={argmax}_{S\subset V, |S|=k}\sigma (S)$$

(1)

The influence spread is calculated due to the diffusion model for influence maximization problem. One of the most widely used information diffusion models is the independent cascade model proposed by Kempe et al. The independent cascade model has two important properties in the influence spread function which these properties are submodularity and monotonicity. A set function f : 2^V → ℝ is submodular if, for any S ⊆ T ⊆ V subsets and anyv ∈ V\T element , the marginal contribution of v element, when added to a T set can not exceed the marginal contribution of v element when added to a S ⊆ T subset due to Eq. (2) [31]:

$$f\left(S\cup \left\{v\right\}\right)-f(S)\ge f\left(T\cup \left\{v\right\}\right)-f(T)$$

(2)

The objective function f : 2^v → ℝ is monotone if we have f(S) ≤ f(T) for eachS ⊆ T ⊆ V subset [31].

4 Proposed algorithm

In GIN algorithm, a heuristic method is proposed to identify the influential nodes for the influence maximization problem. In this algorithm, different graph nodes are created to identify nodes with more communication and more common neighbors than other graph nodes. Then, certain nodes are selected from each group. This process is done to create new search space and reduce the number of influence spread calculations. At last, the final influential nodes are selected among the nodes in the new search space which maximize the expected diffusion value as the seed nodes. Figure 1 shows an overview of the general levels of GIN algorithm.

According to Fig. 1, the GIN algorithm consists of three levels: (1) grouping graph nodes, (2) creating new search space, and (3) selecting seed nodes.

4.1 Grouping graph nodes

The first level of algorithm GIN is due to AntCBO [32] algorithm by modification. This level of GIN algorithm includes the process of creating different groups of graph nodes with more communication and common neighbors than other nodes. This level consists of six basic steps:

Step 1 For every node of graph, an available variable is defined which determines the available or non- available of each node. The value of this variable is either True or False. At first, the value of this variable is considered True for all graph nodes. If the node has an available = True value, it will be available; so, it can be selected for calculations, otherwise non-available; thus, it cannot be selected for calculations. Figure 2 shows a graph consists of 16 nodes with available = True variable for all graph nodes.

Step 2 The graph nodes are sorted from highest to lowest degree in descending order in an undirected graph, or out-degree in a directed graph and placed in list L. Thus, in the first place of list, the node is with the highest degree, and in the last place, the node is with the lowest degree from graph, according to Fig. 3.

Step 3 The group g_i = {} is created (i = {1, 2, 3, …, l}). Then, the first node from list L which has an available = True value, is selected and added to g_i group (g_i ∪ {v_i}). Then, the availability value of this node changes into available = False. In the following, v_j node with the highest degree among v_i node ’s neighbors which has the available = True value, is selected and added to g_i group (g_i ∪ {v_j}). The availability value of v_j node is changed into available = False (Fig. 4).

Step 4 If v_i and v_j nodes do not have common neighbors, the computation continues from the previous step for the next node from L list with the value of avalable = True.

Step 5 If v_i and v_j nodes have common neighbors, ComNei set will be created including common neighbors of v_i and v_j nodes (ComNei = {v₁, v₂, …, v_m}) which {v₁, v₂, …, v_m} will be always belonging to ComNei for each v_i and v_j. The v_m node is selected from ComNei set with available = True value randomly and is added to g_i group (g_i ∪ {v_m}). Then, the availability value of this node is changed to available = False. If the neighbors of v_m node do not have any common with nodes within ComNei set , the availability of next node from ComNei set will be checked which is added to g_i group ; otherwise, Z set including the common nodes of v_m’s neighbors and ComNei set is created (Z = neighbor(v_m) ∩ ComNei). Then, ComNei set is replaced with Z set . This process is repeated until ComNei = {}, or the nodes in this set have available = false value (Fig. 5).

Step 6 The previous described steps are repeated until the created groups reach a specified number (Itrg_i = l). l is the maximum number of groups created; also, that is constant.

4.2 Creating new search space

In this level, the two nodes v_i and v_j which include the first and second added nodes are selected from each g_i group. The first node selected from each g_i group is the node v_i with the highest degree in that group, and the second node selected is the v_j node with the highest degree among the neighbors v_i. Each group g_i contains at least one v_i node. If there exists no node v_j in a g_i, that group has only one node, thus the only v_i node is selected. Also, if there are multiple nodes v_i that have the same maximum degree value in each g_i group, and similarly, if there are such multiple nodes v_j, these are selected randomly. In this way, by identifying and selecting these nodes from each group, the search space to find influential nodes is reduced. This level is repeated until the selected nodes reach a specified number (Ncandidate = m). m is the maximum number of candidate nodes selected.

4.3 Selection of seed nodes

In this level, the selection of seed nodes from a new search space is done by the greedy method. Thus, according to Eq. (3), the Expected Diffusion Value (EDV) of these nodes is calculated. This equation has also been used in other studies [16, 28]. The first time, Jiang et al. proposed the expected diffusion value to estimate the actual influence spread under the independent Cascade model and employed it in their algorithm to replace Monte Carlo simulation [16]. The EDV approximates the spread by computing how many direct neighbors of the nodes in the candidate node set S are expected to be activated. According to the Eq. (3), k is the number of seed node, p is the activation probability the non-seed nodes, ${N}_S^l\backslash S$ is the neighbors of the nodes S except the nodes of set S, and τ (i) is the number of neighbors of i node in set S.

$$EDV(S)=k+{\sum}_{i\epsilon {N}_S^l\backslash S}1-{\left(1-p\right)}^{\tau (i)}$$

(3)

Consequently, by applying the greedy selection method, a node from the new search space which causes maximization of the marginal gain function (EDV(S ∪ {w}) − EDV(S)) as the seed node added to the set S. This process is repeated until ∣ S ∣ = k. Then, the influence spread of seed nodes is calculated using the independent cascade model. Algorithm1 shows the pseudo-code of the GIN algorithm.

5 Experiments and evaluation of the proposed algorithm

The GIN algorithm has been evaluated on seven real datasets. Then the results of influence spread and running time of algorithms evaluated under the independent cascade model in a similar system. Influence spread means the number of final nodes activated by the seed nodes in the diffusion process. The running time means the time has passed to find the seed nodes. All the codes are written in C#, and all algorithms are performed on a Windows 10 operating system with 2.20 GHz CPU (Intel® Core (TN) i5-5200U) and 8 GB of RAM.

5.1 Description of the datasets

The GIN algorithm is implemented on seven datasets of different sizes. The characteristics of these datasets are shown in Table 1.

Table 1 Specifications of seven real-world social networks

Full size table

Email [33]: This dataset is a network of emails at URV University that includes professors, administrators, technicians, researchers, and graduates. In this dataset, each node represents the email address, and the link between the emails is an edge.

NetScience [34]: This dataset contains a collaboration network of scientists working on network theory. Newman compiles this dataset in the year 2006. The nodes in this dataset represent the authors of the articles, and the edges indicate the collaboration between the authors in a common article.

Power [35]: This dataset contains information about the power grid of the Western States of the United States of America that a node is either a generator, a transformator, or a substation. Also, an edge represents a power supply line.

NetHEPT [15]: This dataset contains arXiv articles in the “High Energy Physics Theory” section, published in the years 1991 to 2003. The nodes in this dataset represent the authors and, if they work together in an article, an edge is created between them.

NetPHY [15]: This dataset contains arXiv articles in the “Physics” section. In this network, each node is an author, and the edges between two nodes represent the collaboration of two authors in one article.

Epinions1 ^{Footnote 1}: This dataset is an online social network trust-based, collected from www.epinions.com. Each node in this dataset represents customers who are users of the site, and each edge represents a secure connection.

Slashdot0902 ^{Footnote 2}: This dataset is a version of the popular site www.slashdot.org about technology news where users can tag other users as friends or foes. The network was collected in February 2009.

5.2 The compared algorithms

The GIN algorithm is compared with various algorithms, including General Greedy, High Degree, SAEDV, LIR, and DDSE algorithms based on two criteria of influence spread and running time. The algorithm is implemented with parameter l = 150, Monte Carlo simulation R = 10000, and the activation probability p = 0.01.

General Greedy According to 2], this algorithm is evaluated by Monte Carlo simulation R = 10,000 and the activation probability p = 0.01 under independent cascade model.

High Degree According to [2], in this algorithm, the influence spread of k seed nodes with the highest degree of the graph is calculated under the independent cascade model.

SAEDV According to [16], in this algorithm, choose to initialize a node set randomly and optimize it with iterations. The parameters for SAEDV algorithm are set as follows: the initial temperature T₀ = 500000, the final temperature T_f = 100000, the number of inner loop q = 10 and the temperature drop after each inner loop ∆T = 2000.

LIR According to [25], in this algorithm, nodes with 0-LI are selected for the NetScience, NetHEPT, NetPHY, Epinions1, and Slashdot0902 datasets, while in the Email dataset, nodes with 0-LI are low, then nodes are selected with 1-LI. Influence spread calculations are performed by Monte Carlo simulation R = 10,000 and the activation probability p = 0.01 under independent cascade model.

DDSE According to [28], the results of this algorithm are obtained with the size of the population Npop = 10, mutation probability OP_Mutation = 0.1, Crossover probability OP_Crossover = 0.4, Number of generation the evolution lasts MaxIt = 200 and the diversity of the population Diversity = 0.6.

5.3 Diffusion model

Researchers have proposed various types of information diffusion models [2, 36,37,38,39,40]. One of the most widely used diffusion models is the independent cascade model, which has been used in most studies [2, 5, 37, 41, 42]. Our algorithm focuses on the influence maximization problem under the independent cascade model too. This model was introduced by Goldenberg et al. [5]. In this model, the nodes are either in an active or inactive state. Each edge between two nodes v and u has the activation probability P_vu that means inactive node u is active by node v. Each active node v at time t can only activate its neighbor’s inactive node once. And then, nodes that are activated at time t − 1 can activate their neighbor inactive nodes. Thus, if every active node v at time t with probability P_vu activates its neighbor inactive nodes, then at time t + 1 considered as an active node. The diffusion process will stop when no more new nodes are activated.

5.4 Experimental results

According to the description of the proposed algorithm, first, l groups are created. Here, the GIN algorithm is tested with different l to selecting the best value of parameter l with the highest influence spread on each seven real-world datasets. The influence spread results of this algorithm with different l are shown in Tables 2, 3, 4, 5, 6, 7, and 8 on Email, NetScience, Power, NetHEPT, NetPHY, Epinions1, and Slashdot0902 datasets respectively. According to these results, obviously, the influence spread of the different l is very close to each other algorithms, but the influence spread of the GIN with l = 150 is higher than other values of the parameter l in most of the k nodes. Thus, to compare the influence spread of the proposed algorithm with other algorithms is used parameter l=150 for the GIN algorithm on all of seven datasets.

Table 2 The influence spread of GIN algorithm with different numbers of l on Email dataset

Full size table

Table 3 The influence spread of GIN algorithm with different numbers of l on NetScience dataset

Full size table

Table 4 The influence spread of GIN algorithm with different numbers of l on Power dataset

Full size table

Table 5 The influence spread of GIN algorithm with different numbers of l on NetHEPT dataset

Full size table

Table 6 The influence spread of GIN algorithm with different numbers of l on NetPHY dataset

Full size table

Table 7 The influence spread of GIN algorithm with different numbers of l on Epinions1 dataset

Full size table

Table 8 The influence spread of GIN algorithm with different numbers of l on Slashdot0902 dataset

Full size table

The results of the evaluated algorithms in terms of influence spread and running time are shown in Fig. 6. These algorithms were implemented with fixed activation probability (p = 0.01) and varying sizes of the seed set with the range from k = 5 to k = 50 based on the independent cascade model. The x-axis indicates the seed nodes, and the y-axis indicates the influence spread.

The influence spread of the evaluated algorithms on the Email dataset is shown in Fig. 6a. In this figure, the results of the algorithms General Greedy and GIN are close to each other. The influence spread in the SAEDV algorithm is higher than the High Degree and DDSE algorithms. The influence spread in the DDSE algorithm from k = 5 to k = 20 is very close to that of the High Degree algorithm, and from k = 25 to k = 50, it follows a gradual reduction. The LIR algorithm has the lowest influence spread than the other algorithms in the Email dataset.

In Fig. 6b, the influence spread of the evaluated algorithms on the NetScience dataset is shown. In this figure, the influence spread of General Greedy, GIN, and DDSE algorithms are higher than other algorithms and are close to each other. The influence spread in LIR and SAEDV algorithms is less than the High Degree algorithm. Also, the influence spread of the comparable algorithms on the Power dataset is shown in Fig. 6c. According to Fig. 6c, General Greedy and GIN have higher influence spread than other algorithms. The highest influence spread belongs to DDSE after than the General Greedy and GIN algorithms. Then, the influence spread from high to low belongs to the High Degree, LIR, and SAEDV in most k nodes, respectively.

The influence spread of the evaluated algorithms on the NetHEPT dataset is shown in Fig. 6d, where the highest influence spread is for General Greedy and then GIN algorithm. The influence spread in the GIN algorithm, for k = 5 to k = 25, is very close to the General Greedy algorithm. The DDSE algorithm has a higher influence spread than the High Degree, LIR, and SAEDV algorithms. Of course, the results of the DDSE algorithm from k = 5 to k = 20 are close to the GIN algorithm. After the DDSE algorithm, the highest influence spread is for the High Degree algorithm and then the LIR algorithm. The influence spread of the SAEDV algorithm is very low on the NetHEPT dataset.

The influence spread of the evaluated algorithms on the NetPHY dataset is shown in Fig. 6e. In this figure, the influence spread of General Greedy and then GIN are higher than other algorithms. Also, the influence spread from most to less is for the DDSE algorithm and then the High Degree, LIR, and SAEDV algorithms. Also, the influence spread of the evaluated algorithms on the Epinions1 dataset is shown in Fig. 6f. Epinions1 is a large-scale dataset, and the General Greedy algorithm cannot run on it because this algorithm has a high running time. Therefore, the run of the General Greedy algorithm on the Epinions1 dataset is ignored. As observed in Fig. 6f, the GIN and High Degree algorithms have higher influence spread than other algorithms. However, GIN and High Degree algorithms have very close results in some k nodes, but the GIN algorithm in most k nodes has a higher influence spread than the High Degree. Then the influence spread of the DDSE algorithm is more than SAEDV and LIR algorithms.

In Fig. 6g, the influence spread of the algorithms on the Slashdot0902 dataset is shown. This dataset, like the Epinions1 dataset, has a large-scale, and the General Greedy algorithm cannot run on it because this algorithm has a high running time. Therefore, the run of the General Greedy algorithm on the Slashdot0902 dataset is ignored. According to Fig. 6g, the influence spread of the GIN algorithm is higher than other algorithms. Of course, the High Degree and DDSE algorithms in some k nodes have results close to the GIN algorithm. But in general, the GIN algorithm in the Slashdot0902 dataset has shown more influence spread than the compared algorithms. The SAEDV algorithm and then the LIR algorithm in this dataset has the lowest influence spread.

The speedup % (in terms of influence spread) of the GIN algorithm over other algorithms is shown in Table 9. In this table, the General Greedy algorithm called GG and the High Degree algorithm called HD. Due to the General Greedy lack of scalability, its results are not reported for Epinions1 and Slashdot0902 datasets. Table 9 shows the speedup % (in terms of influence spread) of our proposed algorithm GIN over other algorithms. The speedup is computed using Eq. (4). We can see that the GIN algorithm has a positive speedup over other algorithms in this table.

Table 9 Speedup % (in terms of influence spread) on different datasets

Full size table

$$Speedup=((\alpha -\beta)/\alpha )\times 100$$

(4)

According to the Eq. (4), α is the influence spread of the GIN algorithm, and β is the influence spread of the other algorithms for different k seed nodes.

The results of the running time algorithms are shown in Fig. 7 on the seven datasets Email, NetScience, Power, NetHEPT, NetPHY, Epinions1, and Slashdot0902. These results are for k = 50 and the number of the Monte Carlo simulation R = 10000 under the independent cascade model per second. According to Fig. 7, the High Degree and LIR algorithms have the lowest running time in all seven datasets. Of course, the High Degree and LIR algorithms have low the influence spread on more comparable datasets. Then, the running time from lowest to highest belongs to the GIN, DDSE, SAEDV algorithms, and then to the General Greedy algorithm.

Given the results of influence spread and running time of the compared algorithms in Figs. 6 and 7, it is quite obvious that the GIN algorithm has high influence spread along with low running time in all seven datasets.

6 Conclusion

Influence maximization is a problem that has been addressed in social networks analysis and viral marketing. The purpose of this problem is to find the k individual that, by activating them, the highest number of people in a social network can be activated at an acceptable time. For this purpose, the GIN algorithm is presented in this article. In this algorithm, the process of creating different groups, including graph nodes with more number of connections and more common neighbors than other groups, is performed. Hence, the search space to find the final influential nodes is reduced. This process reduces the number of computations and increases the speed of the proposed algorithm. Next, nodes from the new search space that maximize the marginal gain function of the expected diffusion value are selected as the seed node. Experimental results show that the GIN algorithm, in comparison algorithms on all seven real-world datasets, has provided high influence spread along with low running time.

Notes

References

Rui X et al (2019) A reversed node ranking approach for influence maximization in social networks. Appl Intell 49(7):2684–2698
Article Google Scholar
Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Banerjee S, Jenamani M, Pratihar DK (2019) ComBIM: a community-based solution approach for the budgeted influence maximization problem. Expert Syst Appl 125:1–13
Article Google Scholar
Huang H, Shen H, Meng Z (2019) Community-based influence maximization in attributed networks. Appl Intell 1–11
Goldenberg J, Libai B, Muller E (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark Lett 12(3):211–223
Article Google Scholar
Domingos P, Richardson M (2001) Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Richardson M, Domingos P (2002) Mining knowledge-sharing sites for viral marketing. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM
Pei S et al (2017) Efficient collective influence maximization in cascading processes with first-order transitions. Sci Rep 7:45240
Article Google Scholar
Chen Y et al (2020) Semantics-aware influence maximization in social networks. Inf Sci 513:442–464
Article MathSciNet Google Scholar
Singh SS et al (2019) C2IM: community based context-aware influence maximization in social networks. Phys A Stat Mechan Appl 514:796–818
Article MathSciNet Google Scholar
Leskovec J, Adamic LA, Huberman BA (2007) The dynamics of viral marketing. ACM Trans Web (TWEB) 1(1):5
Article Google Scholar
Liqing Q et al (2020) TSIM: a two-stage selection algorithm for influence maximization in social networks. IEEE Access 8:12084–12095
Article Google Scholar
Srinivasan S, Babu LD (2019) Interest aware influential information disseminators in social networks. SN Appl Sci 1(11):1456
Article Google Scholar
Leskovec J, et al (2007) Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM
Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Jiang Q, et al (2011) Simulated annealing based influence maximization in social networks. In Twenty-fifth AAAI conference on artificial intelligence
Goyal A, Lu W, Lakshmanan LV (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In 2011 IEEE 11th international conference on data mining. IEEE
Wang C, Chen W, Wang Y (2012) Scalable influence maximization for independent cascade model in large-scale social networks. Data Min Knowl Disc 25(3):545–576
Article MathSciNet MATH Google Scholar
Jung K, Heo W, Chen W (2012) Irie: scalable and robust influence maximization in social networks. In 2012 IEEE 12th international conference on data mining. IEEE
He X, et al (2012) Influence blocking maximization in social networks under the competitive linear threshold model. In Proceedings of the 2012 siam international conference on data mining. SIAM
Tang Y, Xiao X, Shi Y (2014) Influence maximization: Near-optimal time complexity meets practical efficiency. In Proceedings of the 2014 ACM SIGMOD international conference on management of data
Lu W, Chen W, Lakshmanan LV (2015) From competition to complementarity: comparative influence diffusion and maximization. Proc VLDB Endow 9(2):60–71
Article Google Scholar
He J-L, Fu Y, Chen D-B (2015) A novel top-k strategy for influence maximization in complex networks with community structure. PLoS One 10(12):e0145283
Article Google Scholar
Morone F et al (2016) Collective influence algorithm to find influencers via optimal percolation in massively large social media. Sci Rep 6:30062
Article Google Scholar
Liu D et al (2017) A fast and efficient algorithm for mining top-k nodes in complex networks. Sci Rep 7:43330
Article Google Scholar
Shang J et al (2017) CoFIM: a community-based framework for influence maximization on large-scale networks. Knowl-Based Syst 117:88–100
Article Google Scholar
Wu H et al (2018) LAIM: a linear time iterative approach for efficient influence maximization in large-scale networks. IEEE Access 6:44221–44234
Article Google Scholar
Cui L et al (2018) DDSE: a novel evolutionary algorithm based on degree-descending search strategy for influence maximization in social networks. J Netw Comput Appl 103:119–130
Article Google Scholar
Xie G et al (2019) MBIC: a novel influence propagation model for membership-based influence maximization in social networks. IEEE Access 7:75696–75707
Article Google Scholar
Beni HA, Bouyer A (2020) TI-SC: top-k influential nodes selection based on community detection and scoring criteria in social networks. J Ambient Intell Humaniz Comput 1–20
Chen W, Lakshmanan LV, Castillo C (2013) Information and influence propagation in social networks. Synth Lect Data Manag 5(4):1–177
Article Google Scholar
Zhou X et al (2015) An ant colony based algorithm for overlapping community detection in complex networks. Phys A Stat Mech Appl 427:289–301
Article Google Scholar
Guimera R et al (2003) Self-similar community structure in a network of human interactions. Phys Rev E 68(6):065103
Article Google Scholar
Newman ME (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74(3):036104
Article MathSciNet Google Scholar
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’networks. Nature 393(6684):440–442
Article MATH Google Scholar
Peng S et al (2018) Influence analysis in social networks: A survey. J Netw Comput Appl 106:17–32
Article Google Scholar
Wang W, Street WN (2018) Modeling and maximizing influence diffusion in social networks for viral marketing. Appl Netw Sci 3(1):6
Article Google Scholar
Singh SS et al (2019) MIM2: multiple influence maximization across multiple social networks. Phys A Stat Mech Appl 526:120902
Article Google Scholar
Tejaswi V, Bindu P, Thilagam PS (2016) Diffusion models and approaches for influence maximization in social networks. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE
Banerjee S, Jenamani M, Pratihar DK (2018) A survey on influence maximization in a social network. arXiv preprint arXiv:1808.05502
Zhong Z, Yang H (2019) Community Structure-based re-ranking information influence maximization algorithm for typhoon disasters. Ekoloji Dergisi 2019(107)
Zhu R, et al (2018) QuickIM: efficient, accurate and robust influence maximization algorithm on billion-scale networks. arXiv preprint arXiv:1805.12320

Download references

Author information

Authors and Affiliations

Faculty of Computer Engineering, Shahid Rajaee Teacher Training University (SRTTU), Tehran, Iran
Zahra Aghaee & Sahar Kianian

Authors

Zahra Aghaee
View author publications
You can also search for this author in PubMed Google Scholar
Sahar Kianian
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zahra Aghaee.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest.

Additional information

Publisher's Note

pringer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Aghaee, Z., Kianian, S. Influence maximization algorithm based on reducing search space in the social networks. SN Appl. Sci. 2, 2067 (2020). https://doi.org/10.1007/s42452-020-03812-w

Download citation

Received: 24 April 2020
Accepted: 31 October 2020
Published: 24 November 2020
DOI: https://doi.org/10.1007/s42452-020-03812-w

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Influence maximization algorithm based on reducing search space in the social networks

Abstract

Similar content being viewed by others

Identification of influential users in social media network using golden ratio optimization method

A quick GRASP-based method for influence maximization in social networks

Layer-Prioritized Influence Maximization in Social Networks

1 Introduction

2 Related works

3 Problem definition