
1 Introduction

With the development of the Internet and the continuous improvement of information technology, people increasingly conduct social activities online, expressing their views and sharing their daily lives through social software. This has given rise to social networks mediated by such software. As platforms, social networks play an important role in the interaction between individuals and in the dissemination of information and ideas. Among commonly used social software, Facebook has 2.2 billion users, WeChat has 1 billion users, and Twitter has 340 million users [1]. People are rarely isolated, either in real life or in social networks: they form groups based on common interests, hobbies, or some other relationship. A group can be as small as two or three people, such as a family, or as large as a state or even a country. Because social networks are so influential, they have important applications in information dissemination, advertising and marketing, public opinion control, and other areas. Take viral marketing as an example: when an advertising company wants to achieve a good marketing result at limited cost, it uses the “word-of-mouth” effect, selecting k users whose “mouths” maximize the advertising audience. In addition, individual decisions often depend on group decisions. For example, suppose a company needs to buy pens for all of its employees. When a majority of the employees are persuaded by a certain brand and decide to buy its pens, the company buys that brand for everyone. Another example is the US presidential election: if a presidential candidate wins a majority in a state election, he or she receives all of that state's electoral votes. These are practical examples of maximizing group influence.

1.1 Related Work

To the best of our knowledge, Domingos and Richardson [2] were the first to address the issue of Influence Maximization (IM). Kempe et al. [3] were the first to formulate IM as a discrete optimization problem. They showed that the IM problem is NP-hard under both the Independent Cascade (IC) model and the Linear Threshold (LT) model, while the objective function is submodular. They proposed a greedy algorithm that uses Monte Carlo simulation of the influence propagation process and achieves a \(\left( 1-\frac{1}{e}-\varepsilon \right) \)-approximation for any \(\varepsilon >0\). However, the greedy algorithm is computationally expensive. Subsequently, many scholars have studied the IM problem and proposed approximation algorithms based on improved greedy approaches, as well as heuristic algorithms [4,5,6,7,8]. Heuristic algorithms are favored for their speed, but their solution quality is not as good as that of the approximation algorithms. The improved greedy algorithms, in particular, are faster than the traditional greedy algorithm because they exploit the submodularity of the objective function.

Algorithms for solving the IM problem have since developed rapidly. Borgs et al. [9] proposed the RIS algorithm, which greatly reduced the computational time of simulating the propagation process. Based on the RIS algorithm, Tang et al. [10, 11] proposed the TIM, TIM\(^{+}\) and IMM algorithms, which guarantee a \(\left( 1-\frac{1}{e}-\varepsilon \right) \) approximation ratio under the IC model. More recently, Nguyen et al. [12] proposed the SSA and D-SSA algorithms, which were the first approximation algorithms to satisfy the strict theoretical threshold of IM with the minimum sample set.

A social network can be divided into multiple communities by community discovery algorithms. Nodes within a community are closely connected while nodes in different communities are sparsely connected, so influence spreads quickly and widely within a community. The two most commonly used community discovery algorithms are OASNET (Optimal Allocation in a Social NETwork) and CGA (Community-based Greedy Algorithm). The OASNET algorithm was proposed by Cao et al. [13] to solve the IM problem through optimal dynamic allocation of resources. The CGA algorithm [14] combines dynamic programming with the greedy algorithm to allocate the optimal number of seed nodes to each community so as to maximize the influence. Ji et al. [15] proposed a new algorithm that uncovers the hidden community structure in the network and then selects the k nodes with the largest community coverage as seed nodes. Moreover, many researchers have studied community properties in order to solve IM, which aims to maximize the number of eventually activated nodes, whereas the task of Group Influence Maximization (GIM) is to activate the maximum number of groups rather than individuals [16,17,18,19,20,21,22].

In terms of research on GIM, Zhu et al. [23] proposed a sandwich approximation framework based on D-SSA method to obtain seed nodes, which achieved an approximation guarantee of \(\left( 1-\frac{1}{e}-\varepsilon \right) \). In addition, Zhu et al. [24] also proposed a sandwich approximation framework based on ED-SSA method, which approximated the upper and lower bounds of the objective function and compared them with the Group Coverage Maximization Algorithm (GCMA) to obtain the seed nodes. Although great breakthroughs have been made in GIM, many scholars are still working on more efficient algorithms.

1.2 Our Contribution

Our results can be summarized as follows:

  • We devise a heuristic algorithm called Complementary Maximum Coverage (CMC), which emphasizes the influence of the nodes over groups to solve GIM.

  • We also propose the Improved Reverse Influence Sampling (IRIS) algorithm by adjusting the famous Reverse Influence Sampling (RIS) algorithm for GIM.

  • Experiments show that our proposed algorithms outperform both the Maximum Coverage (MC) algorithm and the Maximum Out-degree (MO) algorithm in terms of the average number of eventually activated groups under the IC model.

1.3 Organization

The remainder of the paper is organized as follows: Sect. 2 gives the social network model and formally introduces the GIM problem; Sect. 3 presents the CMC algorithm and the IRIS algorithm; Sect. 4 evaluates the four algorithms under the IC model through numerical experiments; Sect. 5 concludes the paper.

2 Problem Description

2.1 Network Model

We model the social network as \(G=\left( V,\,E,\,P,\,U\right) \).

V is the set of nodes, which represent users in a social network. Assuming that the social network has n users, \(V=\left\{ v_{1},v_{2},\ldots ,v_{n}\right\} \). A node can influence other nodes or be influenced by them, and these relations form edges. If there is an edge between two nodes, the two nodes are neighbors of each other.

E is the set of edges, which represent the influence between nodes. Assuming that the social network has m edges, \(E=\left\{ e_{1},e_{2},\ldots ,e_{m}\right\} \). An edge can be directed or undirected. In a directed graph, \(\left( u,v\right) \) means that node u has influence on node v but node v has no influence on node u; u is the source node and v is the target node. An edge with u as its source node is an outgoing edge of u, and an edge with u as its target node is an incoming edge of u. The number of u’s outgoing edges is the out-degree of u, and the number of u’s incoming edges is the in-degree of u.

P is the set of probabilities that serve as the weights of the edges: \(P=\left\{ p_{1},p_{2},\ldots ,p_{m}\right\} \), with \(p_{i}\in \left[ 0,1\right] \) for every integer \(1\le i\le m\). The higher the probability is, the more likely the source node is to successfully activate the target node.

U is the set of groups. Assuming that the social network has l groups, \(U=\left\{ u_{1},u_{2},...,u_{l}\right\} \), where \(u_{j}\) is a subset of V for every integer \(1\le j\le l\). In a social network, each node can be an individual or belong to one or more groups. When \(\beta \%\) of the members of a group are activated, we consider the group successfully activated.
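As a minimal sketch of how the model \(G=\left( V,\,E,\,P,\,U\right) \) could be held in memory, the snippet below stores a toy directed instance and derives the out-degree and in-degree of each node. The node ids, probabilities and groups are hypothetical values chosen only for illustration; they do not come from the paper's data sets.

```python
from collections import defaultdict

# Hypothetical toy instance of G = (V, E, P, U); all values are illustrative.
V = [1, 2, 3, 4]
# Each directed edge (u, v) is mapped to its activation probability in [0, 1].
P = {(1, 2): 0.4, (1, 3): 0.7, (2, 3): 0.2, (3, 4): 0.9}
E = list(P)
# Groups are subsets of V; here node 2 belongs to both groups.
U = [{1, 2}, {2, 3, 4}]

out_degree = defaultdict(int)
in_degree = defaultdict(int)
for u, v in E:
    out_degree[u] += 1  # (u, v) is an outgoing edge of u
    in_degree[v] += 1   # (u, v) is an incoming edge of v

for node in V:
    print(node, out_degree[node], in_degree[node])
```

Keeping P as a dictionary keyed by edge is one possible design choice: it ties every probability to its edge without relying on a shared index i.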

2.2 Group Influence Maximization

The IM problem asks for k initial active nodes that maximize the expected number of nodes eventually activated under a given information diffusion model. Figure 1a is a simple social network graph without groups. Each edge is directed, indicating that influence flows from the source node to the target node, and each edge carries an activation probability.

Fig. 1. Examples of simple social networks with vs without groups (Color figure online)

GIM seeks k nodes that maximize the expected number of eventually activated groups. Each node in GIM can be independent or belong to one or more groups, and a group is activated only if at least \(\beta \%\) of its members are activated. The larger the value of \(\beta \) is, the more difficult it is for the group to be activated.

The IM problem is a special case of the GIM problem in which every group consists of a single node and \(\beta \%=100\%\). Since the IM problem is NP-hard, the GIM problem is NP-hard as well. For a given graph \(G=\left( V,\,E,\,P,\,U\right) \), the mathematical description of GIM is:

$$\begin{aligned} \max&\rho \left( S\right) \\ s.t.&\left| S\right| \le k \end{aligned}$$

where S is the set of seed nodes, k is the number of initial seed nodes, and \(\rho \left( S\right) \) is the expected number of groups activated by the initial seed nodes under a given propagation model. It is difficult to calculate \(\rho \left( S\right) \) because activation is probabilistic. In the IM problem, computing \(\rho \left( S\right) \) under the IC model is \(\#\)P-hard; likewise, in the GIM problem, computing \(\rho \left( S\right) \) under the IC model is also \(\#\)P-hard [24].
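Because \(\rho \left( S\right) \) cannot be computed exactly in practice, one common workaround is to estimate it by repeated simulation. The following is a minimal sketch of such a Monte Carlo estimate under the standard IC model, not the paper's implementation. The function names, the adjacency format `succ` (node mapped to a list of `(neighbor, probability)` pairs) and the default number of runs are assumptions made for this example.

```python
import random
from collections import deque

def simulate_ic(succ, seeds, rng):
    """Run one Independent Cascade simulation and return the active nodes.

    succ maps a node to a list of (out_neighbor, probability) pairs
    (hypothetical input format for this sketch).
    """
    active = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v, p in succ.get(u, []):
            # A newly active node gets a single chance to activate each neighbor.
            if v not in active and rng.random() < p:
                active.add(v)
                queue.append(v)
    return active

def estimate_rho(succ, groups, seeds, beta, runs=1000, seed=0):
    """Average number of groups with at least beta percent active members."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        active = simulate_ic(succ, seeds, rng)
        # Integer comparison avoids rounding issues with beta / 100.
        total += sum(1 for g in groups if 100 * len(g & active) >= beta * len(g))
    return total / runs

# Deterministic toy check: with probability 1 the cascade reaches 1 -> 2 -> 3.
toy_succ = {1: [(2, 1.0)], 2: [(3, 1.0)]}
toy_groups = [{1, 2}, {3, 4}, {5}]
print(estimate_rho(toy_succ, toy_groups, seeds=[1], beta=50, runs=10))
```

The estimate only converges to \(\rho \left( S\right) \) as the number of runs grows, which is why simulation-based greedy methods are expensive.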

Activating more nodes is not the same as activating more groups. As shown in Fig. 1b, there are three groups in the social network, \(U=\left\{ u_{1},u_{2},u_{3}\right\} \), where \(u_{1}\) is the yellow one, \(u_{2}\) is the pink one, and \(u_{3}\) is the orange one: \(u_{1}=\left\{ 13,15\right\} \), \(u_{2}=\left\{ 2,4,5,6,15\right\} \), \(u_{3}=\left\{ 9,11\right\} \). The group activation threshold is assumed to be 50%, meaning that a group is activated only when at least half of its members are active. Suppose that the seed node \(\left\{ 2\right\} \) finally activates \(\left\{ 2,4,15\right\} \). Three nodes are activated: one of the two members of \(u_{1}\) and three of the five members of \(u_{2}\), so \(u_{1}\) and \(u_{2}\) are both activated and \(\rho \left( S\right) =2\). Suppose instead that seed node \(\left\{ 2\right\} \) successfully influences \(\left\{ 2,4,5,6\right\} \). Four nodes are activated, but only \(u_{2}\) is activated, so \(\rho \left( S\right) =1\). Hence a larger number of activated nodes does not guarantee a larger number of activated groups, although the more nodes are activated, the more likely groups are to be activated.
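The two cases in this example can be checked mechanically. The sketch below counts the groups whose active share reaches the threshold, using the three groups and the 50% threshold stated above; the function name `count_activated_groups` is a hypothetical helper introduced for illustration.

```python
def count_activated_groups(active, groups, beta):
    """Count groups in which at least beta percent of the members are active."""
    count = 0
    for members in groups.values():
        # Compare as integers: active share >= beta / 100.
        if 100 * len(members & active) >= beta * len(members):
            count += 1
    return count

# Groups and threshold taken from the worked example in the text.
groups = {"u1": {13, 15}, "u2": {2, 4, 5, 6, 15}, "u3": {9, 11}}

print(count_activated_groups({2, 4, 15}, groups, beta=50))    # 2
print(count_activated_groups({2, 4, 5, 6}, groups, beta=50))  # 1
```

Three active nodes activate two groups, while four active nodes activate only one, which matches the counts given in the example.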

3 The Algorithms for Solving GIM

In this section we present two algorithms for solving Group Influence Maximization (GIM): the Complementary Maximum Coverage (CMC) algorithm and the Improved Reverse Influence Sampling (IRIS) algorithm.

3.1 Complementary Maximum Coverage Algorithm

The CMC algorithm is the complement of the MC algorithm. The MC algorithm seeks k seed nodes with maximum group coverage, but it does not take the contribution of each node to its groups into account. Suppose a node covers the maximum number of groups, but those groups are large and each requires \(\beta \%\) of its members to be active. If the node does not activate other members of those groups, it contributes little to activating them. The idea of the CMC algorithm is to treat all the nodes as seed nodes, then remove the \(n-k\) seed nodes with the least influence over groups, leaving k seed nodes. The influence of a node on a group is reflected not only in whether deleting the node affects the activation of the group, but also in whether the node can activate other members of the group. We use \(f_{c}\left( v_{i}\right) \) to measure the influence of \(v_{i}\) over the groups it covers. If a node does not belong to any group, its \(f_{c}=0\). If a node covers more than one group, \(f_{c}\) equals the sum of its influence on those groups, so a node with maximum group coverage may have a larger \(f_{c}\). We define \(f_{c}\left( v_{i}\right) \) as

$$\begin{aligned} f_{c}(v_{i})=\sum _{j}\frac{a_{v_{i}}}{\left| u_{j}\right| -H_{u_{j}}+1} \end{aligned}$$
(1)

where j indexes the groups \(u_{j}\) covered by \(v_{i}\), and \(a_{v_{i}}\) is the number of members of \(u_{j}\) that \(v_{i}\) activates successfully, including \(v_{i}\) itself. All the nodes activated by \(v_{i}\) are obtained by breadth-first search (BFS), and the number of these nodes that lie in \(u_{j}\) is \(a_{v_{i}}\). Thus \(a_{v_{i}}\) measures how active \(v_{i}\) is in the group: the larger \(a_{v_{i}}\) is, the more members of the group can be activated, which increases the possibility of activating the group. \(\left| u_{j}\right| \) is the total number of members of \(u_{j}\), and \(H_{u_{j}}=\beta \%\times \left| u_{j}\right| \) is the activation threshold of \(u_{j}\). The difference \(\left| u_{j}\right| -H_{u_{j}}\) is the number of nodes that \(u_{j}\) allows to be deleted; the larger it is, the less influence \(v_{i}\) has on \(u_{j}\). Because the denominator cannot be zero, we define it as \(\left| u_{j}\right| -H_{u_{j}}+1\).
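To make Eq. (1) concrete, here is a minimal sketch of how \(f_{c}\) might be computed and used to keep k seeds. It rests on several assumptions that the text does not fix: the BFS runs over a given directed adjacency `succ` of edges treated as successful, \(a_{v_{i}}\) is counted separately for each group, and the \(n-k\) lowest-scoring nodes are dropped after a single ranking pass. All function names are hypothetical.

```python
from collections import deque

def reachable(succ, source):
    """Nodes reached from `source` by BFS, including `source` itself."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in succ.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def f_c(node, succ, groups, beta):
    """Eq. (1): sum over the groups containing `node` of a / (|u_j| - H + 1)."""
    activated = reachable(succ, node)
    score = 0.0
    for members in groups:
        if node in members:
            a = len(activated & members)   # members reached, including node
            h = beta / 100 * len(members)  # threshold H = beta% * |u_j|
            score += a / (len(members) - h + 1)
    return score

def cmc_seeds(nodes, succ, groups, beta, k):
    """Keep the k highest-scoring nodes, i.e. drop the n - k lowest (assumed)."""
    ranked = sorted(nodes, key=lambda v: f_c(v, succ, groups, beta), reverse=True)
    return ranked[:k]

# Hypothetical toy instance.
succ = {1: [2], 2: [3]}
groups = [{1, 2, 3, 4}, {1, 5}]
print(f_c(1, succ, groups, beta=50))
print(cmc_seeds([1, 2, 3, 4, 5], succ, groups, beta=50, k=2))
```

In this toy instance node 1 reaches three of the four members of the first group (3 / (4 - 2 + 1)) and only itself in the second (1 / (2 - 1 + 1)), so its score is 1.5.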

Algorithm 1. CMC algorithm (pseudocode figure)

Lemma 1

The runtime of Algorithm 1 is \(O\left( nl+n+m\right) \).

For each node numbered from 1 to n, CMC traverses the l groups to find the groups covered by that node; this step takes \(O\left( nl\right) \) time. It then computes the number of activated members within the groups covered by each node via BFS, which takes \(O\left( n+m\right) \) time. Hence, the runtime of the CMC algorithm is \(O\left( nl+n+m\right) \).

3.2 Improved Reverse Influence Sampling Algorithm

The Improved Reverse Influence Sampling (IRIS) algorithm builds on the Reverse Influence Sampling (RIS) algorithm. The RIS algorithm has two steps: generating RR (Reverse Reachable) sets and selecting seed nodes. The first step randomly selects a node v in the original graph and traverses the incoming edges of v. Each edge is kept in reversed form with probability p, or dropped with probability \(1-p\), so that a sparse reverse graph is generated. High-probability edges are therefore likely to be kept, which allows a wider range of propagation. Simply put, the RR set of node v is the set of nodes that can reach v in the sparse graph, that is, the nodes that can reach v with high probability. To take a simple example, Fig. 2a is an original social network graph with 5 nodes and 10 directed edges. Figure 2b is a sparse graph of Fig. 2a, in which 7 high-probability edges remain. The RR set of node \(v_{2}\) is \(\left\{ v_{2},v_{1},v_{4},v_{5}\right\} \), where each node has a high probability of activating \(v_{2}\).

Fig. 2. An example of RR set generation
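The first step of RIS can be sketched as a randomized reverse breadth-first search: starting from a random root, each incoming edge is kept with its probability and followed backwards. This is a minimal illustration of standard RR set sampling rather than the paper's code; the function name and the `pred` format (node mapped to a list of `(in_neighbor, probability)` pairs) are assumptions, and the toy edges are hypothetical rather than those of Fig. 2.

```python
import random
from collections import deque

def random_rr_set(nodes, pred, rng):
    """Sample one Reverse Reachable set rooted at a random node.

    pred maps a node to a list of (in_neighbor, probability) pairs
    (hypothetical input format for this sketch).
    """
    root = rng.choice(nodes)
    rr = {root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u, p in pred.get(v, []):
            # Keep the incoming edge (u, v) with probability p, walking it backwards.
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr

# Deterministic toy check: edges with probability 1 are always kept,
# edges with probability 0 are always dropped.
toy_pred = {2: [(1, 1.0), (4, 1.0)], 4: [(5, 1.0)], 1: [(3, 0.0)]}
print(random_rr_set([2], toy_pred, random.Random(0)))
```

Sampling edges lazily during the search avoids materializing the whole sparse graph, since only edges that lead back to the root matter for its RR set.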

Algorithm 2 (pseudocode figure)

The second step of the RIS algorithm selects the seed nodes that cover the maximum number of RR sets, because covering more RR sets means influencing more nodes. The k nodes that cover the most RR sets are the seed nodes RIS looks for. However, our task is to activate the maximum number of groups rather than nodes, and activating more nodes does not necessarily mean activating more groups. We therefore propose IRIS to solve GIM by changing the second step of RIS to choose the k nodes that have maximum group coverage. In this way, the selected seed nodes not only have enough propagation influence to activate more nodes, but can also activate more groups.

Fig. 3. Comparison of CMC, MC and MO under IC model for Dataset1

Lemma 2

The runtime of Algorithm 3 is \(O\left( \varGamma \left( n+m\right) +knl\right) \).

The IRIS algorithm first forms \(\varGamma \) random sparse graphs of G and randomly selects a node in each of them to generate an RR set. The runtime of this first step is \(O\left( \varGamma \left( n+m\right) \right) \). In the second step, it iterates k times, each time selecting the node that covers the most groups in the RR sets. The runtime of the second step is \(O\left( knl\right) \). Thus the runtime of the IRIS algorithm is \(O\left( \varGamma \left( n+m\right) +knl\right) \).
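The second step can be read in more than one way, so the following is only one possible interpretation, sketched under explicit assumptions: candidates are restricted to nodes that appear in at least one RR set, each of the k rounds picks the candidate belonging to the most groups not yet covered by earlier seeds, and ties are broken by how many RR sets the node appears in. The tie-breaking rule and the function name `iris_select` are not taken from the paper.

```python
from collections import Counter

def iris_select(rr_sets, groups, k):
    """Greedy pick of k seeds from RR-set nodes by group coverage (assumed rule)."""
    # Number of RR sets containing each node, used only to break ties.
    freq = Counter(v for rr in rr_sets for v in rr)
    candidates = set(freq)
    covered = set()  # indices of groups that already contain a chosen seed
    seeds = []
    for _ in range(k):
        best, best_key = None, None
        for v in sorted(candidates):
            gain = sum(1 for j, g in enumerate(groups)
                       if j not in covered and v in g)
            key = (gain, freq[v])
            if best_key is None or key > best_key:
                best, best_key = v, key
        if best is None:
            break  # fewer than k candidates are available
        seeds.append(best)
        candidates.discard(best)
        covered.update(j for j, g in enumerate(groups) if best in g)
    return seeds

# Hypothetical toy instance.
rr_sets = [{1, 2}, {2, 3}, {2, 4}]
groups = [{1, 5}, {1, 6}, {2, 7}, {3}]
print(iris_select(rr_sets, groups, k=2))
```

Each round scans up to n candidates against l groups, so k rounds cost on the order of knl operations, in line with the bound stated for the second step.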

Algorithm 3. IRIS algorithm (pseudocode figure)

4 Numerical Experiments

4.1 Experimental Setting

We performed experiments on two data sets under the Independent Cascade (IC) model: the undirected graph Dataset1 and the directed graph Dataset2. Dataset1, collected in March 2020, is a social network of users from Asian countries (e.g. Philippines, Malaysia, Singapore) [25]. Its nodes represent users of the music streaming service LastFM and its links represent friendships between them. Dataset2 consists of 9 snapshots of the Gnutella peer-to-peer file sharing network from August 2002, obtained from SNAP. Its nodes represent hosts in the Gnutella network topology and its edges represent connections between the Gnutella hosts. For the convenience of the experiments, the groups and the probability of each edge were randomly generated. Because the Improved Reverse Influence Sampling (IRIS) algorithm applies to directed graphs, all four algorithms were run on Dataset2, while only the algorithms other than IRIS were run on Dataset1. Table 1 summarizes the data sets used in our experiments.

Table 1. Datasets information

Because k and \(\beta \) affect the objective function, we varied k from 5 to 80 in steps of 5 for both Dataset1 and Dataset2. We set \(\beta \) to 10 and 20 for Dataset1, and to 5, 8, 10, 12, 15 and 18 for Dataset2. All programs were written in Python 3.7.

4.2 Experimental Results

As can be seen from Fig. 3, as k grows, the number of activated nodes increases and so does the number of activated groups. In addition, the number of activated groups decreases as \(\beta \) grows. For the undirected graph Dataset1, CMC outperforms MC and MO in general. MO performs worst because it seeks the k nodes with maximum out-degree and does not focus on group activation. CMC and MC perform similarly when \(\beta =10\): because the activation threshold is small, both algorithms can find the key nodes that activate the majority of the groups. When \(\beta =20\), the gap between the two algorithms widens: as the activation threshold increases, group activation becomes more difficult and the shortcomings of the MC algorithm are revealed.

Fig. 4. Comparison of CMC, IRIS, MC and MO under IC model for Dataset2

As for the directed graph Dataset2, CMC and IRIS perform better than MC and MO on average. In Fig. 4a, CMC performs best, and IRIS is close to MC but slightly better. In Fig. 4b, the gap between IRIS and MC widens and IRIS moves closer to CMC, because IRIS not only seeks nodes that cover more groups but also attaches importance to propagation influence.

As demonstrated in Fig. 4c, our CMC algorithm performs best among all the algorithms in almost every instance, with rare exceptions. These exceptions arise mainly because the seed nodes obtained by IRIS differ from run to run, and because the Reverse Reachable (RR) sets computed by IRIS focus on the influence of nodes outside the group, whereas CMC emphasizes the influence of nodes on members within the group.

5 Conclusion

In this paper, we proposed a heuristic algorithm called Complementary Maximum Coverage (CMC), which analyzes the influence of nodes over groups in order to maximize the number of activated groups. We also presented an algorithm called Improved Reverse Influence Sampling (IRIS), derived by adapting the well-known Reverse Influence Sampling (RIS) algorithm. Through experiments, we demonstrated that both CMC and IRIS outperform the Maximum Coverage (MC) and Maximum Out-degree (MO) algorithms in terms of the average number of activated groups under the Independent Cascade model. Further, the CMC algorithm performs better than IRIS in most cases, except when \(\beta \ge 15\), and it runs significantly faster than IRIS in all instances. This indicates that CMC is the best of the four algorithms. However, CMC has two deficiencies: its advantage is not significant when \(\beta \) is low, and it is only the third fastest algorithm, being slightly slower than MO and MC. We are currently analyzing the theoretical performance of CMC in order to provide an approximation ratio for the algorithm.